howie6879 / liuli

Build a multi-source, clean, and personalized reading environment in one stop.
https://liuli.io
Apache License 2.0
889 stars 108 forks

Time format cleaning fails when crawling WeChat official account articles #46

Closed showthesunli closed 2 years ago

showthesunli commented 2 years ago

The test script is as follows:

from src.collector.wechat_feddd.start import WeiXinSpider
WeiXinSpider.request_config = {"RETRIES": 3, "DELAY": 5, "TIMEOUT": 20}
WeiXinSpider.start_urls = ['https://mp.weixin.qq.com/s/OrCRVCZ8cGOLRf5p5avHOg']
WeiXinSpider.start()

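Cause of the error: during data cleaning, the expected format is 2022-03-21 20:59 (no seconds), but the value actually scraped is 2022-03-22 20:37:12 (with seconds), so the clean_doc_ts function raises an error, as shown in the screenshot below: image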
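To illustrate the mismatch, here is a minimal sketch of a parser that tolerates both formats. It is not the project's actual clean_doc_ts; the name `parse_doc_ts` and the format list `CANDIDATE_FORMATS` are hypothetical and exist only for this example.

```python
from datetime import datetime

# Hypothetical format list covering the two shapes seen in this issue.
# The real clean_doc_ts in liuli may be implemented differently.
CANDIDATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M")


def parse_doc_ts(raw: str) -> datetime:
    """Try each known format in turn and return the first successful parse."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            # This format did not match; fall through to the next one.
            continue
    raise ValueError(f"unrecognized timestamp format: {raw!r}")
```

Both `parse_doc_ts("2022-03-21 20:59")` and `parse_doc_ts("2022-03-22 20:37:12")` parse with this approach, whereas a single fixed format rejects one of them.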

showthesunli commented 2 years ago

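If the doc_ts extraction in wechat_itme.py is switched to the version on line 47, the article is crawled correctly, as shown in the screenshot below: image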

howie6879 commented 2 years ago

This is a bug. The time extraction will be changed to read the value directly from the page's JS script:

image
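The screenshot is not reproduced here, so the following is only a rough sketch of what extracting the time from an inline script could look like. It assumes the page embeds the publish time as a Unix timestamp in an assignment such as `var ct = "1647952632";`. The variable name `ct`, the sample value, the UTC+8 offset, and the helper names are all assumptions for illustration and are not confirmed by this thread.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: the publish time appears in inline JS as `var ct = "<digits>";`.
# The variable name `ct` is illustrative only.
CT_PATTERN = re.compile(r'var\s+ct\s*=\s*"(\d+)"')

# Assumption: timestamps are displayed in China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))


def extract_doc_ts(html: str) -> Optional[int]:
    """Return the Unix timestamp found in the inline script, or None."""
    match = CT_PATTERN.search(html)
    return int(match.group(1)) if match else None


def ts_to_datetime(ts: int) -> datetime:
    """Convert a Unix timestamp to a timezone-aware datetime in UTC+8."""
    return datetime.fromtimestamp(ts, tz=CST)
```

Reading a numeric timestamp avoids depending on how the rendered date string happens to be formatted, which is what broke the cleaning step in the first place.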

howie6879 commented 2 years ago

Fixed. Pull the updated image and restart:

docker pull liuliio/schedule:v0.2.4