Timestamp cleaning fails when crawling WeChat official account articles

howie6879 / liuli

Build a multi-source, clean and personalized reading environment in one stop.

https://liuli.io

Apache License 2.0

889 stars 108 forks

Timestamp cleaning fails when crawling WeChat official account articles #46

Closed showthesunli closed 2 years ago

showthesunli commented 2 years ago

The test script is as follows:

from src.collector.wechat_feddd.start import WeiXinSpider
WeiXinSpider.request_config = {"RETRIES": 3, "DELAY": 5, "TIMEOUT": 20}
WeiXinSpider.start_urls = ['https://mp.weixin.qq.com/s/OrCRVCZ8cGOLRf5p5avHOg']
WeiXinSpider.start()

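Cause: during data cleaning, the expected time format is 2022-03-21 20:59 (no seconds), but the value actually scraped is 2022-03-22 20:37:12 (with seconds), so the clean_doc_ts function raises an error (see screenshot).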
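To illustrate the mismatch described above, here is a minimal sketch of a parser that accepts both formats. It is not the project's clean_doc_ts; the names parse_doc_ts and CANDIDATE_FORMATS are hypothetical, and the only facts taken from this issue are the two sample strings.

```python
from datetime import datetime

# Hypothetical helper, not liuli's actual clean_doc_ts.
# The two formats below correspond to the two values quoted in this issue.
CANDIDATE_FORMATS = (
    "%Y-%m-%d %H:%M:%S",  # e.g. 2022-03-22 20:37:12
    "%Y-%m-%d %H:%M",     # e.g. 2022-03-21 20:59
)


def parse_doc_ts(value: str) -> datetime:
    """Parse a publish-time string, trying each known format in turn."""
    text = value.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            # This format did not match; try the next one.
            continue
    raise ValueError(f"unrecognized doc_ts format: {value!r}")
```

Trying a short list of formats keeps the failure explicit: a string in a third, unknown format still raises ValueError instead of being silently mis-parsed.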

showthesunli commented 2 years ago

If the doc_ts extraction in wechat_item.py is switched to the variant on line 47, the article is crawled normally (see screenshot).

howie6879 commented 2 years ago

This is a bug. Time extraction will be changed to read the value directly from the page's JS script.
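As a rough sketch of what extracting the time from an inline script could look like: the snippet below assumes the page embeds the publish time as a Unix timestamp in a JS variable named ct. That variable name, the regular expression, the sample HTML, and the function extract_publish_ts are all assumptions for illustration; the actual fix in the project may differ.

```python
import re
from typing import Optional

# ASSUMPTION: the article page contains an inline script such as
#   var ct = "1647952632";
# The variable name `ct` is hypothetical and not confirmed by this thread.
CT_PATTERN = re.compile(r'var\s+ct\s*=\s*"(\d+)"')


def extract_publish_ts(html: str) -> Optional[int]:
    """Return the Unix timestamp found in the inline script, or None."""
    match = CT_PATTERN.search(html)
    return int(match.group(1)) if match else None


# Made-up sample markup, used only to exercise the function.
sample_html = '<script>var ct = "1647952632";</script>'
publish_ts = extract_publish_ts(sample_html)
```

Reading a numeric timestamp avoids depending on how the rendered date string is formatted, which is what broke the cleaning step in the first place.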

howie6879 commented 2 years ago

Fixed. Pull the updated image and restart:

docker pull liuliio/schedule:v0.2.4