nghuyong / WeiboSpider

A continuously maintained Sina Weibo scraping tool 🚀🚀🚀
MIT License
3.66k stars · 825 forks

Repost issue #292

Closed — JunchengGuo closed this 8 months ago

JunchengGuo commented 1 year ago

The repost crawling feature no longer seems to work: each run only retrieves a few dozen results. How can this be fixed?

nghuyong commented 1 year ago

Do you have a concrete example?

Lupopro commented 1 year ago

When crawling by topic with the split-by-hour option turned off, why does the crawl stop on its own after roughly 500 records?

nghuyong commented 1 year ago

It would be best to give a concrete example.

JunchengGuo commented 1 year ago

> Do you have a concrete example?

It does work, but for posts with a large number of reposts (e.g. around 6000):

    # Replace tweet_ids with the actual IDs to be collected
    tweet_ids = ['Msz2Tjapl']

only 500-600 repost entries can be crawled. Posts with smaller repost counts also come back short; only posts with fewer than about 100 reposts are crawled completely. Keyword-based crawling works normally.
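The truncation described above can be reproduced in isolation: the paginated repostTimeline endpoint simply stops returning items after a certain page, so a crawler that stops at the first empty page finishes with far fewer items than `reposts_count` claims. A minimal sketch (the `fake_fetch` function below is a simulated stand-in for an authenticated request to the endpoint, not the project's real code):

```python
def count_retrievable(fetch_page, max_pages=1000):
    """Page through repost results until the server returns an empty page."""
    total = 0
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:  # server stopped serving pages; crawl ends here
            break
        total += len(items)
    return total

# Simulated server-side cap (assumption for illustration): the API returns
# 10 items per page but serves nothing past page 28, regardless of how
# many reposts the post actually has.
def fake_fetch(page):
    return [{"page": page, "i": i} for i in range(10)] if page <= 28 else []

retrievable = count_retrievable(fake_fetch)
print(retrievable)  # → 280, far below a claimed reposts_count of 6000
```

This matches the log below: the spider crawls 28 pages, then Scrapy closes with `finish_reason: finished` even though most reposts were never served.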

JunchengGuo commented 1 year ago

    {'_id': '4868086039316443', 'mblogid': 'MszebD5Dd', 'created_at': '2023-02-11 22:26:59', 'geo': None, 'ip_location': '发布于 广东', 'reposts_count': 0, 'comments_count': 0, 'attitudes_count': 0, 'source': 'iQOO Z1 5G性能先锋', 'content': '转发微博', 'pic_urls': [], 'pic_num': 0, 'isLongText': False, 'user': {'_id': '5495215566', 'avatar_hd': 'https://tva3.sinaimg.cn/crop.0.1.1536.1536.1024/005ZTmB8jw8f9yvbxqiuxj316o16qdht.jpg?KID=imgbed,tva&Expires=1692633933&ssig=v8xZ%2FfrpV5', 'nick_name': '点点滴滴到底点点滴滴哒哒哒', 'verified': False, 'mbrank': 8, 'mbtype': 12}, 'video': 'http://f.video.weibocdn.com/o0/2TeF3VFGlx0834xf3mkU01041200pKQf0E010.mp4?label=mp4_ld&template=640x360.25.0&ori=0&ps=1CwnkDw1GXwCQx&Expires=1692626732&ssig=%2BvDFbOCDix&KID=unistore,video', 'url': 'https://weibo.com/5495215566/MszebD5Dd', 'crawl_time': 1692623133}
    2023-08-21 21:05:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.com/ajax/statuses/repostTimeline?id=4868079034568243&page=28&moduleID=feed&count=10> (referer: https://weibo.com/ajax/statuses/repostTimeline?id=4868079034568243&page=27&moduleID=feed&count=10)
    2023-08-21 21:05:34 [scrapy.core.engine] INFO: Closing spider (finished)
    2023-08-21 21:05:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 38962, 'downloader/request_count': 28, 'downloader/request_method_count/GET': 28, 'downloader/response_bytes': 231082, 'downloader/response_count': 28, 'downloader/response_status_count/200': 28, 'elapsed_time_seconds': 33.940088, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2023, 8, 21, 13, 5, 34, 566645), 'httpcompression/response_bytes': 5113070, 'httpcompression/response_count': 27, 'item_scraped_count': 458, 'log_count/DEBUG': 487, 'log_count/INFO': 10, 'log_count/WARNING': 1, 'memusage/max': 56590336, 'memusage/startup': 56590336, 'request_depth_max': 27, 'response_received_count': 28, 'scheduler/dequeued': 28, 'scheduler/dequeued/memory': 28, 'scheduler/enqueued': 28, 'scheduler/enqueued/memory': 28, 'start_time': datetime.datetime(2023, 8, 21, 13, 5, 0, 626557)}
    2023-08-21 21:05:34 [scrapy.core.engine] INFO: Spider closed (finished)

This is the output at the end of the run.

nghuyong commented 8 months ago

For posts with very large repost counts, the full repost list cannot be seen even when browsing normally on the web. This is a limitation imposed by Weibo itself.