srx-2000 / spider_collection

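Python crawlers. Currently included: NetEase Cloud Music song scraping, Bilibili video scraping, Zhihu Q&A scraping, wallpaper scraping, xvideos video scraping, audiobook scraping, Weibo crawler, Anjuke listing scraping + data visualization, Bilibili video cover extractor, IP proxy pool wrapper, Zhihu million-user crawler + data analysis, GitHub user crawler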
MIT License
1.22k stars 221 forks

Could you fix the Zhihu crawler? #34

Closed pwh-pwh closed 10 months ago

pwh-pwh commented 1 year ago

Zhihu's algorithm has changed, and the crawler no longer works.

srx-2000 commented 1 year ago

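Okay. I don't have time at the moment. In January I'll overhaul all the crawlers in the repo in one pass, and I'll reply in this issue to let you know then.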

1901AsTrO1205 commented 11 months ago

In the function parse_hot_list_and_save(self), the code that generates the hot-list URLs has a problem:

    hot_url_list = [
        f"https://www.zhihu.com/api/v4/questions/{question_id}/answers?" \
        f"include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2" \
        f"Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2" \
        f"Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2" \
        f"Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2" \
        f"Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3" \
        f"Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D." \
        f"settings.table_of_content.enabled&limit={self.__LIMIT}&offset=0&platform=desktop&sort_by=default"
        for question_id in hot_list]

The links this generates now fail with a request parameter error, code 10003. What is the principle behind how this URL is generated? Could you fix it?

srx-2000 commented 11 months ago

> In the function parse_hot_list_and_save(self), the code that generates the hot-list URLs has a problem:
>
>     hot_url_list = [
>         f"https://www.zhihu.com/api/v4/questions/{question_id}/answers?"
>         f"include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2"
>         f"Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2"
>         f"Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2"
>         f"Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2"
>         f"Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3"
>         f"Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D."
>         f"settings.table_of_content.enabled&limit={self.__LIMIT}&offset=0&platform=desktop&sort_by=default"
>         for question_id in hot_list]
>
> The links this generates now fail with a request parameter error, code 10003. What is the principle behind how this URL is generated? Could you fix it?

I have uploaded a new encryption file; see here: https://github.com/srx-2000/spider_collection/blob/a9fdcc9ed6b06d1bfc5ba3685d243cdaa5842b11/zhihuEncrypt/zhihu_encrypt.py
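On the question of how the URL is put together: the long `include=` value in the snippet is a percent-encoded field list (`%5B%2A%5D` is `[*]`, `%2C` is `,`, `%3B` is `;`). The following is a minimal sketch of building a URL of the same shape with the standard library. The helper name `build_answers_url`, the shortened `INCLUDE` list, the default `limit=5`, and the question id are illustrative assumptions, not code from the repository. It only shows the URL construction and does not address the 10003 error or the logic in `zhihu_encrypt.py`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Shortened field list for illustration only; the URL quoted in this thread
# requests many more fields in the same "a,b,c;d,e" layout.
INCLUDE = "data[*].is_normal,content,voteup_count;data[*].author.follower_count"


def build_answers_url(question_id, limit=5, offset=0):
    """Hypothetical helper: build an answers URL shaped like the one above."""
    query = urlencode({
        "include": INCLUDE,      # urlencode percent-encodes [ * ] , and ;
        "limit": limit,          # assumed value; the original uses self.__LIMIT
        "offset": offset,
        "platform": "desktop",
        "sort_by": "default",
    })
    return f"https://www.zhihu.com/api/v4/questions/{question_id}/answers?{query}"


url = build_answers_url(123456)  # placeholder question id
print(url)

# Decoding the query string recovers the readable field list.
print(parse_qs(urlsplit(url).query)["include"][0])
```

Keeping the field list in readable form and letting `urlencode` do the escaping makes it easier to add or remove fields than editing the pre-encoded string by hand.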

srx-2000 commented 11 months ago

> Zhihu's algorithm has changed, and the crawler no longer works.

I'm not sure whether you still need this, but I have updated the code with the new algorithm. For details, see: zhihuEncrypt/zhihu_encrypt.py