niuniuJQKKK / zhihu_crawler

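This program supports crawling keyword search results, the hot list, user profiles, answers, column articles, comments, and other information.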
26 stars 9 forks source link

Unable to crawl a user's answers #1

Open funny-cat-happy opened 2 years ago

funny-cat-happy commented 2 years ago

Problem

Following your code, I cannot fetch a user's answers; I only get the user's basic information. Just as you do, I run only user_crawler in the run file:

2022-04-24 10:10:53.824 | WARNING | zhihu_crawler.extractors:extract_data:526 - method: extract_user return: {'user_id': '4a69baf4e0a552d2047fabcc4501a0bb', 'user_name': '我们的太空', 'user_url_token': 'wo-men-de-tai-kong', 'user_head_img': 'https://pic2.zhimg.com/v2-af532a0c65340c09a4549e1e8194e050_l.jpg?source=32738c0c', 'user_is_org': True, 'user_headline': '太空不再高冷 知乎走近你我', 'user_type': 'people', 'user_is_active': True, 'user_description': '既然选择了太空 便只顾风雨兼程', 'user_is_advertiser': False, 'user_is_vip': False, 'user_badges': ['已认证账号', '优秀回答者'], 'user_follower_count': 1907822, 'user_following_count': 170, 'user_answer_count': 247, 'user_question_count': 82, 'user_articles_count': 2798, 'user_columns_count': 4, 'user_zvideo_count': 1585, 'user_pins_count': 1368, 'user_favorite_count': 1, 'user_favorited_count': 63434, 'user_reactions_count': 79890, 'user_shared_count': 0, 'user_voteup_count': 342047, 'user_thanked_count': 60174, 'user_following_columns_count': 1, 'user_following_topic_count': 14, 'user_following_question_count': 271, 'user_following_favlists_count': 0, 'user_participated_live_count': 1, 'user_included_answers_count': 36, 'user_included_articles_count': 33, 'user_recognized_count': 22, 'user_cover_url': 'https://pica.zhimg.com/v2-bbb942fe238dd540204fff9ce849cd2a_r.jpg?source=32738c0c', 'user_org_name': '我们的太空\n123847892739487123', 'user_org_industry': '党群政府-党群政府', 'user_org_url': '', 'user_org_lic_code': '123847892739487123'}

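Question about parameters

I wrote my own program based on your code, but Zhihu keeps returning a parameter error. Could you take a look? The encrypt file is unchanged.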

import hashlib
import os
import requests
import execjs
from encrypt import encrypt
import re

payload = {
    'include': 'data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action'
               '%2Cannotation_detail%2Ccollapse_reason%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment'
               '%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission'
               '%2Cmark_infos%2Ccreated_time%2Cupdated_time%2Creview_info%2Cexcerpt%2Cis_labeled%2Clabel_info'
               '%2Crelationship.is_authorized%2Cvoting%2Cis_author%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata'
               '%5B*%5D.vessay_info%3Bdata%5B*%5D.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B'
               '*%5D.author.vip_info%3Bdata%5B*%5D.question.has_publishing_draft%2Crelationship',
    'offset': 0,
    'limit': 20,
    'sort_by': 'created'
}
proxies = {'http': 'http://localhost:8888', 'https':'http://localhost:8888'}

def get_headers(url):
    X_ZSE_93="101_3_2.0",
    sign, cookies = encrypt(X_ZSE_93, ''.join(re.sub(r'.*zhihu\.com', '', url)))
    headers = {
        'cookie': f'd_c0={cookies}',
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36",
        'x-zse-93': "101_3_2.0",
        'x-zse-96': sign,
    }
    return headers
response = requests.get(url='https://www.zhihu.com/api/v4/members/xiao-jie-jie-3-19/answers', params=payload,
                        headers=get_headers('https://www.zhihu.com/api/v4/members/xiao-jie-jie-3-19/answers'),proxies=proxies,verify=False)
niuniuJQKKK commented 2 years ago

To fetch answers, you need to set answer_count, for example:

for info in user_crawler('wo-men-de-tai-kong', answer_count=50):
    # The list of answers is available via info['answers'].
    answers = info['answers']

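Parameter issue: Zhihu's signing requires the complete request, with every query parameter included. For details, see the commonly used request URLs in constant.py.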
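As a rough illustration of that point, here is a minimal sketch of building the string to sign from the path plus the full query string, so that the signed text and the URL actually sent are identical. `build_sign_target` is a hypothetical helper, not part of this repository, and the exact encoding Zhihu expects (for example leaving `*` unescaped) is an assumption taken from the URL-encoded `include` value in the question. The snippet in the question signs only the path, and it passes an already percent-encoded `include` through `params=`, which `requests` encodes a second time (`%` becomes `%25`). Its `X_ZSE_93="101_3_2.0",` line also ends in a comma, which makes the value a one-element tuple rather than a string.

```python
from urllib.parse import unquote, urlencode, urlsplit


def build_sign_target(url, params):
    """Return the path plus the full query string of a request.

    Hypothetical helper: the result is meant to be both signed and sent,
    so the two can never drift apart.
    """
    # Decode values that are already percent-encoded so that urlencode
    # encodes each of them exactly once.
    decoded = {key: unquote(str(value)) for key, value in params.items()}
    # safe='*' keeps '*' literal, matching the encoded string in the question
    # (assumption: this is the form the server expects).
    query = urlencode(decoded, safe='*')
    path = urlsplit(url).path
    return f"{path}?{query}" if query else path


# Shortened 'include' value, for illustration only.
params = {
    'include': 'data%5B*%5D.is_normal%2Ccomment_count',
    'offset': 0,
    'limit': 20,
    'sort_by': 'created',
}
target = build_sign_target(
    'https://www.zhihu.com/api/v4/members/xiao-jie-jie-3-19/answers', params
)
full_url = 'https://www.zhihu.com' + target
print(target)
```

The resulting `target` would then be passed to the signing function in place of the bare path, for example `sign, cookies = encrypt("101_3_2.0", target)`, where the call shape is copied from the snippet in the question and not verified against the repository. The request itself would go to `full_url` without a separate `params=` argument, so the query is not encoded again.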

funny-cat-happy commented 2 years ago

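Thank you so much. My request URL was indeed the problem. One more question: why does my packet capture show https://www.zhihu.com/api/v4/members/{user_id}/answers while yours is https://api.zhihu.com/members/{user_id}/answers? I have never come across that request. How did you find it?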