srx-2000 / spider_collection

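Python spiders. Currently included: NetEase Cloud Music song scraper, Bilibili video scraper, Zhihu Q&A scraper, wallpaper scraper, xvideos video scraper, audiobook scraper, Weibo spider, Anjuke listing scraper with data visualization, Bilibili video cover extractor, IP proxy pool wrapper, Zhihu million-user spider with data analysis, GitHub user spider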
MIT License
1.22k stars 221 forks

Can the API in zhihuAnswerSpider no longer be called? It seems to return code 10003 #17

Closed lyx-v closed 2 years ago

srx-2000 commented 2 years ago

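Right. Zhihu has now added parameters such as 81 and 93 to the headers of the endpoint that fetches the answers to a question, so both the Zhihu answer spider and the user spider are unusable for the time being. I am currently trying to work out those parameters through JS reverse engineering. If I make progress, I will let you know here.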

hacklu commented 2 years ago

mark

srx-2000 commented 2 years ago

This has now been updated. The main work was cracking the 93 and 96 parameters; the 81 parameter can be bypassed through the endpoint.

zhangchievil commented 2 years ago

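When I run it locally, the ctx1.call invocation of the JS keeps raising "execjs._exceptions.ProgramError: SyntaxError: 语法错误" (the message means "syntax error"). Do I need to copy the entire main.app.xxx.js file over?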

srx-2000 commented 2 years ago

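@zhangchievil You should not need to put the complete main.app.xxx.js file into the project, because the code involved in cracking the parameters has already been extracted into /spider/g_encrypt.js. If possible, could you post your full error output here?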

zhangchievil commented 2 years ago

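Please choose a mode: 1. Crawl a single question 2. Crawl related questions
1
Please enter the id of the question to crawl, or the id of the starting question for related questions: 487602698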

Traceback (most recent call last):
  File "D:/workspace/spider_collection/zhihuAnswerSpider/spider/zhihu_answer.py", line 221, in <module>
    zhihu.single_answer(id)
  File "D:/workspace/spider_collection/zhihuAnswerSpider/spider/zhihu_answer.py", line 141, in single_answer
    question_title = self.get_question_title(question_id)
  File "D:/workspace/spider_collection/zhihuAnswerSpider/spider/zhihu_answer.py", line 134, in get_question_title
    response = self.proxy_pool.get(url, headers=self.get_headers(url), anonymity=False)
  File "D:/workspace/spider_collection/zhihuAnswerSpider/spider/zhihu_answer.py", line 46, in get_headers
    encryptstr = "2.0%s" % ctx1.call('b', fmd5)
  File "C:\Users\abc\AppData\Local\Continuum\anaconda3\envs\robot\lib\site-packages\execjs\_abstract_runtime_context.py", line 37, in call
    return self._call(name, *args)
  File "C:\Users\abc\AppData\Local\Continuum\anaconda3\envs\robot\lib\site-packages\execjs\_external_runtime.py", line 92, in _call
    return self._eval("{identifier}.apply(this, {args})".format(identifier=identifier, args=args))
  File "C:\Users\abc\AppData\Local\Continuum\anaconda3\envs\robot\lib\site-packages\execjs\_external_runtime.py", line 78, in _eval
    return self.exec_(code)
  File "C:\Users\abc\AppData\Local\Continuum\anaconda3\envs\robot\lib\site-packages\execjs\_abstract_runtime_context.py", line 18, in exec_
    return self._exec_(source)
  File "C:\Users\abc\AppData\Local\Continuum\anaconda3\envs\robot\lib\site-packages\execjs\_external_runtime.py", line 88, in _exec_
    return self._extract_result(output)
  File "C:\Users\abc\AppData\Local\Continuum\anaconda3\envs\robot\lib\site-packages\execjs\_external_runtime.py", line 167, in _extract_result
    raise ProgramError(value)
execjs._exceptions.ProgramError: SyntaxError: 语法错误

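The information is very limited, so I did compare the two files: the contents of g_encrypt.js match main.app.xxx.js. I have not checked whether my own environment is at fault, because I have no idea where to start. Other endpoints such as comments still work, though, so I am considering falling back to selenium for now.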

srx-2000 commented 2 years ago

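Hmm... I tried it on another computer. As long as the nodejs environment and the execjs environment are configured correctly, it runs, so it may really be a problem with your environment.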
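One way to narrow down an environment problem like this is to check which JavaScript runtime PyExecJS actually selected. The sketch below is only a diagnostic aid, not part of the project. It assumes the `execjs` package is PyExecJS, whose `execjs.get().name` reports the chosen runtime, and it rests on the unconfirmed guess that a localized "SyntaxError" comes from a fallback runtime such as Windows JScript rather than Node.js. The helper names `diagnose_js_runtime` and `current_runtime_name` are hypothetical.

```python
import shutil


def diagnose_js_runtime(runtime_name, node_path):
    """Return a short hint from the PyExecJS runtime name and the path to node (or None)."""
    if node_path is None:
        return "node is not on PATH; install Node.js or fix PATH"
    if runtime_name is None:
        return "PyExecJS is missing or found no usable runtime"
    if "node" not in runtime_name.lower():
        return "PyExecJS picked %s instead of Node.js" % runtime_name
    return "ok"


def current_runtime_name():
    """Name of the runtime PyExecJS would use, or None if it cannot be determined."""
    try:
        import execjs  # PyExecJS; imported lazily so this file loads without it
        return execjs.get().name
    except Exception:
        # Covers a missing package as well as "no runtime available" errors.
        return None


if __name__ == "__main__":
    print(diagnose_js_runtime(current_runtime_name(), shutil.which("node")))
```

If the hint names a runtime other than Node.js, installing Node.js and making sure `node` is on PATH before rerunning the spider is a reasonable first thing to try.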

john20000625 commented 2 years ago

A question, please: when running zhihu_answer.py, line 46, encryptstr = "2.0%s" % ctx1.call('b', fmd5), I hit the following error (the Windows message means "The directory name is invalid"):

Exception has occurred: NotADirectoryError
[WinError 267] 目录名称无效。
  File "C:\Users\12610\Desktop\zhihuAnswerSpider\spider\zhihu_answer.py", line 175, in get_headers
    encryptstr = "2.0%s" % ctx1.call('b', fmd5)
  File "C:\Users\12610\Desktop\zhihuAnswerSpider\spider\zhihu_answer.py", line 266, in get_question_title
    url, headers=self.get_headers(url), anonymity=False)
  File "C:\Users\12610\Desktop\zhihuAnswerSpider\spider\zhihu_answer.py", line 273, in single_answer
    question_title = self.get_question_title(question_id)
  File "C:\Users\12610\Desktop\zhihuAnswerSpider\spider\zhihu_answer.py", line 356, in <module>
    zhihu.single_answer(id)

Is there any way to fix this? I have not installed the execjs environment mentioned earlier in the discussion. The mode I selected was 1, and the question id was 401974073. Thanks for your trouble!

john20000625 commented 2 years ago

But everything in requirement.txt has been installed.

srx-2000 commented 2 years ago

First, make sure nodejs is installed on your computer. Looking at the bug, it is probably a directory-structure problem. You can try replacing

https://github.com/srx-2000/spider_collection/blob/2a58ec2cfeee8a582809c51c2ddb385ffb5a41d6/zhihuAnswerSpider/spider/zhihu_answer.py#L150-L151

with:

with open(os.path.dirname(os.path.dirname(__file__)) + os.sep + "result" + os.sep + question_title + ".txt",mode="w",encoding='utf-8') as f:

and replacing

https://github.com/srx-2000/spider_collection/blob/2a58ec2cfeee8a582809c51c2ddb385ffb5a41d6/zhihuAnswerSpider/spider/zhihu_answer.py#L45

with:

ctx1 = execjs.compile(f.read(), cwd=os.path.dirname(os.path.dirname(__file__)) + os.sep + 'node_modules')

Give that a try; it should solve the problem.
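The `cwd` argument probably matters because PyExecJS starts the JS runtime as a subprocess, and a working directory that does not exist is a likely source of NotADirectoryError on Windows. The sketch below assumes the layout implied by the suggested fix (the script lives in `spider/`, one level below a project root that contains `node_modules`). It builds that path from an absolute script path and checks it before it is used. The helper names `node_modules_dir` and `require_directory` are hypothetical.

```python
import os


def node_modules_dir(script_file):
    """Absolute path of <project root>/node_modules for a script in <project root>/spider/."""
    project_root = os.path.dirname(os.path.dirname(os.path.abspath(script_file)))
    return os.path.join(project_root, "node_modules")


def require_directory(path):
    """Return path unchanged, or fail early with a readable message if it is not a directory."""
    if not os.path.isdir(path):
        raise NotADirectoryError("expected an existing directory: %s" % path)
    return path


# Intended use inside zhihu_answer.py (assumes PyExecJS is installed):
#   ctx1 = execjs.compile(f.read(), cwd=require_directory(node_modules_dir(__file__)))
```

Failing early with the full path in the message makes it easier to see whether `node_modules` is missing or simply sits somewhere else.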

john20000625 commented 2 years ago

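Thank you! With the output-directory change you suggested, the script runs, but after it finished I could not find where the output went :/ (Everything could not find it either). I then replaced it with an absolute path and it was OK. This new version is quite a bit faster than the previous one, really impressive! Finally, good luck with your postgraduate entrance exam results!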
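One possible explanation for the missing output file, offered as an assumption that the thread does not confirm: on Python versions before 3.9, `__file__` can be a bare relative name when the script is started from its own directory, so the two `os.path.dirname` calls collapse to an empty string and the path begins at the root of the current drive. The sketch below compares that construction with one based on `os.path.abspath`. Both function names are hypothetical.

```python
import os


def naive_result_path(script_file, question_title):
    """Same construction as in the thread, without os.path.abspath."""
    return (os.path.dirname(os.path.dirname(script_file)) + os.sep + "result"
            + os.sep + question_title + ".txt")


def result_path(script_file, question_title):
    """Absolute path of result/<question_title>.txt under the project root."""
    project_root = os.path.dirname(os.path.dirname(os.path.abspath(script_file)))
    return os.path.join(project_root, "result", question_title + ".txt")


if __name__ == "__main__":
    # With a relative script name the naive path starts at the drive root.
    print(naive_result_path("zhihu_answer.py", "demo"))
    print(result_path("zhihu_answer.py", "demo"))
```

Printing the resolved path once before writing is a cheap way to see where the results will land.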

srx-2000 commented 2 years ago

Mm-hm, thanks for the kind words.