oldshensheep / v2ex_scrapy

scrapy for v2ex.com
https://www.v2ex.com/t/954480
MIT License
259 stars, 55 forks

Far too easy to get a 403 #3

Closed: xinmans closed this issue 1 year ago

xinmans commented 1 year ago

2023-07-06 14:50:53 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.v2ex.com/t/326> (referer: None)
2023-07-06 14:50:53 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.v2ex.com/t/326>: HTTP status code is not handled or not allowed
2023-07-06 14:50:53 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.v2ex.com/t/327> (referer: None)
2023-07-06 14:50:54 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.v2ex.com/t/327>: HTTP status code is not handled or not allowed
2023-07-06 14:50:54 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.v2ex.com/t/328> (referer: None)
2023-07-06 14:50:54 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.v2ex.com/t/328>: HTTP status code is not handled or not allowed
2023-07-06 14:50:56 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.v2ex.com/t/329> (referer: None)
2023-07-06 14:50:56 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.v2ex.com/t/329>: HTTP status code is not handled or not allowed

I suggest adding some logic such as spoofing the User-Agent.

oldshensheep commented 1 year ago

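The User-Agent is already spoofed. You could switch to a different proxy IP. I route the crawler through a Cloudflare Warp proxy and almost never get a 403.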
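For anyone wondering how routing requests through a proxy might look in Scrapy, here is a minimal sketch. It relies on Scrapy honoring `request.meta["proxy"]`; the class name `FixedProxyMiddleware` and the address in `PROXY_URL` are hypothetical, and the thread does not say how the Warp proxy is actually exposed or configured in this repo.

```python
# Hypothetical local proxy endpoint; replace with wherever your proxy listens.
PROXY_URL = "http://127.0.0.1:8080"


class FixedProxyMiddleware:
    """Downloader middleware sketch: send every request through one proxy."""

    def __init__(self, proxy_url=PROXY_URL):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Scrapy reads the proxy to use from request.meta["proxy"].
        # setdefault keeps any proxy a spider already set on the request.
        request.meta.setdefault("proxy", self.proxy_url)
        # Returning None lets Scrapy continue processing the request normally.
        return None
```

To try it, the class would be registered under `DOWNLOADER_MIDDLEWARES` in `settings.py` with a priority below 750, so it runs before Scrapy's built-in `HttpProxyMiddleware`.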

xinmans commented 1 year ago

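> The User-Agent is already spoofed. You could switch to a different proxy IP. I route the crawler through a Cloudflare Warp proxy and almost never get a 403.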

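Yours only spoofs a single User-Agent, which still gets blocked easily. How much does it cost to crawl the entire V2EX site through a Cloudflare Warp proxy?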

oldshensheep commented 1 year ago

A random User-Agent is worth considering, and it is fairly simple to implement.

Cloudflare Warp is free, but you cannot connect to it directly. The cost is zero.

Done: https://github.com/oldshensheep/v2ex_scrapy/commit/7f1a6f1820fa199f3cc0c694e1eb3c4959c9d862
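To illustrate the random User-Agent idea discussed above, here is a minimal sketch of a Scrapy downloader middleware. It is not the code from the linked commit. The class name `RandomUserAgentMiddleware` and the strings in `USER_AGENTS` are assumptions chosen for illustration.

```python
import random

# Hypothetical pool of browser User-Agent strings; extend or replace as needed.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: pick a random User-Agent per request."""

    def __init__(self, user_agents=USER_AGENTS, rng=None):
        self.user_agents = list(user_agents)
        # An injectable random source makes the behavior easy to test.
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        # Overwrite the header on every outgoing request.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # Returning None lets Scrapy continue processing the request normally.
        return None
```

Registering it in `DOWNLOADER_MIDDLEWARES` with a priority below 500 would make it run before Scrapy's built-in `UserAgentMiddleware`, which only fills in the header when it is missing. Rotating the header alone may not prevent 403s if the site blocks by IP, which is why the proxy suggestion earlier in the thread still applies.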