scrapinghub / portia

Visual scraping for Scrapy
BSD 3-Clause "New" or "Revised" License

About 403 #813

Open Android-sunshine opened 7 years ago

Android-sunshine commented 7 years ago

LOG (earlier output omitted):

INFO:scrapy.core.engine:Spider opened
2017-08-01 09:31:32 [scrapy] INFO: Spider opened
INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 09:31:32 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG:scrapy.extensions.telnet:Telnet console listening on 127.0.0.1:6023
2017-08-01 09:31:32 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
DEBUG:scrapy.core.engine:Crawled (403) <GET https://www.douban.com/photos/album/79005353/> (referer: None)
2017-08-01 09:31:32 [scrapy] DEBUG: Crawled (403) <GET https://www.douban.com/photos/album/79005353/> (referer: None)
DEBUG:scrapy.spidermiddlewares.httperror:Ignoring response <403 https://www.douban.com/photos/album/79005353/>: HTTP status code is not handled or not allowed
2017-08-01 09:31:32 [scrapy] DEBUG: Ignoring response <403 https://www.douban.com/photos/album/79005353/>: HTTP status code is not handled or not allowed
INFO:scrapy.core.engine:Closing spider (finished)
2017-08-01 09:31:32 [scrapy] INFO: Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats: {'downloader/request_bytes': 234, 'downloader/request_count': 1, 'downloader/request_method_count/GET': 1, 'downloader/response_bytes': 248, 'downloader/response_count': 1, 'downloader/response_status_count/403': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 8, 1, 9, 31, 32, 634481), 'log_count/DEBUG': 3, 'log_count/INFO': 7, 'log_count/WARNING': 1, 'response_received_count': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'start_time': datetime.datetime(2017, 8, 1, 9, 31, 32, 488088)}
2017-08-01 09:31:32 [scrapy] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 234, 'downloader/request_count': 1, 'downloader/request_method_count/GET': 1, 'downloader/response_bytes': 248, 'downloader/response_count': 1, 'downloader/response_status_count/403': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 8, 1, 9, 31, 32, 634481), 'log_count/DEBUG': 3, 'log_count/INFO': 7, 'log_count/WARNING': 1, 'response_received_count': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'start_time': datetime.datetime(2017, 8, 1, 9, 31, 32, 488088)}
INFO:scrapy.core.engine:Spider closed (finished)
2017-08-01 09:31:32 [scrapy] INFO: Spider closed (finished)

(later output omitted)

I think this error is caused by the User-Agent (UA) not being set, but I don't know where to set the UA.

Forgive me, I'm just an amateur.
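For readers with the same question: in a plain Scrapy project, the User-Agent is normally controlled through the `USER_AGENT` setting, and extra headers through `DEFAULT_REQUEST_HEADERS`. The fragment below is a minimal sketch of a `settings.py`, assuming the Portia project is run or exported as a regular Scrapy project that reads such a file. The UA string and header values are illustrative placeholders, and nothing in this thread confirms that setting them is enough to get past the 403 from this site.

```python
# settings.py (sketch): standard Scrapy settings, values are placeholders.

# Sent as the User-Agent header on every request unless a request overrides it.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/60.0 Safari/537.36"
)

# Extra headers merged into each request; some sites reject requests
# that lack browser-like Accept headers.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}
```

The same keys can also be placed in a spider's `custom_settings` dictionary if only one spider should use them.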

ramiroluz commented 3 years ago

I am in the same situation with another site. It looks like we need to authenticate: access an index or session URL to get a token, and I believe we then need to pass this token along with the following requests. Not sure yet.
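The token idea above could be sketched roughly as follows. This is only an illustration of the guess in the comment, not a confirmed fix: the field name `csrf_token`, the header name `X-CSRF-Token`, and the sample HTML are all hypothetical, and a real site may deliver its token in a cookie or a script instead.

```python
from html.parser import HTMLParser
from typing import Dict, Optional


class TokenFinder(HTMLParser):
    """Records the value of the first <input> whose name matches field_name."""

    def __init__(self, field_name: str) -> None:
        super().__init__()
        self.field_name = field_name
        self.token: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == self.field_name:
            self.token = attributes.get("value")


def extract_token(html: str, field_name: str = "csrf_token") -> Optional[str]:
    """Return the token found in the index page HTML, or None if absent."""
    finder = TokenFinder(field_name)
    finder.feed(html)
    return finder.token


def followup_headers(token: str, user_agent: str) -> Dict[str, str]:
    """Build headers for later requests (header name is an assumption)."""
    return {"User-Agent": user_agent, "X-CSRF-Token": token}


# Hypothetical index page body, standing in for the first response.
sample_index_page = (
    '<form><input type="hidden" name="csrf_token" value="abc123"></form>'
)
token = extract_token(sample_index_page)
headers = followup_headers(token, "Mozilla/5.0 (compatible; example-bot)")
```

In a Scrapy spider, the index page would be fetched first, and a dictionary like `headers` could then be passed through the `headers` argument of `scrapy.Request` for the pages that follow.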