Open JohnMTrimbleIII opened 6 years ago
Ok I've noticed the log on my computer looks like a timeout error now.
2018-07-01 11:46:28 [manager] DEBUG: PAGE_REQUEST_ERROR url=http://127.0.0.1:8050/render.html error=TimeoutError
2018-07-01 11:46:28 [manager.components] DEBUG: processing 'request_error' 'Middleware.UrlFingerprintMiddleware' <Request at 0x110da5710 http://127.0.0.1:8050/render.html meta={b'scrapy_callback': None, b'scrapy_errback': None, b'scrapy_meta': {'depth': 6, 'splash': {'endpoint': 'render.html', 'slot_policy': 'per_domain', 'magic_response': True, 'http_status_from_error_code': True, 'args': {'url': 'https://www.amazon.com/b/ref=s9_acss_bw_cg_LH18AES_1a1_w?node=17728828011&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-12&pf_rd_r=H90K0CDC94PWSA576F2W&pf_rd_t=101&pf_rd_p=39785ce6-6f12-4c9e-8259-0566bbb7447e&pf_rd_i=17728536011', 'wait': 0.5, 'headers': {'Referer': 'https://www.amazon.com/b/ref=nav_shopall_SWMTVT18?_encoding=UTF8&node=17728536011&pf_rd_p=bd34c6e7-aef4-4799-853a-d3b52e17360a&pf_rd_s=nav-sa-toys-kids-baby&pf_rd_t=4201&pf_rd_i=navbar-4201&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=CSYHPA14129HPJS57ZBA', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1'}}}, 'ajax_crawlable': True, 'proxy': 'https://83.149.70.159:13012', 'download_timeout': 180.0, '_splash_processed': True, 'download_slot': 'www.amazon.com'}, b'origin_is_frontier': True, b'fingerprint': b'de4c43938b44c012e828a5bd3d0f0d856a24c9b7', b'domain': {b'netloc': b'www.amazon.com', b'name': b'www.amazon.com', b'scheme': b'https', b'sld': b'', b'tld': b'', b'subdomain': b'', b'fingerprint': b'47cf2c8c9947e99f6ecfee54a51bd0225f376ebc'}, b'state': 1, b'score': 0.15625, b'jid': 0} body=b'{\n "headers": {\n '... cookies={}, headers={b'Content-Type': [b'application/json'], b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}>
2018-07-01 11:46:28 [manager.components] DEBUG: processing 'request_error' 'Middleware.DomainMiddleware' <Request at 0x110da5710 http://127.0.0.1:8050/render.html meta={b'scrapy_callback': None, b'scrapy_errback': None, b'scrapy_meta': {'depth': 6, 'splash': {'endpoint': 'render.html', 'slot_policy': 'per_domain', 'magic_response': True, 'http_status_from_error_code': True, 'args': {'url': 'https://www.amazon.com/b/ref=s9_acss_bw_cg_LH18AES_1a1_w?node=17728828011&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-12&pf_rd_r=H90K0CDC94PWSA576F2W&pf_rd_t=101&pf_rd_p=39785ce6-6f12-4c9e-8259-0566bbb7447e&pf_rd_i=17728536011', 'wait': 0.5, 'headers': {'Referer': 'https://www.amazon.com/b/ref=nav_shopall_SWMTVT18?_encoding=UTF8&node=17728536011&pf_rd_p=bd34c6e7-aef4-4799-853a-d3b52e17360a&pf_rd_s=nav-sa-toys-kids-baby&pf_rd_t=4201&pf_rd_i=navbar-4201&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=CSYHPA14129HPJS57ZBA', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1'}}}, 'ajax_crawlable': True, 'proxy': 'https://83.149.70.159:13012', 'download_timeout': 180.0, '_splash_processed': True, 'download_slot': 'www.amazon.com'}, b'origin_is_frontier': True, b'fingerprint': b'd7b7e7a88bba0c4966f742b12dfb5dede8353f80', b'domain': {b'netloc': b'www.amazon.com', b'name': b'www.amazon.com', b'scheme': b'https', b'sld': b'', b'tld': b'', b'subdomain': b'', b'fingerprint': b'47cf2c8c9947e99f6ecfee54a51bd0225f376ebc'}, b'state': 1, b'score': 0.15625, b'jid': 0} body=b'{\n "headers": {\n '... cookies={}, headers={b'Content-Type': [b'application/json'], b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}>
2018-07-01 11:46:28 [manager.components] DEBUG: processing 'request_error' 'Middleware.DomainFingerprintMiddleware' <Request at 0x110da5710 http://127.0.0.1:8050/render.html meta={b'scrapy_callback': None, b'scrapy_errback': None, b'scrapy_meta': {'depth': 6, 'splash': {'endpoint': 'render.html', 'slot_policy': 'per_domain', 'magic_response': True, 'http_status_from_error_code': True, 'args': {'url': 'https://www.amazon.com/b/ref=s9_acss_bw_cg_LH18AES_1a1_w?node=17728828011&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-12&pf_rd_r=H90K0CDC94PWSA576F2W&pf_rd_t=101&pf_rd_p=39785ce6-6f12-4c9e-8259-0566bbb7447e&pf_rd_i=17728536011', 'wait': 0.5, 'headers': {'Referer': 'https://www.amazon.com/b/ref=nav_shopall_SWMTVT18?_encoding=UTF8&node=17728536011&pf_rd_p=bd34c6e7-aef4-4799-853a-d3b52e17360a&pf_rd_s=nav-sa-toys-kids-baby&pf_rd_t=4201&pf_rd_i=navbar-4201&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=CSYHPA14129HPJS57ZBA', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1'}}}, 'ajax_crawlable': True, 'proxy': 'https://83.149.70.159:13012', 'download_timeout': 180.0, '_splash_processed': True, 'download_slot': 'www.amazon.com'}, b'origin_is_frontier': True, b'fingerprint': b'd7b7e7a88bba0c4966f742b12dfb5dede8353f80', b'domain': {b'netloc': b'127.0.0.1:8050', b'name': b'127.0.0.1', b'scheme': b'http', b'sld': b'', b'tld': b'', b'subdomain': b''}, b'state': 1, b'score': 0.15625, b'jid': 0} body=b'{\n "headers": {\n '... cookies={}, headers={b'Content-Type': [b'application/json'], b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}>
2018-07-01 11:46:28 [manager.components] DEBUG: processing 'request_error' 'CanonicalSolver.BasicCanonicalSolver' <Request at 0x110da5710 http://127.0.0.1:8050/render.html meta={b'scrapy_callback': None, b'scrapy_errback': None, b'scrapy_meta': {'depth': 6, 'splash': {'endpoint': 'render.html', 'slot_policy': 'per_domain', 'magic_response': True, 'http_status_from_error_code': True, 'args': {'url': 'https://www.amazon.com/b/ref=s9_acss_bw_cg_LH18AES_1a1_w?node=17728828011&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-12&pf_rd_r=H90K0CDC94PWSA576F2W&pf_rd_t=101&pf_rd_p=39785ce6-6f12-4c9e-8259-0566bbb7447e&pf_rd_i=17728536011', 'wait': 0.5, 'headers': {'Referer': 'https://www.amazon.com/b/ref=nav_shopall_SWMTVT18?_encoding=UTF8&node=17728536011&pf_rd_p=bd34c6e7-aef4-4799-853a-d3b52e17360a&pf_rd_s=nav-sa-toys-kids-baby&pf_rd_t=4201&pf_rd_i=navbar-4201&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=CSYHPA14129HPJS57ZBA', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1'}}}, 'ajax_crawlable': True, 'proxy': 'https://83.149.70.159:13012', 'download_timeout': 180.0, '_splash_processed': True, 'download_slot': 'www.amazon.com'}, b'origin_is_frontier': True, b'fingerprint': b'd7b7e7a88bba0c4966f742b12dfb5dede8353f80', b'domain': {b'netloc': b'127.0.0.1:8050', b'name': b'127.0.0.1', b'scheme': b'http', b'sld': b'', b'tld': b'', b'subdomain': b'', b'fingerprint': b'4b84b15bff6ee5796152495a230e45e3d7e947d9'}, b'state': 1, b'score': 0.15625, b'jid': 0} body=b'{\n "headers": {\n '... cookies={}, headers={b'Content-Type': [b'application/json'], b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}>
2018-07-01 11:46:28 [manager.components] DEBUG: processing 'request_error' 'Backend.MessageBusBackend' <Request at 0x110da5710 http://127.0.0.1:8050/render.html meta={b'scrapy_callback': None, b'scrapy_errback': None, b'scrapy_meta': {'depth': 6, 'splash': {'endpoint': 'render.html', 'slot_policy': 'per_domain', 'magic_response': True, 'http_status_from_error_code': True, 'args': {'url': 'https://www.amazon.com/b/ref=s9_acss_bw_cg_LH18AES_1a1_w?node=17728828011&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-12&pf_rd_r=H90K0CDC94PWSA576F2W&pf_rd_t=101&pf_rd_p=39785ce6-6f12-4c9e-8259-0566bbb7447e&pf_rd_i=17728536011', 'wait': 0.5, 'headers': {'Referer': 'https://www.amazon.com/b/ref=nav_shopall_SWMTVT18?_encoding=UTF8&node=17728536011&pf_rd_p=bd34c6e7-aef4-4799-853a-d3b52e17360a&pf_rd_s=nav-sa-toys-kids-baby&pf_rd_t=4201&pf_rd_i=navbar-4201&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=CSYHPA14129HPJS57ZBA', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1'}}}, 'ajax_crawlable': True, 'proxy': 'https://83.149.70.159:13012', 'download_timeout': 180.0, '_splash_processed': True, 'download_slot': 'www.amazon.com'}, b'origin_is_frontier': True, b'fingerprint': b'd7b7e7a88bba0c4966f742b12dfb5dede8353f80', b'domain': {b'netloc': b'127.0.0.1:8050', b'name': b'127.0.0.1', b'scheme': b'http', b'sld': b'', b'tld': b'', b'subdomain': b'', b'fingerprint': b'4b84b15bff6ee5796152495a230e45e3d7e947d9'}, b'state': 1, b'score': 0.15625, b'jid': 0} body=b'{\n "headers": {\n '... cookies={}, headers={b'Content-Type': [b'application/json'], b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}>
Hi @DiscipleOfOne the right approach will be to use this guide https://github.com/scrapy-plugins/scrapy-splash#configuration , and Scrapy.Request with splash
meta key.
I have added the following lines to my spider.
From my debugging these lines are only executed when the seed is run. Frontera is creating the request and responses outside of the spider. Specifically in frontera/contrib/requests/converters.py.. I found these lines. mentioned in #300 .
I've edited those two function in virtualenv/lib to see if I could change the functionality. Just changing to SplashRequest and SplashResponse, but the problem is that the parse method in the spider is no longer called. Am I doing this wrong? I'm wanting to make sure my crawler is efficient so I'm wanting to avoid downloading pages more than once if I can.