Closed: lime-n closed this issue 2 years ago
Websites are designed to be user-friendly, not crawler-friendly. There could be a whole lecture on why cookies are required, but in short: the web runs on the HTTP protocol, which is designed to be stateless, meaning the server treats every incoming request as a new request. By design, it does not remember who you are. To build complex applications, we have to make sure the server DOES remember who you are. There are quite a number of ways a server can remember you, and one of them is cookies sent with the request. That is why, in some cases, you need to simulate your requests with cookies in the request headers as well.
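As a rough illustration of that statefulness (a sketch using requests against a made-up login endpoint, not code from this issue):
import requests

# Each bare call is a brand-new, stateless HTTP request: the server has no
# way of knowing these two requests come from the same client.
requests.post('https://example.com/login', data={'user': 'me', 'password': 'secret'})
requests.get('https://example.com/account')  # typically rejected: no session cookie sent

# A Session keeps the cookies the server sets on login and replays them on
# later requests, which is what makes the server "remember" who you are.
with requests.Session() as session:
    session.post('https://example.com/login', data={'user': 'me', 'password': 'secret'})
    session.get('https://example.com/account')  # session cookie sent automatically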
I had to include the cookies from the headers as an argument in scrapy.FormRequest(). […] when using requests.post() I can get a response 200 by just using the payload and headers.
This sounds like something to look at, but you would have to provide a minimal reproducible example, written both with Scrapy and requests (but the same example in terms of input and expected output) to allow us to have a proper look.
However, mind that I find it hard to believe Scrapy is at fault here. While working on providing those examples, I suspect you will find the actual root cause of the failed response to be something other than “Scrapy needs cookies, requests does not”.
I hope the first script I provided counts as a minimal example; the bulk of the code is really just the headers and the payload. Beyond that, the rest is simple, as I'm only sending a request and checking whether I get a response. So it can be copied and run as-is. The issue I get is a 400 response, which I haven't been able to figure out. It does look like a Scrapy issue, though - can you check?
Here's the code for requests:
import requests
headers = {
'authority': 'www.etsy.com',
'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
'x-csrf-token': '3:1641383062:Exn8HMFDcc0UtitU6NOM3o3x8BGB:864dc90d926383d90686f37be56f69685b939f0f306b10a99bcd9016209f15d4',
'sec-ch-ua-mobile': '?0',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
'accept': '*/*',
'x-requested-with': 'XMLHttpRequest',
'x-page-guid': 'eeda48b359a.aa23cce28f31baac6f24.00',
'x-detected-locale': 'GBP|en-GB|GB',
'sec-ch-ua-platform': '"Linux"',
'origin': 'https://www.etsy.com',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.etsy.com/search/clothing/womens-clothing?q=20s&explicit=1&ship_to=GB&page=2&ref=pagination',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
'cookie': 'uaid=G-_aWcvXqYHevnNO3ane9nOUmwNjZACCxCuVe2B0tVJpYmaKkpVSaVpUSoBZaGZVQL6Lj4mRv7ObrmmRR3F-aLyHp1ItAwA.; user_prefs=bNwL2wOEkWxqOSu2A1-CWlR6cr9jZACCxCuVe2B0tJK7U4CSTl5pTo6OUmqerruTko4SiACLGEEoXEQsAwA.; fve=1641314748.0; utm_lps=google__cpc; ua=531227642bc86f3b5fd7103a0c0b4fd6; p=eyJnZHByX3RwIjoxLCJnZHByX3AiOjF9; _gcl_au=1.1.1757627174.1641314793; _gid=GA1.2.1898390797.1641314793; __adal_cw=1641314793715; _pin_unauth=dWlkPVltVmtZemxoTldNdFpURXdPQzAwWkRWbUxXRTJOV1l0TTJGaE9URXdZVEEwTlRBeQ; last_browse_page=https%3A%2F%2Fwww.etsy.com%2Fuk%2F; __adal_ses=*; __adal_ca=so%3DGoogle%26me%3Dorganic%26ca%3D%28not%2520set%29%26co%3D%28not%2520set%29%26ke%3D%28not%2520set%29; search_options={"prev_search_term":"20s","item_language":null,"language_carousel":null}; _ga=GA1.2.559839679.1641314793; tsd=%7B%7D; __adal_id=952d43d7-5b80-4907-99d7-6f6baa9f4fe1.1641314794.3.1641383063.1641383059.2fe7a338-93bd-441f-b295-80549adbef7b; _tq_id.TV-27270909-1.a4d5=e2f6af8c27dee5e4.1641314794.0.1641383063..; _uetsid=dff577e06d7d11ec9617cbf4cc51b5b2; _uetvid=dff5f2706d7d11ec932fd3c5b816ab20; granify.uuid=bfd14e46-e8fa-4e7b-bce7-6f05dcb4b215; pla_spr=1; _ga_KR3J610VYM=GS1.1.1641383058.3.1.1641383118.60; exp_hangover=qk2fpkLi1lphuLsCKeq4gAe9BvxjZACCxCuVe8D01Zbb1UrlqUnxiUUlmWmZyZmJOfE5iSWpecmV8YUm8UYGhpZKVkqZeak5memZSTmpSrUMAA..; granify.session.QrsCf=-1',
}
data = {
'log_performance_metrics': 'true',
'specs[async_search_results][]': 'Search2_ApiSpecs_WebSearch',
'specs[async_search_results][1][search_request_params][detected_locale][language]': 'en-GB',
'specs[async_search_results][1][search_request_params][detected_locale][currency_code]': 'GBP',
'specs[async_search_results][1][search_request_params][detected_locale][region]': 'GB',
'specs[async_search_results][1][search_request_params][locale][language]': 'en-GB',
'specs[async_search_results][1][search_request_params][locale][currency_code]': 'GBP',
'specs[async_search_results][1][search_request_params][locale][region]': 'GB',
'specs[async_search_results][1][search_request_params][name_map][query]': 'q',
'specs[async_search_results][1][search_request_params][name_map][query_type]': 'qt',
'specs[async_search_results][1][search_request_params][name_map][results_per_page]': 'result_count',
'specs[async_search_results][1][search_request_params][name_map][min_price]': 'min',
'specs[async_search_results][1][search_request_params][name_map][max_price]': 'max',
'specs[async_search_results][1][search_request_params][parameters][q]': '20s',
'specs[async_search_results][1][search_request_params][parameters][explicit]': '1',
'specs[async_search_results][1][search_request_params][parameters][ship_to]': 'GB',
'specs[async_search_results][1][search_request_params][parameters][page]': '2',
'specs[async_search_results][1][search_request_params][parameters][ref]': 'pagination',
'specs[async_search_results][1][search_request_params][parameters][facet]': 'clothing/womens-clothing',
'specs[async_search_results][1][search_request_params][parameters][referrer]': 'https://www.etsy.com/search/clothing/womens-clothing?q=20s&explicit=1&ship_to=GB',
'specs[async_search_results][1][search_request_params][user_id]': '',
'specs[async_search_results][1][request_type]': 'pagination_preact',
'specs[async_search_results][1][is_eligible_for_spa_reformulations]': 'false',
'view_data_event_name': 'search_async_pagination_specview_rendered'
}
# Capture the response so the status can be checked; this returns a 200
# with just the payload and the headers (cookies included in the headers).
response = requests.post(
    'https://www.etsy.com/api/v3/ajax/bespoke/member/neu/specs/async_search_results',
    headers=headers,
    data=data,
)
print(response.status_code)
Your example cannot count as minimal. For one, it uses Splash.
Please, provide an example with only the minimum code needed to reproduce the issue. For the sake of the project, it is better for reporters to spend time on that, so that maintainers can focus their time on solving issues, not on trying to reproduce them.
Here's the bare minimum:
import scrapy
headers = {
'authority': 'www.etsy.com',
'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
'x-csrf-token': '3:1641383062:Exn8HMFDcc0UtitU6NOM3o3x8BGB:864dc90d926383d90686f37be56f69685b939f0f306b10a99bcd9016209f15d4',
'sec-ch-ua-mobile': '?0',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
'accept': '*/*',
'x-requested-with': 'XMLHttpRequest',
'x-page-guid': 'eeda48b359a.aa23cce28f31baac6f24.00',
'x-detected-locale': 'GBP|en-GB|GB',
'sec-ch-ua-platform': '"Linux"',
'origin': 'https://www.etsy.com',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.etsy.com/search/clothing/womens-clothing?q=20s&explicit=1&ship_to=GB&page=2&ref=pagination',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
'cookie': 'uaid=G-_aWcvXqYHevnNO3ane9nOUmwNjZACCxCuVe2B0tVJpYmaKkpVSaVpUSoBZaGZVQL6Lj4mRv7ObrmmRR3F-aLyHp1ItAwA.; user_prefs=bNwL2wOEkWxqOSu2A1-CWlR6cr9jZACCxCuVe2B0tJK7U4CSTl5pTo6OUmqerruTko4SiACLGEEoXEQsAwA.; fve=1641314748.0; utm_lps=google__cpc; ua=531227642bc86f3b5fd7103a0c0b4fd6; p=eyJnZHByX3RwIjoxLCJnZHByX3AiOjF9; _gcl_au=1.1.1757627174.1641314793; _gid=GA1.2.1898390797.1641314793; __adal_cw=1641314793715; _pin_unauth=dWlkPVltVmtZemxoTldNdFpURXdPQzAwWkRWbUxXRTJOV1l0TTJGaE9URXdZVEEwTlRBeQ; last_browse_page=https%3A%2F%2Fwww.etsy.com%2Fuk%2F; __adal_ses=*; __adal_ca=so%3DGoogle%26me%3Dorganic%26ca%3D%28not%2520set%29%26co%3D%28not%2520set%29%26ke%3D%28not%2520set%29; search_options={"prev_search_term":"20s","item_language":null,"language_carousel":null}; _ga=GA1.2.559839679.1641314793; tsd=%7B%7D; __adal_id=952d43d7-5b80-4907-99d7-6f6baa9f4fe1.1641314794.3.1641383063.1641383059.2fe7a338-93bd-441f-b295-80549adbef7b; _tq_id.TV-27270909-1.a4d5=e2f6af8c27dee5e4.1641314794.0.1641383063..; _uetsid=dff577e06d7d11ec9617cbf4cc51b5b2; _uetvid=dff5f2706d7d11ec932fd3c5b816ab20; granify.uuid=bfd14e46-e8fa-4e7b-bce7-6f05dcb4b215; pla_spr=1; _ga_KR3J610VYM=GS1.1.1641383058.3.1.1641383118.60; exp_hangover=qk2fpkLi1lphuLsCKeq4gAe9BvxjZACCxCuVe8D01Zbb1UrlqUnxiUUlmWmZyZmJOfE5iSWpecmV8YUm8UYGhpZKVkqZeak5memZSTmpSrUMAA..; granify.session.QrsCf=-1',
}
payload = {
'log_performance_metrics': 'true',
'specs[async_search_results][]': 'Search2_ApiSpecs_WebSearch',
'specs[async_search_results][1][search_request_params][detected_locale][language]': 'en-GB',
'specs[async_search_results][1][search_request_params][detected_locale][currency_code]': 'GBP',
'specs[async_search_results][1][search_request_params][detected_locale][region]': 'GB',
'specs[async_search_results][1][search_request_params][locale][language]': 'en-GB',
'specs[async_search_results][1][search_request_params][locale][currency_code]': 'GBP',
'specs[async_search_results][1][search_request_params][locale][region]': 'GB',
'specs[async_search_results][1][search_request_params][name_map][query]': 'q',
'specs[async_search_results][1][search_request_params][name_map][query_type]': 'qt',
'specs[async_search_results][1][search_request_params][name_map][results_per_page]': 'result_count',
'specs[async_search_results][1][search_request_params][name_map][min_price]': 'min',
'specs[async_search_results][1][search_request_params][name_map][max_price]': 'max',
'specs[async_search_results][1][search_request_params][parameters][q]': '30s',
'specs[async_search_results][1][search_request_params][parameters][explicit]': '1',
'specs[async_search_results][1][search_request_params][parameters][locationQuery]': '2635167',
'specs[async_search_results][1][search_request_params][parameters][ship_to]': 'GB',
'specs[async_search_results][1][search_request_params][parameters][page]': '4',
'specs[async_search_results][1][search_request_params][parameters][ref]': 'pagination',
'specs[async_search_results][1][search_request_params][parameters][facet]': 'clothing/womens-clothing',
'specs[async_search_results][1][search_request_params][parameters][referrer]': 'https://www.etsy.com/search/clothing/womens-clothing?q=30s&explicit=1locationQuery=2635167&ship_to=GB&page=3&ref=pagination',
'specs[async_search_results][1][search_request_params][user_id]': '',
'specs[async_search_results][1][request_type]': 'pagination_preact',
'specs[async_search_results][1][is_eligible_for_spa_reformulations]': 'false',
'view_data_event_name': 'search_async_pagination_specview_rendered'
}
class EtsySpider(scrapy.Spider):
    name = 'fails'

    def start_requests(self):
        yield scrapy.FormRequest(
            url='https://www.etsy.com/api/v3/ajax/bespoke/member/neu/specs/async_search_results',
            method='POST',
            formdata=payload,
            headers=headers,
            callback=self.parse,
        )
Output:
Crawled (400) <POST https://www.etsy.com/api/v3/ajax/bespoke/member/neu/specs/async_search_results> (referer: https://www.etsy.com/search/clothing/womens-clothing?q=20s&explicit=1&ship_to=GB&page=2&ref=pagination)
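(Not part of the original report, but as an aside: a sketch of how the body of that 400 could be inspected, using handle_httpstatus_list so the error response still reaches the callback; it reuses the headers and payload dicts defined above.)
import scrapy

class EtsyDebugSpider(scrapy.Spider):
    name = 'fails_debug'
    # Without this, HttpErrorMiddleware drops non-2xx responses before parse() runs.
    handle_httpstatus_list = [400]

    def start_requests(self):
        yield scrapy.FormRequest(
            url='https://www.etsy.com/api/v3/ajax/bespoke/member/neu/specs/async_search_results',
            formdata=payload,
            headers=headers,
            callback=self.parse,
        )

    def parse(self, response):
        # The error body often says why the request was rejected.
        self.logger.info('Status %s, body: %s', response.status, response.text[:500])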
Here are the cookies, to check that it works afterwards:
cookies = {
"user_prefs": "2sjEL59UUglDjNIW6TKc04MvLTVjZACCxJMbvsPoaKXQYBclnbzSnBwdpdQ83dBgJR2lUEeoiBGEwkXEMgAA",
"fve": "1640607991.0",
"ua": "531227642bc86f3b5fd7103a0c0b4fd6",
"_gcl_au": "1.1.717562651.1640607992",
"uaid": "E7bYwrWVwTy7YGe_b_ipYT3Avd9jZACCxJMbvoPpqwvzqpVKEzNTlKyUnLJ9Io3DTQt1k53MwiojXTLzvZPCS31yCoPC_JRqGQA.",
"pla_spr": "0",
"_gid": "GA1.2.1425785976.1641390447",
"_dc_gtm_UA-2409779-1": "1",
"_pin_unauth": "dWlkPU0yVTRaamxoTWpjdFlqTTVZUzAwT0RJeExXRmpNamt0WlROalpXTTVNREE0WkRVNQ",
"_ga": "GA1.1.1730759327.1640607993",
"_uetsid": "052ece906e2e11ecb56a0390ed629376",
"_uetvid": "39de7550671011ec80d2dbfaa05c901b",
"exp_hangover": "pB4zSokzfzMIT9Jzi7zIwmXybCJjZACCxJMbvoPpqwt7qpXKU5PiE4tKMtMykzMTc-JzEktS85Ir4wtN4o0MDC2VrJQy81JzMtMzk3JSlWoZAA..",
"_ga_KR3J610VYM": "GS1.1.1641390446.2.1.1641390474.32"
}
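For reference, a sketch of how those cookies would be passed through Scrapy's cookies argument instead of the Cookie header (it assumes the headers, payload, and cookies dicts above are in scope):
import scrapy

class EtsyWorksSpider(scrapy.Spider):
    name = 'works'

    def start_requests(self):
        # Passing the jar via `cookies` lets Scrapy's CookiesMiddleware manage it,
        # instead of relying on a raw Cookie string inside the headers dict.
        yield scrapy.FormRequest(
            url='https://www.etsy.com/api/v3/ajax/bespoke/member/neu/specs/async_search_results',
            formdata=payload,
            headers={k: v for k, v in headers.items() if k != 'cookie'},
            cookies=cookies,
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('Status: %s', response.status)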
The first script is pretty much the same as the requests one, so it should work as the 'bare minimum': it's only the import of scrapy, the headers, the payload, and start_requests. I hope this serves you well!
I have just realized that you are sending the cookies in the headers in requests.
So the problem is that in Scrapy the cookies do not work if you set them through the headers parameter, but do work if set through the cookies parameter?
Yes, that's the issue I was facing. I couldn't understand why requests accepts the cookies in the headers but Scrapy does not.
Closing as duplicate of #1992. Long story short, we fixed it on #2400 but that introduced another bug (#4717), so we reverted it. And by "we" I mean myself, I'm the one to blame :sweat_smile: We do mention it in the docs, though:
I had a problem earlier when trying to scrape a site. I thought there was an issue with requests elsewhere, but it turned out that I had to include the cookies from the headers as an argument in scrapy.FormRequest(). What is the reasoning behind this? When using requests.post() I can get a response 200 by just using the payload and headers. Why does Scrapy behave differently in this situation? I had thought it follows the same structure as requests in the backend, but it seems there's more to it. For example, here's what I had, which gives a 404 error:
And when I include the following, I can crawl the information: