Open CuriousG102 opened 9 years ago
Hmm, the form has an empty action, so Scrapy assumes response.url as the target URL.
It seems the problem is that Scrapy appends the form parameters to the URL's existing query string instead of overriding them. I'm not sure whether this is the desired behavior in other cases.
You can work around it by setting a clean URL:
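from scrapy.http import FormRequest
from w3lib.url import url_query_cleaner

# Pass a URL without its query string so the form fields are not appended to the old parameters.
FormRequest.from_response(
    response,
    formxpath='//div[@class="page-forward"]/form[1]',
    url=url_query_cleaner(response.url),
)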
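For anyone wondering what the workaround does: as far as I understand, url_query_cleaner called without a parameter list drops the whole query string. A rough standard-library equivalent, as a sketch only (the helper name strip_query and the example.com URL are made up for illustration and are not part of Scrapy or w3lib):

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return url without its query string and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep scheme, host and path; blank out the query and fragment components.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


if __name__ == "__main__":
    print(strip_query("http://example.com/results/index.WBX?s_in_page_direction=B&s_in_search_name=Q"))
```

Passing the stripped URL as url= means the form fields end up as the only parameters on the request.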
Thank you so much for your help! I did not expect someone to literally provide me with the fix code.
In a little less than a month, I may have time to submit a pull request to fix this myself. But my suggestion for this scraping framework would be to override parameters already on the URL. The expected behavior of a method like "from_response" would be that it emulates the behavior of a browser given that response. This expectation is reinforced for an end user by parameters like "clickdata", which also imply an emulation of browser behavior.
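To make the suggestion concrete: when a browser submits a GET form, it replaces the query string of the action URL with the encoded form fields rather than appending to it. A minimal sketch of that behavior using only the standard library (this is not Scrapy's implementation; the function name browser_like_get_url and the example.com URL are assumptions for illustration):

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def browser_like_get_url(action_url, form_fields):
    """Build a GET URL whose query consists only of the form fields.

    form_fields is a list of (name, value) pairs, in form order.
    Any query string already present on action_url is discarded.
    """
    parts = urlsplit(action_url)
    query = urlencode(form_fields)
    # The fragment, if any, is left untouched in this sketch.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


if __name__ == "__main__":
    current = "http://example.com/index.WBX?s_in_page_direction=B&s_in_page_isn=648311"
    fields = [("s_in_page_direction", "F"), ("s_in_page_isn", "865912")]
    print(browser_like_get_url(current, fields))
```

With this approach each parameter appears once, carrying the value from the form.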
This is still happening with Scrapy 1.1.2:
$ scrapy shell
2016-09-19 18:14:51 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot)
In [1]: fetch('http://utdirect.utexas.edu/ctl/ecis/results/index.WBX?s_in_page_isn=648311&s_in_page_query=Quesada+Gonzalez%2C+Carlos+20139MUS201M&s_in_max_nbr_return=0&s_in_search_query=Q&s_in_search_type_sw=N&s_in_page_direction=B&s_in_action_sw=P&s_in_search_name=Q')
2016-09-19 18:14:55 [scrapy] INFO: Spider opened
2016-09-19 18:14:57 [scrapy] DEBUG: Crawled (200) <GET http://utdirect.utexas.edu/ctl/ecis/results/index.WBX?s_in_page_isn=648311&s_in_page_query=Quesada+Gonzalez%2C+Carlos+20139MUS201M&s_in_max_nbr_return=0&s_in_search_query=Q&s_in_search_type_sw=N&s_in_page_direction=B&s_in_action_sw=P&s_in_search_name=Q> (referer: None)
In [3]: import scrapy
In [8]: from urllib.parse import parse_qs
In [9]: from urllib.parse import urlparse
In [12]: parse_qs(urlparse(response.url).query)
Out[12]:
{'s_in_action_sw': ['P'],
's_in_max_nbr_return': ['0'],
's_in_page_direction': ['B'],
's_in_page_isn': ['648311'],
's_in_page_query': ['Quesada Gonzalez, Carlos 20139MUS201M'],
's_in_search_name': ['Q'],
's_in_search_query': ['Q'],
's_in_search_type_sw': ['N']}
In [14]: frq = scrapy.http.FormRequest.from_response(response, formxpath='//div[@class="page-forward"]/form[1]')
In [15]: parse_qs(urlparse(frq.url).query)
Out[15]:
{'s_in_action_sw': ['P', 'P'],
's_in_max_nbr_return': ['0', '0'],
's_in_page_direction': ['B', 'F'],
's_in_page_isn': ['648311', '865912'],
's_in_page_query': ['Quesada Gonzalez, Carlos 20139MUS201M',
'Que, Emily 20162CH 431'],
's_in_search_name': ['Q', 'Q'],
's_in_search_query': ['Q', 'Q'],
's_in_search_type_sw': ['N', 'N']}
In [16]:
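The doubled values in Out[15] are the symptom to look for. A small check along these lines can flag a generated URL where a parameter shows up more than once (a sketch only; duplicated_params is a hypothetical helper name and the URL below is a shortened, made-up stand-in for the real one):

```python
from urllib.parse import parse_qs, urlsplit


def duplicated_params(url):
    """Return the query parameters that occur more than once in url."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return {name: values for name, values in query.items() if len(values) > 1}


if __name__ == "__main__":
    url = "http://example.com/index.WBX?s_in_page_direction=B&s_in_search_name=Q&s_in_page_direction=F"
    print(duplicated_params(url))
```

An empty result means no parameter was appended on top of an existing one.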
Hi there, I want to extract data from [http://www.childrensplace.com/shop/us/p/boys-clothing/boys-clothing/boys-tops-and-boys-shirts/Boys-Short-Sleeve-Stripe-Pique-Polo-2083052-1363#close]. I want to get the names of the sizes using (sizes = response.xpath('//ul[@id="select_TCPSize"]')), but it returns an empty list. What is the problem with it? Is it a POST request? If so, how can I fetch it? Thanks in advance. @eliasdorneles
@saleemkhan17,
we try to keep GitHub issues for bug reports (and design or feature discussions).
Troubleshooting questions should be posted to either Stack Overflow (with the "scrapy" tag) or /r/scrapy.
Your question does not look related to this open issue. It is almost certainly because <ul name="select_TCPSize" id="select_TCPSize"> is generated by JavaScript, and Scrapy neither executes nor interprets JavaScript.
@redapple Could you kindly suggest an alternative library? Thanks.
@saleemkhan17, please look for Scrapy and JavaScript on Stack Overflow. One popular way is to use Scrapy + Splash + scrapy-splash.
Thanks
To reproduce:
Open scrapy shell and run:
fetch('http://utdirect.utexas.edu/ctl/ecis/results/index.WBX?s_in_page_isn=648311&s_in_page_query=Quesada+Gonzalez%2C+Carlos+20139MUS201M&s_in_max_nbr_return=0&s_in_search_query=Q&s_in_search_type_sw=N&s_in_page_direction=B&s_in_action_sw=P&s_in_search_name=Q')
scrapy.http.FormRequest.from_response(response, formxpath='//div[@class="page-forward"]/form[1]')
Compare the URL of the resulting GET request with the one produced when clicking the button in a browser.
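The comparison in the last step can be done by parsing both query strings and listing the parameters whose values differ. A minimal sketch, assuming both URLs have been copied out as strings (query_diff and the example.com URLs are hypothetical, for illustration only):

```python
from urllib.parse import parse_qs, urlsplit


def query_diff(scrapy_url, browser_url):
    """Map each differing parameter to (values in scrapy_url, values in browser_url)."""
    a = parse_qs(urlsplit(scrapy_url).query, keep_blank_values=True)
    b = parse_qs(urlsplit(browser_url).query, keep_blank_values=True)
    # A parameter missing from one URL is reported with None on that side.
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    from_scrapy = "http://example.com/index.WBX?s_in_page_direction=B&s_in_page_direction=F"
    from_browser = "http://example.com/index.WBX?s_in_page_direction=F"
    print(query_diff(from_scrapy, from_browser))
```

An empty dict means the two requests carry the same parameters.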