scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Scrapy form_response includes parameters not in form #1179

Open CuriousG102 opened 9 years ago

CuriousG102 commented 9 years ago

To reproduce, open the Scrapy shell and run:

fetch('http://utdirect.utexas.edu/ctl/ecis/results/index.WBX?s_in_page_isn=648311&s_in_page_query=Quesada+Gonzalez%2C+Carlos+20139MUS201M&s_in_max_nbr_return=0&s_in_search_query=Q&s_in_search_type_sw=N&s_in_page_direction=B&s_in_action_sw=P&s_in_search_name=Q')
scrapy.http.FormRequest.from_response(response, formxpath='//div[@class="page-forward"]/form[1]')

Then compare the URL of the resulting GET request with the one the browser requests when you click the button.

eliasdorneles commented 9 years ago

Hmm, the form has an empty action, so Scrapy assumes response.url as the target URL. It seems the problem is that Scrapy appends the form parameters to the URL's existing query string instead of overriding them. I'm not sure whether appending is the desired behavior in other cases.
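To illustrate the symptom, here is a minimal sketch using only the standard library. It does not call Scrapy or reproduce its internals; the helper `append_params`, the example URL, and the field values are all hypothetical and only show what appending form fields to an existing query string does.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def append_params(url, formdata):
    """Hypothetical helper: append form fields to the URL's existing query."""
    sep = "&" if urlsplit(url).query else "?"
    return url + sep + urlencode(formdata)


# Hypothetical page URL and form fields, for illustration only.
page_url = "http://example.com/results?page=1&direction=B"
form_fields = {"page": "2", "direction": "F"}

appended = append_params(page_url, form_fields)
print(appended)

# Every key that appears both in the URL and in the form now has two values.
print(parse_qs(urlsplit(appended).query))
```

Each duplicated key carries both the old value from the page URL and the new value from the form, so the server may pick either one.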

You can work around it by setting a clean URL:

from scrapy.http import FormRequest
from w3lib.url import url_query_cleaner

FormRequest.from_response(response,
                          formxpath='//div[@class="page-forward"]/form[1]',
                          url=url_query_cleaner(response.url))

CuriousG102 commented 9 years ago

Thank you so much for your help! I did not expect someone to literally provide me with the fix code.

In a little less than a month, I may have time to submit a pull request to fix this myself. My suggestion for the framework would be to override the parameters already on the URL. A method named "from_response" should be expected to emulate what a browser does with that response. Parameters like "click_data" reinforce that expectation for an end user, since they also imply emulation of browser behavior.
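The browser-like behavior described above could be sketched as follows. This is not Scrapy code; `browser_like_get_url` is a hypothetical helper, and it assumes that for a GET form a browser resolves the action against the document URL (an empty action resolves to the document URL itself) and then replaces the query string with the encoded form fields.

```python
from urllib.parse import urlencode, urljoin, urlsplit, urlunsplit


def browser_like_get_url(document_url, action, formdata):
    """Hypothetical helper: URL a browser would request for a GET form.

    The action is resolved against the document URL, then the form fields
    replace whatever query string the resolved URL already carries.
    """
    target = urljoin(document_url, action)  # an empty action yields document_url
    parts = urlsplit(target)
    return urlunsplit(parts._replace(query=urlencode(formdata)))


# Hypothetical page URL and form fields, for illustration only.
document_url = "http://example.com/results?page=1&direction=B"

same_page = browser_like_get_url(document_url, "", {"page": "2", "direction": "F"})
other_page = browser_like_get_url(document_url, "/search", {"q": "x"})
print(same_page)
print(other_page)
```

With an empty action, the old `page` and `direction` values are dropped and only the form's values remain, which matches the single-valued parameters one would expect to see in the browser's address bar.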

redapple commented 8 years ago

This is still happening with Scrapy 1.1.2:

$ scrapy shell 
2016-09-19 18:14:51 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot)

In [1]: fetch('http://utdirect.utexas.edu/ctl/ecis/results/index.WBX?s_in_page_isn=648311&s_in_page_query=Quesada+Gonzalez%2C+Carlos+20139MUS201M&s_in_max_nbr_return=0&s_in_search_query=Q&s_in_search_type_sw=N&s_in_page_direction=B&s_in_action_sw=P&s_in_search_name=Q')
2016-09-19 18:14:55 [scrapy] INFO: Spider opened
2016-09-19 18:14:57 [scrapy] DEBUG: Crawled (200) <GET http://utdirect.utexas.edu/ctl/ecis/results/index.WBX?s_in_page_isn=648311&s_in_page_query=Quesada+Gonzalez%2C+Carlos+20139MUS201M&s_in_max_nbr_return=0&s_in_search_query=Q&s_in_search_type_sw=N&s_in_page_direction=B&s_in_action_sw=P&s_in_search_name=Q> (referer: None)

In [3]: import scrapy

In [8]: from urllib.parse import parse_qs

In [9]: from urllib.parse import urlparse

In [12]: parse_qs(urlparse(response.url).query)
Out[12]: 
{'s_in_action_sw': ['P'],
 's_in_max_nbr_return': ['0'],
 's_in_page_direction': ['B'],
 's_in_page_isn': ['648311'],
 's_in_page_query': ['Quesada Gonzalez, Carlos 20139MUS201M'],
 's_in_search_name': ['Q'],
 's_in_search_query': ['Q'],
 's_in_search_type_sw': ['N']}

In [14]: frq = scrapy.http.FormRequest.from_response(response, formxpath='//div[@class="page-forward"]/form[1]')

In [15]: parse_qs(urlparse(frq.url).query)
Out[15]: 
{'s_in_action_sw': ['P', 'P'],
 's_in_max_nbr_return': ['0', '0'],
 's_in_page_direction': ['B', 'F'],
 's_in_page_isn': ['648311', '865912'],
 's_in_page_query': ['Quesada Gonzalez, Carlos 20139MUS201M',
  'Que, Emily               20162CH 431'],
 's_in_search_name': ['Q', 'Q'],
 's_in_search_query': ['Q', 'Q'],
 's_in_search_type_sw': ['N', 'N']}

saleemkhan17 commented 7 years ago

Hi there, I want to extract data from [http://www.childrensplace.com/shop/us/p/boys-clothing/boys-clothing/boys-tops-and-boys-shirts/Boys-Short-Sleeve-Stripe-Pique-Polo-2083052-1363#close]. I want to get the names of the sizes using (sizes = response.xpath('//ul[@id="select_TCPSize"]')), but it returns an empty list. What is the problem with it? Is it a POST request? If so, how can I fetch it? Thanks in advance. @eliasdorneles

redapple commented 7 years ago

@saleemkhan17, we try to keep GitHub issues for bug reports (and design or feature discussions). Troubleshooting questions should be posted to either Stack Overflow (with the tag "scrapy") or /r/scrapy. Your problem does not look related to this open issue. It's most likely because <ul name="select_TCPSize" id="select_TCPSize"> is generated by JavaScript, and Scrapy does not execute or interpret JavaScript.

saleemkhan17 commented 7 years ago

@redapple Could you kindly suggest an alternative library? Thanks.

redapple commented 7 years ago

@saleemkhan17, please search for Scrapy and JavaScript on Stack Overflow. One popular approach is to use Scrapy + Splash + scrapy-splash.

saleemkhan17 commented 7 years ago

Thanks