scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

[MRG+1] Issue #2919: Fix FormRequest.formdata with GET method duplicates same key in query string #3579

Closed — maramsumanth closed this pull request 2 years ago

maramsumanth commented 5 years ago

Solved https://github.com/scrapy/scrapy/issues/2919.

Combining `querystr` and `formdata` into a dictionary, thereby eliminating the duplicates. @kmike, @lopuhin, please review. I have also corrected the code by building a list where we want to pass duplicate keys, as described in the comment https://github.com/scrapy/scrapy/pull/3269#issuecomment-412260614.
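A minimal sketch (illustrative only, not the actual PR diff) of the merge-and-deduplicate idea described above: duplicate keys already present in the URL are collected into lists so they survive the merge, while keys supplied via `formdata` override matching URL keys:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

def merge_get_params(url, formdata):
    merged = {}
    for key, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        # keep duplicate keys already present in the URL by collecting their values
        merged.setdefault(key, []).append(value)
    for key, value in formdata.items():
        # a key supplied via formdata replaces whatever the URL had for it
        merged[key] = [value]
    return urlencode(merged, doseq=True)

merge_get_params('http://example.com/?a=0&a=2&b=1', {'a': '9', 'c': '3'})
# -> 'a=9&b=1&c=3'
```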

I am currently preparing for GSoC 2019, so please point out any mistakes or give suggestions. I would be glad if you could give me suggestions for getting selected to work on Scrapy. Thanks in advance :) 👍

codecov[bot] commented 5 years ago

Codecov Report

Merging #3579 (66e2004) into master (9f81de2) will not change coverage. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #3579   +/-   ##
=======================================
  Coverage   88.18%   88.18%           
=======================================
  Files         162      162           
  Lines       10490    10490           
  Branches     1516     1516           
=======================================
  Hits         9251     9251           
  Misses        965      965           
  Partials      274      274           
| Impacted Files | Coverage Δ |
| --- | --- |
| scrapy/http/request/form.py | 95.48% <100.00%> (ø) |
maramsumanth commented 5 years ago

@lopuhin, what could be the reason for the codecov/patch failure here?

elacuesta commented 5 years ago

I couldn't take a deep look into the algorithm itself; these are just a few Python suggestions to improve readability:

1. There are some duplicated occurrences of `urlsplit(self.url)`; you could assign that to a variable instead (see the sketch at the end of this comment).

2. It's better to iterate sequences directly instead of by index, i.e.

```python
In [1]: keys = ['a', 'b', 'c']

In [2]: for k in keys:  # good
   ...:     print(k)
a
b
c

In [3]: for i in range(len(keys)):  # not so good
   ...:     print(keys[i])
a
b
c
```

3. If you need to iterate at the same time over more than one iterator you could use [`zip`](https://docs.python.org/3.6/library/functions.html#zip)

```python
In [4]: values = [1, 2, 3]

In [5]: for k, v in zip(keys, values):
   ...:     print('{}={}'.format(k, v))
a=1
b=2
c=3
```
Finally, if I'm not mistaken the Codecov failure is because some of the new lines are not hit by any of the existing tests; that's why it says "will decrease coverage by 0.04%".
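For suggestion 1, a tiny illustrative sketch (the `request_url` variable is a stand-in for `self.url`; this is not the actual form.py code) of calling `urlsplit` once and reusing the result:

```python
from urllib.parse import urlsplit

request_url = 'http://example.com/path?id=9&smth=a'

parts = urlsplit(request_url)                  # call urlsplit once and reuse the result
url_query_str = parts.query                    # 'id=9&smth=a'
base_url = parts._replace(query='').geturl()   # 'http://example.com/path'
```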

maramsumanth commented 5 years ago

@elacuesta, thanks for the suggestions :) 👍 I have modified the code to improve readability.

maramsumanth commented 5 years ago

Done @amarynets. Thanks for your suggestions! :) :+1:

maramsumanth commented 5 years ago

@lopuhin , can you please look into this?

lopuhin commented 5 years ago

> http://www.example.com/?a=0&a=2&b=1&c=3#fragment
> `data = {'a': 1}`
>
> Looks like it's a wrong request. I tested it with Postman and got a 400 status code. As far as I know, all data from a GET form should be sent via the query string, and duplicate keys can exist because on the server they will be transformed into a list. More detailed info on SO.

@amarynets thanks for the review! Could you please clarify what you mean here? I didn't quite understand what `data = {'a': 1}` refers to. I agree that duplicate GET arguments are possible and we should preserve them, but not add new duplicates. Do you mean it's not handled correctly in this PR? If yes, could you please clarify which case is not handled correctly?

@maramsumanth I see that some duplicate cases are checked in the tests, but could you please add an explicit test for the case when there is a duplicate GET argument, checking that it's preserved when we add another unrelated argument? (It seems this is not tested directly.)
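A rough sketch of the kind of test being requested (hypothetical test name and values, not the test that was actually added), assuming the merge behaviour under discussion at this point in the thread, i.e. duplicate arguments already in the URL are preserved when `formdata` only adds an unrelated key:

```python
from urllib.parse import urlsplit, parse_qsl

from scrapy.http import FormRequest

def test_duplicate_get_argument_preserved():
    r = FormRequest('http://www.example.com/?a=0&a=2&b=1', method='GET',
                    formdata={'c': '3'})
    # compare parsed key/value pairs so the assertion is independent of ordering
    assert sorted(parse_qsl(urlsplit(r.url).query)) == [
        ('a', '0'), ('a', '2'), ('b', '1'), ('c', '3'),
    ]
```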

maramsumanth commented 5 years ago

@lopuhin Thanks for your review. I have made the changes :) :+1:

maramsumanth commented 5 years ago

@whalebot-helmsman @lopuhin, I have made the necessary changes; can you review?

whalebot-helmsman commented 5 years ago

I think the code will be much easier to read if you rename `querystr` to `form_query_str` and `query_str` to `url_query_str`.

maramsumanth commented 5 years ago

Done @whalebot-helmsman :+1:

whalebot-helmsman commented 5 years ago

@amarynets @lopuhin @maramsumanth I think FormRequest behaviour should mirror browser behaviour. To test the browser behaviour I created a simple HTML file: https://gist.githubusercontent.com/whalebot-helmsman/a6c3d2eb78ae57a072af11f8b726ff83/raw/3f281be8ce0f08ea5d81deb443c816b6fbe846f3/forms.html. You can download a local copy and test. In the case of a GET form request, both Chrome and Firefox drop `url_query_str` from the URL and add `form_query_str`.

whalebot-helmsman commented 5 years ago

@maramsumanth Based on browser behaviour, I proposed changing the implementation in the current PR. You upvoted this proposal but didn't apply the proposed changes. Is there something that prevents you from applying them?

maramsumanth commented 5 years ago

Thanks for reminding me @whalebot-helmsman. I was working on the redirect middleware part of Scrapy. Coming to this PR, I was a little bit confused. In the following example, `FormRequest('http://example.com/?id=9', method='GET', formdata={'id': '8'}).url` should give `'http://example.com/?id=8'`, i.e. it should replace the value if the same key is found. It is working as it should. Can you give me a detailed explanation of what the problem is? :) Thanks

maramsumanth commented 5 years ago

@whalebot-helmsman, can you please look into this so that I can work on it ASAP? :+1: :)

whalebot-helmsman commented 5 years ago

I referenced the HTML file as an example. It has two forms. The first one uses a GET request, has `a` and `b` as keys, and targets the URL http://example.com/f?a=1&d=2. After you click the Submit button you get to http://example.com/f?a=3&b=2. The browser not only replaces the value for the `a` key, but also drops the value for the `d` key. Reformulating in terms of Python structures, we get:

```python
FormRequest('http://example.com/?id=9&smth=a', method='GET', formdata={'id': '8'}).url == 'http://example.com/?id=8'
```
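For reference, a minimal Python sketch of the browser behaviour described above (illustrative only; `browser_like_get_submit` is a made-up helper, not Scrapy API): the form's action URL loses its existing query string and gets only the encoded form fields.

```python
from urllib.parse import urlsplit, urlencode, urlunsplit

def browser_like_get_submit(action_url, form_fields):
    # browsers discard the query string already on the action URL and
    # replace it entirely with the encoded form fields
    parts = urlsplit(action_url)
    query = urlencode(form_fields, doseq=True)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ''))

browser_like_get_submit('http://example.com/f?a=1&d=2', {'a': '3', 'b': '2'})
# -> 'http://example.com/f?a=3&b=2'
```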
maramsumanth commented 5 years ago

I have Ubuntu, on which I am using the Chromium browser. But for `FormRequest('http://example.com/?id=9&smth=a', method='GET', formdata={'id': '8'}).url` my output is `'http://example.com/?smth=a&id=8'`, which is != `'http://example.com/?id=8'`. The above case is handled in the last part of the tests that I have written in this PR. Do Chromium and Chrome behave differently?

maramsumanth commented 5 years ago

@whalebot-helmsman, when I tested the HTML file you provided, I can see that in the case of a GET form request both Chrome and Firefox drop `url_query_str` from the URL and add `form_query_str`. But running it through the command line as described in the comment above, I cannot see any problem.

whalebot-helmsman commented 5 years ago

> I can see that in the case of a GET form request both Chrome and Firefox drop `url_query_str` from the URL and add `form_query_str`.

Scrapy and related libraries should mirror browser behaviour. At this moment, in master and in your branch:

```python
>>> from scrapy.http.request.form import FormRequest; FormRequest('http://example.com/?id=9&smth=a', method='GET', formdata={'id': '8'}).url == 'http://example.com/?id=8'
False
```

This is an incorrect representation of browser behaviour. I understand that such behaviour was implemented before your PR, but since we have started to work on this part, we should also fix it.

maramsumanth commented 5 years ago

Now I understand the problem. Is there any Python code to get the browser that the user is using, @whalebot-helmsman?

whalebot-helmsman commented 5 years ago

> @whalebot-helmsman, when I tested the HTML file you provided, I can see that in the case of a GET form request both Chrome and Firefox drop `url_query_str` from the URL and add `form_query_str`.

It is a browser behaviour. We agreed on that.

Scrapy's implementation is separate from the installed browser; Scrapy's behaviour doesn't depend on which browser is installed. In the master branch we preserve both `url_query_str` and `form_query_str`; in the branch for this PR we replace values for common keys in `url_query_str` with values from `form_query_str`. Scrapy's behaviour in both branches doesn't mirror browser behaviour. That is a problem, and we should fix it.

I propose to fix it in this PR.

Current behaviour:

```python
>>> from scrapy.http.request.form import FormRequest; FormRequest('http://example.com/?id=9&smth=a', method='GET', formdata={'id': '8'}).url
'http://example.com/?smth=a&id=8'
```

Behaviour after the fix:

```python
>>> from scrapy.http.request.form import FormRequest; FormRequest('http://example.com/?id=9&smth=a', method='GET', formdata={'id': '8'}).url
'http://example.com/?id=8'
```
maramsumanth commented 5 years ago

@whalebot-helmsman Thanks for guiding me. Until now I was thinking that we had to account somewhere for the browser name and proceed accordingly. Now I have updated the PR and changed the tests as you suggested :)
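A minimal sketch of the kind of test that captures the agreed behaviour (hypothetical test name; not necessarily the exact test added in the PR):

```python
from scrapy.http import FormRequest

def test_get_formdata_replaces_url_query():
    # mirroring browser behaviour: the URL's own query string is dropped and
    # replaced entirely by the encoded formdata
    r = FormRequest('http://example.com/?id=9&smth=a', method='GET',
                    formdata={'id': '8'})
    assert r.url == 'http://example.com/?id=8'
```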

maramsumanth commented 5 years ago

@whalebot-helmsman, should I change anything, or is it safe to merge? :+1: Thanks :)

whalebot-helmsman commented 5 years ago

It is safe to merge now. But merging takes more time and is not required by the GSoC rules. We can keep it as it is for now.

maramsumanth commented 5 years ago

Now it's fine :) :+1: