lukasschwab / arxiv.py

Python wrapper for the arXiv API
MIT License

Weird url encoding problem of arxiv API #20

Closed refraction-ray closed 5 years ago

refraction-ray commented 5 years ago

This issue may not be a bug in this package, but rather something related to how the arXiv API handles the encoded URL.

Say I want to make a query with multiple search fields, e.g. sq="au:balents_leon+AND+cat:cond-mat.str-el", and then call the wrapper function arxiv.query(search_query=sq) to get the results. This doesn't work. The reason is the urlencode() function from urllib, which encodes the URL and turns : into %3A and + into %2B. This should be fine, since the encoded URL should be equivalent to the original one. However, it turns out arXiv responds differently to the URL and to its encoded form. The experiments below were run directly in Chrome.

  1. http://export.arxiv.org/api/query?search_query=au%3Abalents_leon+AND%2Bcat%3Acond-mat.str-el: returns items.
  2. http://export.arxiv.org/api/query?search_query=au%3Abalents_leon%2BAND+cat:cond-mat.str-el: returns items.
  3. http://export.arxiv.org/api/query?search_query=au%3Abalents_leon%2BAND%2Bcat:cond-mat.str-el: returns only the Atom feed metadata, with no paper entries.

Note the subtle difference in the encoding: the arXiv API reacts unexpectedly only when both + characters are encoded.

refraction-ray commented 5 years ago

Some updates.

The workaround is to use a space instead of + when constructing search queries, i.e. sq="au:balents_leon AND+cat:cond-mat.str-el".

The origin of the issue is an ambiguity in URL encoding: + is sometimes treated as an encoded space, while %20 is also an encoding of a space. For the two + characters in the query URL, arXiv responds correctly only when at least one of them can be read as a space, i.e. appears as a raw + or %20 in the URL. If both are encoded as %2B, i.e. as a literal +, the response is incorrect.

Therefore, maybe we need to add a use case showing how to construct a multi-field search query correctly.
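To make the ambiguity concrete, here is a minimal sketch that decodes the three query strings from the experiments with Python's form-style decoder, urllib.parse.unquote_plus. It only shows how a form-style decoder reads them; how the arXiv server actually decodes its input is an assumption, not something confirmed in this thread.

```python
from urllib.parse import unquote_plus

# The search_query values from the three experiment URLs, in order.
queries = [
    "au%3Abalents_leon+AND%2Bcat%3Acond-mat.str-el",
    "au%3Abalents_leon%2BAND+cat:cond-mat.str-el",
    "au%3Abalents_leon%2BAND%2Bcat:cond-mat.str-el",
]

# unquote_plus reads a raw '+' as a space and '%2B' as a literal '+'.
decoded = [unquote_plus(q) for q in queries]

for q, d in zip(queries, decoded):
    print(f"{q!r} -> {d!r}")
```

Under this reading, the first two strings each contain one space, while the third contains none: both separators come out as literal + characters.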

lukasschwab commented 5 years ago

Thanks for pointing this out! I'm confident there's a nice solution here. I'll tackle it in a couple of weeks.

lukasschwab commented 5 years ago

Okay, I think the right solution here is a change in the documentation.

The arXiv documentation suggests a query string like au:balents_leon+AND+cat:cond-mat.str-el because it assumes the query string won't be passed through a URL-encoding function (this package currently uses urllib.parse.urlencode). urlencode() replaces + characters in the query string with '%2B', and replaces spaces with '+':

>>> from urllib.parse import urlencode
>>> urlencode({'q': 'au:balents_leon+AND+cat:cond-mat.str-el'})
'q=au%3Abalents_leon%2BAND%2Bcat%3Acond-mat.str-el'
>>> urlencode({'q': 'au:balents_leon AND cat:cond-mat.str-el'})
'q=au%3Abalents_leon+AND+cat%3Acond-mat.str-el'

urlencode expects unencoded fields, and I think this points at the correct behavior for a wrapper: this package should hide the URL encoding process as much as possible.

I'll update documentation to require space-delimited query strings.
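As a rough illustration of the space-delimited convention, the sketch below builds a full request URL from a query string written with spaces. The helper name build_query_url is hypothetical and is not part of this package; the base URL is the one used in the experiments above.

```python
from urllib.parse import urlencode

# Endpoint taken from the experiment URLs earlier in this thread.
BASE_URL = "http://export.arxiv.org/api/query"


def build_query_url(search_query: str) -> str:
    """Hypothetical helper: encode a space-delimited query into a URL.

    The caller writes spaces between terms; urlencode turns each space
    into '+' and each ':' into '%3A'.
    """
    return BASE_URL + "?" + urlencode({"search_query": search_query})


url = build_query_url("au:balents_leon AND cat:cond-mat.str-el")
print(url)
```

The caller never writes + or percent escapes by hand, which is the point of leaving the encoding to the wrapper.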

Thanks!


One note: in the Python 3 version of urllib, urlencode takes a quote_via argument (the quoting function to apply); we could pass it a wrapper of quote_plus that treats + as a safe character. This would cause some (surmountable) backward-compatibility issues and could cause weird behavior when + is used in other ways (e.g. if you want to find all papers with + in their title).
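A minimal sketch of that alternative, for comparison only: the wrapper name quote_plus_keep_plus is hypothetical, and this is not what the package ended up doing. It relies on urlencode's quote_via parameter, available in Python 3.5 and later.

```python
from urllib.parse import quote_plus, unquote_plus, urlencode


def quote_plus_keep_plus(string, safe="", encoding=None, errors=None):
    """Hypothetical wrapper: quote_plus, but leave '+' unescaped."""
    return quote_plus(string, safe=safe + "+", encoding=encoding, errors=errors)


# A '+'-delimited query now survives encoding with its '+' intact.
encoded = urlencode(
    {"search_query": "au:balents_leon+AND+cat:cond-mat.str-el"},
    quote_via=quote_plus_keep_plus,
)
print(encoded)

# The caveat: a literal '+' in a search term is no longer escaped, so a
# form-style decoder would read it back as a space.
title_encoded = urlencode({"search_query": "ti:C++"}, quote_via=quote_plus_keep_plus)
title_roundtrip = unquote_plus(title_encoded.split("=", 1)[1])
print(title_encoded, repr(title_roundtrip))
```

Passing safe="+" directly to urlencode should have the same effect without a wrapper. Either way, the second half of the sketch shows why literal + characters in search terms become ambiguous.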

lukasschwab commented 5 years ago

Added the new documentation: https://github.com/lukasschwab/arxiv.py/commit/0fc0eef3c7828d72834ba5adc52b86bc80a3122a

Closing!