lukasschwab / arxiv.py

Python wrapper for the arXiv API
MIT License
Missing documentation of expected compound-query encoding #90

Closed zlatko-minev closed 2 years ago

zlatko-minev commented 2 years ago

Motivation

I need to run advanced arXiv queries such as ?search_query=au:del_maestro+AND+ti:checkerboard

The problem is that urlencode encodes certain key characters, such as the colon. @IceKhan13

This is so we can use compound queries.

Solution

A quick-and-dirty patch. WARNING: not backward compatible.

import arxiv
from urllib.parse import urlencode

class ClientZ(arxiv.Client):
    def _format_url(self, search: arxiv.Search, start: int, page_size: int) -> str:
        """
        Construct a request API for search that returns up to `page_size`
        results starting with the result at index `start`.

        PATCH: so that we can do Boolean expression.
        """
        url_args = search._url_args()
        url_args.update({
            "start": start,
            "max_results": page_size,
        })
        # return self.query_url_format.format(urlencode(url_args)) # REPLACED THIS
        search_query = url_args.pop('search_query')  # Pop out and treat separate
        text = f"search_query={search_query}&" + urlencode(url_args) # recombine
        return self.query_url_format.format(text)
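
A possible variation on the patch above, offered only as a sketch and not as anything arxiv.py does: the standard library's urlencode accepts a `safe` argument, so colons and plusses can be left untouched while everything else is still encoded. The helper name `format_query_string` is hypothetical. Note the trade-off: with this approach a literal plus sign in an unencoded query can no longer be told apart from an encoded space.

```python
from urllib.parse import urlencode


def format_query_string(url_args: dict) -> str:
    """Hypothetical helper: encode URL arguments but leave ':' and '+' as-is."""
    return urlencode(url_args, safe=":+")


# A pre-encoded query passes through unchanged...
pre_encoded = format_query_string({
    "search_query": "au:del_maestro+AND+ti:checkerboard",
    "start": 0,
    "max_results": 100,
})

# ...and an unencoded query (spaces) ends up as the same string.
unencoded = format_query_string({
    "search_query": "au:del_maestro AND ti:checkerboard",
    "start": 0,
    "max_results": 100,
})

print(pre_encoded)
print(unencoded)
```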

lukasschwab commented 2 years ago

Hi Zlatko, thanks for taking the time to open an issue! If I understand correctly, the issue here is underdocumented usage.

In the existing code, (Client)._format_url(...) assumes the query string is unencoded. The example expression au:del_maestro+AND+ti:checkerboard is already partially URL-encoded (plusses for spaces), so it gets double-encoded. The encoded colons aren't the issue; the problem is that each + is encoded as %2B, whereas a space would have been encoded as +.
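
The per-character behaviour can be checked in isolation with the standard library. This is a minimal sketch that assumes the client relies on urlencode's default quoting (quote_plus), as the URLs in this thread suggest:

```python
from urllib.parse import quote_plus

# Encode each character of interest on its own.
encoded = {ch: quote_plus(ch) for ch in (":", "+", " ")}

for ch, enc in encoded.items():
    print(repr(ch), "->", enc)
```

A colon becomes %3A either way, so only the plus-versus-space distinction changes the meaning of the final URL.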

Unencoded compound queries (with spaces rather than plusses) work:

>>> import arxiv
>>> c = arxiv.Client()
>>>
>>> # A pre-encoded query yields a double-encoded query URL.
>>> c._format_url(arxiv.Search(query="au:del_maestro+AND+ti:checkerboard"), 0, 100)
'http://export.arxiv.org/api/query?search_query=au%3Adel_maestro%2BAND%2Bti%3Acheckerboard&id_list=&sortBy=relevance&sortOrder=descending&start=0&max_results=100'
>>>
>>> # An unencoded query yields the expected query URL.
>>> search = arxiv.Search(query="au:del_maestro AND ti:checkerboard")
>>> c._format_url(search, 0, 100)
'http://export.arxiv.org/api/query?search_query=au%3Adel_maestro+AND+ti%3Acheckerboard&id_list=&sortBy=relevance&sortOrder=descending&start=0&max_results=100'
>>> # Search results include the expected article.
>>> next(c.results(search))
arxiv.Result(entry_id='http://arxiv.org/abs/cond-mat/0603029v1', updated=datetime.datetime(2006, 3, 2, 2, 22, 45, tzinfo=datetime.timezone.utc), published=datetime.datetime(2006, 3, 2, 2, 22, 45, tzinfo=datetime.timezone.utc), title='From stripe to checkerboard order on the square lattice in the presence of quenched disorder', authors=[arxiv.Result.Author('Adrian Del Maestro'), arxiv.Result.Author('Bernd Rosenow'), arxiv.Result.Author('Subir Sachdev')], summary='We discuss the effects of quenched disorder on a model of charge density wave\n(CDW) ordering on the square lattice. Our model may be applicable to the\ncuprate superconductors, where a random electrostatic potential exists in the\nCuO2 planes as a result of the presence of charged dopants. We argue that the\npresence of a random potential can affect the unidirectionality of the CDW\norder, characterized by an Ising order parameter. Coupling to a unidirectional\nCDW, the random potential can lead to the formation of domains with 90 degree\nrelative orientation, thus tending to restore the rotational symmetry of the\nunderlying lattice. We find that the correlation length of the Ising order can\nbe significantly larger than the CDW correlation length. For a checkerboard CDW\non the other hand, disorder generates spatial anisotropies on short length\nscales and thus some degree of unidirectionality. We quantify these disorder\neffects and suggest new techniques for analyzing the local density of states\n(LDOS) data measured in scanning tunneling microscopy experiments.', comment='10 pages, 11 figures; added reference', journal_ref='Phys. Rev. B 74, 024520 (2006)', doi='10.1103/PhysRevB.74.024520', primary_category='cond-mat.str-el', categories=['cond-mat.str-el', 'cond-mat.supr-con'], links=[arxiv.Result.Link('http://dx.doi.org/10.1103/PhysRevB.74.024520', title='doi', rel='related', content_type=None), arxiv.Result.Link('http://arxiv.org/abs/cond-mat/0603029v1', title=None, rel='alternate', content_type=None), arxiv.Result.Link('http://arxiv.org/pdf/cond-mat/0603029v1', title='pdf', rel='related', content_type=None)])
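
If a query string is already in its pre-encoded form (for example, copied from an arXiv API URL), one option is to decode it before handing it to arxiv.Search. The sketch below uses only the standard library; `decode_query` is a hypothetical helper, not part of arxiv.py, and it should only be applied to strings known to be pre-encoded, since it would also rewrite a literal + or % in an unencoded query.

```python
from urllib.parse import unquote_plus, urlencode


def decode_query(query: str) -> str:
    """Hypothetical helper: turn a pre-encoded query back into plain text."""
    return unquote_plus(query)


decoded = decode_query("au:del_maestro+AND+ti:checkerboard")
print(decoded)

# Re-encoding the decoded query gives the single-encoded form.
reencoded = urlencode({"search_query": decoded})
print(reencoded)
```

The decoded string can then be passed as the `query` argument in the same way as the unencoded example above.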

I'll leave this issue open and push some improved documentation.

lukasschwab commented 2 years ago

Updated docs are live: http://lukasschwab.me/arxiv.py/index.html#Search.query