lukasschwab commented 5 years ago

Atomic conditions

condition(field: "all"|"au"|..., value: string):

condition("au", "Balents Leon") → "au:\"Balents Leon\""
condition("au", "balents_leon") → "au:balents_leon"
condition("cat", "cond-mat.str-el") → "cat:cond-mat.str-el"
Open question: how to enumerate the available fields, and their values when those are enumerable.

prefix	explanation
ti	Title
au	Author
abs	Abstract
co	Comment
jr	Journal Reference
cat	Subject Category
rn	Report Number
id	Id (use id_list instead)
all	All of the above
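A minimal sketch of how condition could behave, inferred only from the three examples above. The quoting rule (wrap the value in double quotes when it contains whitespace) and the FIELDS set used for validation are assumptions, not settled design; escaping of quotes inside values is not handled.

```python
import re

# Field prefixes copied from the table above.
FIELDS = {"ti", "au", "abs", "co", "jr", "cat", "rn", "id", "all"}


def condition(field: str, value: str) -> str:
    """Build one atomic condition such as 'au:balents_leon'."""
    if field not in FIELDS:
        raise ValueError(f"unknown field prefix: {field!r}")
    # Assumption: a value containing whitespace is treated as a phrase
    # and wrapped in double quotes; anything else is emitted bare.
    if re.search(r"\s", value):
        return f'{field}:"{value}"'
    return f"{field}:{value}"


print(condition("au", "Balents Leon"))
print(condition("cat", "cond-mat.str-el"))
```

Rejecting unknown prefixes early is one possible answer to the enumeration question: a typo in a field name fails at construction time instead of producing a query the API silently misreads.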

Boolean assembly

These correspond to the three Boolean operators supported by the arXiv API.

and(cond1, cond2) → "$(cond1) AND $(cond2)"

or(cond1, cond2) → "$(cond1) OR $(cond2)"

andnot(cond1, cond2) → "$(cond1) ANDNOT $(cond2)"

Grouping

group(cond) → "($(cond))"
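The four helpers above could be sketched as plain string functions. Since and and or are reserved words in Python, the names and_ and or_ below are stand-ins chosen for this sketch, not proposed API names.

```python
def and_(cond1: str, cond2: str) -> str:
    """Join two conditions with AND."""
    return f"{cond1} AND {cond2}"


def or_(cond1: str, cond2: str) -> str:
    """Join two conditions with OR."""
    return f"{cond1} OR {cond2}"


def andnot(cond1: str, cond2: str) -> str:
    """Keep matches of cond1 that do not match cond2."""
    return f"{cond1} ANDNOT {cond2}"


def group(cond: str) -> str:
    """Wrap a condition in parentheses so it is evaluated as a unit."""
    return f"({cond})"


# Without group(), the OR below would not be visibly bound together.
query = andnot(
    group(or_("au:balents_leon", "ti:checkerboard")),
    "cat:cond-mat.str-el",
)
print(query)
```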

lukasschwab commented 3 years ago

An example of some more advanced query construction: https://github.com/lukasschwab/arxiv.py/issues/83#issuecomment-907967099

Open question: how to enumerate the available fields, and their values when those are enumerable.

Enums should be useful here, e.g. a (Query).Attribute enum: Attribute.Title, Attribute.Author, and so on.
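One way the Attribute enum might look, using the prefixes from the table in the first comment. Only Attribute.Title and Attribute.Author are named in this thread; the remaining member names are guesses for illustration, and condition here is a hypothetical helper.

```python
from enum import Enum


class Attribute(Enum):
    """Searchable field prefixes; member names are illustrative."""

    Title = "ti"
    Author = "au"
    Abstract = "abs"
    Comment = "co"
    JournalReference = "jr"
    SubjectCategory = "cat"
    ReportNumber = "rn"
    Id = "id"
    All = "all"


def condition(attribute: Attribute, value: str) -> str:
    # Taking the enum instead of a raw string means an unknown field
    # is an AttributeError at the call site, not a malformed query.
    return f"{attribute.value}:{value}"


print(condition(Attribute.Author, "balents_leon"))
```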

The atomic conditions (the arguments to and, or, andnot) are themselves valid queries, which means they could instead be exposed as methods on the Query class:

cond1.and(cond2)
cond1.or(cond2)
cond1.andnot(cond2).andnot(cond3): I think this chaining is more literate than andnot(andnot(cond1, cond2), cond3).
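A rough sketch of the method-chaining idea. Python will not parse cond1.and(cond2) because and and or are keywords, so this sketch assumes and_ and or_ as method names; the Query class shown is hypothetical and does no grouping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Immutable wrapper around a query string (hypothetical class)."""

    text: str

    def and_(self, other: "Query") -> "Query":
        return Query(f"{self.text} AND {other.text}")

    def or_(self, other: "Query") -> "Query":
        return Query(f"{self.text} OR {other.text}")

    def andnot(self, other: "Query") -> "Query":
        return Query(f"{self.text} ANDNOT {other.text}")

    def __str__(self) -> str:
        return self.text


cond1 = Query("au:balents_leon")
cond2 = Query("ti:checkerboard")
cond3 = Query("cat:cond-mat.str-el")

# Each call returns a new Query, so calls chain left to right.
chained = cond1.andnot(cond2).andnot(cond3)
print(chained)
```

Because every method returns a new Query, an atomic condition and a compound query share one type, which is what makes the chaining possible.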

It'd be nice to have arXiv as a source of truth for an enum of categories... but I might just have to accept the risk that the categories will change, and that new categories will have to be integrated here as patch releases. This is a good reason not to transform categories on Results into the enum type: new categories may not be explicitly queryable, but they should not break processing results.
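The trade-off could look roughly like this: queries take a Category enum, while categories on results stay plain strings, so an unrecognized value is never an error. The Category class, its two members, and both helper names are illustrative assumptions; a real enum would need the full taxonomy.

```python
from enum import Enum


class Category(Enum):
    """Tiny illustrative subset; not the full arXiv taxonomy."""

    COND_MAT_STR_EL = "cond-mat.str-el"
    CS_AI = "cs.AI"


def category_condition(category: Category) -> str:
    """Query side: only categories known to the enum can be requested."""
    return f"cat:{category.value}"


def is_known_category(raw: str) -> bool:
    """Result side: raw strings are kept as-is; this only reports
    whether the enum already knows the value."""
    try:
        Category(raw)
    except ValueError:
        return False
    return True


print(category_condition(Category.CS_AI))
print(is_known_category("cs.AI"))
```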

Implementation detail: can build the string as we go along, or can assemble a tree which gets traversed to build the string.
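A sketch of the tree option, assuming two hypothetical node types (Cond for an atomic condition, Op for a Boolean operator) and a recursive render function that walks the tree to produce the string.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Cond:
    """Leaf node: one atomic condition."""

    field: str
    value: str


@dataclass(frozen=True)
class Op:
    """Inner node: a Boolean operator applied to two subtrees."""

    operator: str  # "AND", "OR", or "ANDNOT"
    left: "Node"
    right: "Node"


Node = Union[Cond, Op]


def render(node: Node) -> str:
    """Depth-first traversal that turns the tree into a query string."""
    if isinstance(node, Cond):
        return f"{node.field}:{node.value}"
    # Each operator node is parenthesized, so nesting stays explicit.
    return f"({render(node.left)} {node.operator} {render(node.right)})"


tree = Op(
    "ANDNOT",
    Op("AND", Cond("au", "balents_leon"), Cond("ti", "checkerboard")),
    Cond("cat", "cond-mat.str-el"),
)
print(render(tree))
```

Keeping the tree around leaves room for later validation or simplification passes; building the string as we go is simpler but discards that structure.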

We can use excessive grouping to convert queries to strings (so a.or(b) can yield (a) OR (b)). The group function expressed above may be unnecessary; group(a, b) may just be equivalent to a.or(b).
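Excessive grouping could be sketched as follows: every operand is parenthesized unconditionally, so callers never need an explicit group. This assumes the arXiv API tolerates redundant parentheses around a single term, which the sketch does not verify; and_ and or_ are stand-in names.

```python
def or_(a: str, b: str) -> str:
    """OR with both operands always parenthesized."""
    return f"({a}) OR ({b})"


def and_(a: str, b: str) -> str:
    """AND with both operands always parenthesized."""
    return f"({a}) AND ({b})"


# The inner OR is wrapped automatically when it becomes an operand.
nested = and_(or_("au:balents_leon", "ti:checkerboard"), "cat:cond-mat.str-el")
print(nested)
```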

Good opportunity to define an interface and then write tests.

romazu commented 2 weeks ago

Hey, guys!

I just published the arxivql package (repo), which helps with building arXiv search queries. It supports all field filters and all logical operators (AND, OR, ANDNOT), and combines them in a Pythonic way. It also conveniently includes the full arXiv category taxonomy for constructing category filters.

Here's a quick example:

from arxivql import Query as Q, Taxonomy as T

query = Q.author("Ilya Sutskever") & Q.title("autoencoders") & ~Q.category(T.cs.AI)

# The query above generates:
# ((au:"Ilya Sutskever" AND ti:autoencoders) ANDNOT cat:cs.AI)

The package is currently standalone (pip install arxivql) and works great with this client. I'm open to discussing whether it makes sense to merge it into the main arxiv package, if there is such interest, or keep it separate. Let me know what you think!

PS: The core functionality is pretty robust; I've used it in production for a couple of years. There may still be some minor problems, since I polished it for publishing and made constructor semantics more consistent across the field filters.

lukasschwab commented 2 weeks ago

@romazu this looks really good at first glance! I'll take a look.

At the very least, I'll probably add a README reference to your work and start directing people to it (query construction is a common GitHub issue question).

lukasschwab / arxiv.py

Query string helpers #30
