internetarchive / fatcat-scholar

search interface for scholarly works
https://scholar.archive.org

Better query parsing #7

Open bnewbold opened 3 years ago

bnewbold commented 3 years ago

A particular user request is to be able to paste a citation string into the search box and have "the right thing" happen in most cases. The current query parser (Elasticsearch's built-in) doesn't work well for this; it expects a structured query string (with boolean operators, etc.).

A great solution would be a custom query parser with perfect detection of user intent that "does the expected thing". In the meantime, more practically, we could try to differentiate between regular queries and citation string queries, and have two code paths. The query string path would be the current behavior. The citation string path would use, eg, GROBID and/or biblio-glutton to parse the raw citation into a structured citation, then try to do a fuzzy match against the live fatcat metadata index (generally faster than the scholar fulltext index), and if there is a hit, do an exact identifier lookup against scholar elasticsearch. The latter half of this code path would be similar to the current behavior for identifier lookups (eg, remove all filters and sort order).
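As a rough illustration of the two code paths, here is a minimal sketch. Everything in it is an assumption for discussion: the regex heuristics, the field names, the signal threshold, and the four callables (`parse_citation`, `fuzzy_match`, `lookup_by_ident`, `default_search`) are hypothetical stand-ins for GROBID/biblio-glutton parsing, the fatcat fuzzy match, the scholar identifier lookup, and the current query path. None of it is taken from the fatcat-scholar code.

```python
import re

# All patterns and thresholds are illustrative guesses, not tuned values.
# The field names are hypothetical examples of structured-query prefixes.
FIELD_PREFIX = re.compile(r"\b(?:journal|year|title|author):")
BOOLEAN_OP = re.compile(r"\b(?:AND|OR|NOT)\b")
YEAR = re.compile(r"\b(?:1[89]|20)\d{2}\b")
VOLUME_OR_PAGES = re.compile(r"\b\d+\s*\(\d+\)|\b\d+\s*-\s*\d+\b")
ET_AL = re.compile(r"\bet al\b", re.IGNORECASE)


def looks_like_citation(query: str) -> bool:
    """Guess whether a query is a pasted citation rather than a structured query."""
    q = query.strip()
    # Field prefixes or boolean operators suggest an intentional structured query.
    if FIELD_PREFIX.search(q) or BOOLEAN_OP.search(q):
        return False
    # Very short queries are treated as ordinary keyword searches.
    if len(q.split()) < 6:
        return False
    # Require at least two citation-like signals (year, volume/pages, "et al").
    signals = sum(bool(p.search(q)) for p in (YEAR, VOLUME_OR_PAGES, ET_AL))
    return signals >= 2


def search(query, *, parse_citation, fuzzy_match, lookup_by_ident, default_search):
    """Try the citation path first; fall back to the regular query path."""
    if looks_like_citation(query):
        structured = parse_citation(query)  # e.g. a citation parsing service
        ident = fuzzy_match(structured) if structured else None
        if ident is not None:
            # Exact identifier lookup, ignoring filters and sort order.
            return lookup_by_ident(ident)
    return default_search(query)
```

Falling back to `default_search` whenever parsing or matching fails keeps the current behavior as the worst case, so a misclassified query should not return worse results than it does today.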

bnewbold commented 3 years ago

Here is a Google Scholar blog post about detecting reference strings: https://scholar.googleblog.com/2016/01/quickly-lookup-references.html

The jargon-y term for this use case is "known item lookup"

bnewbold commented 3 years ago

An initial version of this has been implemented and is live. Testing and iteration probably needed.

bnewbold commented 3 years ago

Some user queries are getting re-written poorly with the current system:

"journal:" Post Communist Economies "year:" 2021
"Title:" A multi-speed fiscal "Europe?" Fiscal rules and fiscal performance in the EU former communist countries. It appears to be online content from Post Communist Economies 31Jan 2021. "Link:" "https://www-tandfonline-com.libproxy-imf.imf.org/doi/full/10.1080/14631377.2020.1867432"

The original query was probably:

journal: Post Communist Economies year: 2021

Some of this may be due to copy/paste from other sources? Eg, an email or multi-line record on a website.

For one thing, we probably shouldn't return the re-written (quoted) query; we should return the original query string (in the search box). Any time we rewrite or modify the query, though, we should indicate that it happened, and link to the query documentation.
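One way to keep the original and re-written strings apart might look like the sketch below. The `PreparedQuery` type, the helper names, and the notice wording are hypothetical, and the `rewrite` callable stands in for whatever rewriting the search backend currently does.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PreparedQuery:
    original: str  # exactly what the user typed or pasted
    executed: str  # what is actually sent to the search backend

    @property
    def was_rewritten(self) -> bool:
        return self.executed != self.original


def prepare_query(raw: str, rewrite: Callable[[str], str]) -> PreparedQuery:
    """Apply the rewrite step, but keep the raw string alongside the result."""
    return PreparedQuery(original=raw, executed=rewrite(raw))


def search_box_value(prepared: PreparedQuery) -> str:
    """The search box always shows the user's own text, never the rewrite."""
    return prepared.original


def rewrite_notice(prepared: PreparedQuery, docs_url: str) -> Optional[str]:
    """Return a short notice (with a docs link) only when a rewrite happened."""
    if not prepared.was_rewritten:
        return None
    return f"Your query was modified before searching. See {docs_url} for query syntax."
```

With this shape, the template would render `search_box_value(...)` into the input field and show the notice only when `rewrite_notice(...)` returns a string.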

Other possible improvements or workarounds are to have an "advanced search" page, or to have separate search boxes/options for different types of query. I'd like to try a little more to stick with the "one simple box" experience though.