inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

Rawref/publication note search support #673

Open jalavik opened 8 years ago

jalavik commented 8 years ago

Being able to search directly for Phys.Lett. B716 (2012) 1-29 would be nice. Right now this returns 0 results from the demo-records. It should return 1.

It can be implemented by adding a new indexing enhancer that makes the indexed record contain the aggregated information from publication_info in a new key, like rawref.

Check how this information is already generated in the templates: https://github.com/inspirehep/inspire-next/blob/master/inspirehep/base/templates/format/record/Inspire_Default_HTML_brief_macros.tpl#L35-L54
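A minimal sketch of such an enhancer, assuming publication_info entries carry the usual journal/volume/year/pages keys (all names here are hypothetical and should be checked against the schema and the template above):

```python
def add_rawref(record):
    """Hypothetical indexing enhancer: aggregate each publication_info
    entry into a flat pubnote string under a new 'rawref' key.

    Key names mirror the usual publication_info layout but are
    assumptions; verify them against the actual record schema.
    """
    rawrefs = []
    for info in record.get('publication_info', []):
        parts = [
            info.get('journal_title'),
            info.get('journal_volume'),
            '({0})'.format(info['year']) if info.get('year') else None,
            info.get('pages'),
        ]
        pubnote = ' '.join(str(part) for part in parts if part)
        if pubnote:
            rawrefs.append(pubnote)
    if rawrefs:
        record['rawref'] = rawrefs
    return record

# add_rawref({'publication_info': [{'journal_title': 'Phys.Lett.',
#                                   'journal_volume': 'B716',
#                                   'year': 2012, 'pages': '1-29'}]})
# -> record gains rawref == ['Phys.Lett. B716 (2012) 1-29']
```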

@jmartinm @kaplun

kaplun commented 8 years ago

:+1: This is indeed a quick win, as you propose it.

The wider approach though (which would consist of passing the query to refextract/grobid and having it parsed into its components at query time) would be amazing :dancer:
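For illustration, refextract already ships a journal-reference parser that could be called at query time; a sketch (the exact return keys should be verified against the refextract API):

```python
from refextract import extract_journal_reference

# Parse a pasted pubnote into its components at query time.
parsed = extract_journal_reference('Phys.Lett. B716 (2012) 1-29')
# Expected to yield something like (keys to be verified):
# {'title': 'Phys. Lett. B', 'volume': '716', 'year': '2012', 'page': '1-29'}
```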

aw-bib commented 8 years ago

The main problem, in my experience, with the aforementioned query is that some part of it is "not quite correct". So passing it to some parser is not of too much help. It may extract 2012 properly as a year, but the year itself may simply be wrong. Or you only have the start page but not the end, or you know the volume but not the year.

So, I was always wondering if the following is reasonable / possible.

Phys.Lett. B716 (2012) 1-29

would match a substring of the journal name thanks to at least the Phys.Lett. part. The rest of the query then consists basically of values that can show up in only a small set of sensible fields like volume, year, and pages.

Now I wonder if the ranking algorithm couldn't see the signal "journal title matches" and derive a boost for records that match those numbers in precisely these fields, in whatever order. Say journal title, year and page all match: that record is more likely the proper one than if only journal title and year match, which in turn is weaker evidence than a match on journal title and volume (just for the sake of the argument, assuming that the volume is more often correct than the year).

Also, if I get the journal title plus some number that doesn't match any of the "citation" fields, it should not get boosted (e.g. a match coming from a reference or a conference year). This boosting would probably also reflect that these queries are usually "almost known item" searches (see the sketch at the end of this comment).

So to the uninitiated like me this sounds like the domain of relevance ranking, especially if a word from the title, an author or the like enters the game, which is not uncommon. Also, in my experience Google and friends usually perform quite well on those queries while databases usually perform badly (though they usually have the better content).
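Coming back to the boosting idea above, a sketch of what it could look like as an Elasticsearch query (field names and boost values are purely illustrative):

```python
# Require the journal title; each loose number that lands in a pubnote
# field adds a boost, with weights encoding e.g. "volume is more often
# correct than year". All field names and weights are made up here.
query = {
    "bool": {
        "must": [
            {"match": {"publication_info.journal_title": "Phys.Lett."}},
        ],
        "should": [
            {"term": {"publication_info.journal_volume": {"value": "B716", "boost": 2.0}}},
            {"term": {"publication_info.year": {"value": "2012", "boost": 1.5}}},
            {"term": {"publication_info.page_start": {"value": "1", "boost": 1.0}}},
        ],
    }
}
```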

kaplun commented 8 years ago

That's why on legacy we employ the same algorithm we use in reference extraction to try to parse the query. If a journal title, a year, an issue, a volume and a page (or range) are identified, they are checked against the publication info of existing records. In this sense we are better than Google, because Google will also match all the records having this pubnote in their reference list.

But indeed see also: https://github.com/inspirehep/inspire-next/pull/641#issuecomment-165093918

We should indeed try to build the perfect query that exploits ranking to bring to the top the record that happens to best match a potential pubnote...
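Roughly, that perfect query could chain the two steps: parse the pubnote, then match the components only against the records' own publication_info, never their reference lists (which is where Google loses). A sketch, with hypothetical return keys and index field names:

```python
from refextract import extract_journal_reference

parsed = extract_journal_reference('Phys.Lett. B716 (2012) 1-29')

# Match only against the record's own pubnote; the keys of `parsed`
# and the index field names are assumptions here.
query = {
    "bool": {
        "must": [
            {"match": {"publication_info.journal_title": parsed["title"]}},
        ],
        "should": [
            {"term": {"publication_info.journal_volume": parsed["volume"]}},
            {"term": {"publication_info.year": parsed["year"]}},
        ],
    }
}
```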

kaplun commented 8 years ago

Note also that users reported that they basically copy & paste references from PDFs and hope to have them resolve to papers in INSPIRE. So the pasted query is typically as good as a typical reference. In this context the refextract/Grobid algorithms should provide the best parsing out of the box.
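refextract also has a string-level entry point for whole pasted references, which would fit this copy & paste use case; a sketch (the output shape should be verified against the refextract API):

```python
from refextract import extract_references_from_string

pasted = '[1] G. Aad et al., Phys.Lett. B716 (2012) 1-29.'
references = extract_references_from_string(pasted)
# Each recognized reference should come back split into its components
# (journal, volume, year, pages, authors, ...), ready to feed into a
# pubnote query like the ones sketched above.
```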

aw-bib commented 8 years ago

> Note also that users reported that they basically copy & paste references from PDFs and hope to have them resolve to papers in INSPIRE.

I see especially this use case, of course. This is where, in my experience, bibliographic databases struggle, even though it is one of the most common ones.

It is surprising, OTOH, how often those references, even copy & pasted, are wrong. Slight typos, e.g. a number missing from the reference, two digits transposed, or, even this can happen, a typo in the curated bibliographic record. And since the copy & paste output depends on the bibliography style, the input is also "unsharp by definition".

That's why I call it an "almost known item" search. My feeling is that most failures to give results stem from treating it as a "known item" search.

Thus I wonder if, in real life, a slightly unsharp match with proper ranking wouldn't end up giving better results. It seems related to the question of "how likely is it that any n numbers occur in the reference in arbitrary positions, if another independent criterion (say the journal name) already strongly restricts the possible values for those numbers".

I.e., is it really important for the search to match the values in the reference to their semantic meaning, or wouldn't just "the numbers show up in 773__ subfields if the journal name matches" together with proper boosting work better? Especially given that some value might be wrong and only 2 out of 3 match anyway.
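Elasticsearch can express exactly this "2 out of 3" tolerance via minimum_should_match; a sketch with illustrative field names:

```python
# Require the journal name, accept any 2 of the 3 numeric clues landing
# in pubnote fields, i.e. tolerate one wrong value in the pasted query.
query = {
    "bool": {
        "must": [
            {"match": {"publication_info.journal_title": "Phys.Lett."}},
        ],
        "should": [
            {"term": {"publication_info.journal_volume": "B716"}},
            {"term": {"publication_info.year": "2012"}},
            {"term": {"publication_info.page_start": "1"}},
        ],
        "minimum_should_match": 2,
    }
}
```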

BTW: the approach of extracting semantic meaning and then going for an exact match is used in all those OpenURL resolvers. It doesn't really work that well (unless you just add the DOI to your OpenURL, of course ;)