inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

Single and double quoted search values #512

Closed Panos512 closed 6 years ago

Panos512 commented 8 years ago

On the current labs, when a search contains a double-quoted value, Elasticsearch tries to match that exact value against a record. Since Elasticsearch assigns a score to every result, an exact match of the value is sure to rank higher than any other result. For single quotes, Elasticsearch now performs a phrase match: it tries to match the exact phrase given as the search value within a record's data.

e.g. t: 'brown fox' will match 'A tale of a brown fox', whereas t: "brown fox" will only match 'brown fox'.
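To make the distinction concrete, here is a minimal sketch of how the two quote styles could be mapped onto Elasticsearch query DSL. It is not the inspire-next implementation: the function name `build_title_query` and the field names `title` (analyzed text) and `title.raw` (unanalyzed sub-field) are assumptions made for illustration. Only the `match`, `match_phrase` and `term` query types are standard Elasticsearch.

```python
import re

# Hypothetical field names: "title" is an analyzed text field and
# "title.raw" an unanalyzed sub-field. Neither is taken from the
# inspire-next mapping.
QUOTED = re.compile(r"""^\s*(?P<quote>['"])(?P<value>.*)(?P=quote)\s*$""")


def build_title_query(raw_value):
    """Map a (possibly quoted) search value to a query DSL dict."""
    match = QUOTED.match(raw_value)
    if match is None:
        # Unquoted: plain full-text match on the analyzed field.
        return {"match": {"title": raw_value.strip()}}
    value = match.group("value")
    if match.group("quote") == '"':
        # Double quotes: the whole field value must equal the input.
        return {"term": {"title.raw": value}}
    # Single quotes: the words must appear together and in order,
    # but the field may contain other words around them.
    return {"match_phrase": {"title": value}}
```

With this mapping, `build_title_query("'brown fox'")` yields a `match_phrase` query and `build_title_query('"brown fox"')` yields a `term` query on the unanalyzed field.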

Is this the functionality we want?

jmartinm commented 8 years ago

@kaplun As far as I know yes, that is the intended behaviour. This is the current behaviour on inspirehep.net (see https://inspirehep.net/help/search-guide?ln=en, section Searching for words versus phrases) and it would make sense to keep it.

salmele commented 8 years ago

I am unsure how average users understand searching by words versus searching by phrases. This would be important to know before committing to a behavior.

Can we check the log files of inspirehep.net to see how often users use 'single' and "double" quotes in their queries?

And can we check whether the action preceding a query with quotes was one without? And whether the action following a query with single or double quotes was one with double or single quotes (to see whether users are just trying to guess which quotes give which result)?

kaplun commented 8 years ago

I added some fresh food for thought on Asana (since it contains statistics from Apache logs): See: https://app.asana.com/0/18332606838148/65788758406938

bing13 commented 8 years ago

I did a quick scan, and the doubles vastly outnumber the singles (by almost a factor of 10?). I only saw one case where the user changed from single to double to single: 110__u:"Tokyo U., ICRR" .. 371__a:'Tokyo U., ICRR' .. 110__u:"Tokyo U., ICRR". It looks like the data includes repeats of the search when the user pages through the results, so it is not clear how to interpret a comparison.

kaplun commented 8 years ago

Good point. I will strip away jrec=* queries
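A rough sketch of that filtering step, assuming the search pattern is carried in a `p` URL parameter and that only follow-up pages carry `jrec`. Both are assumptions about the log format, and the function name `quote_usage` is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def quote_usage(request_urls):
    """Count quote styles in logged search URLs, skipping paging requests."""
    counts = {"single": 0, "double": 0, "none": 0}
    for url in request_urls:
        params = parse_qs(urlsplit(url).query)
        if "jrec" in params:
            # A jrec parameter means the user paged through results,
            # so the same query would otherwise be counted again.
            continue
        # Assumption: the search pattern lives in the "p" parameter.
        pattern = params.get("p", [""])[0]
        if '"' in pattern:
            counts["double"] += 1
        elif "'" in pattern:
            counts["single"] += 1
        else:
            counts["none"] += 1
    return counts
```

Note that a bare apostrophe inside a name would be counted as a single quote here, so the numbers from such a scan are only an upper bound for deliberate single-quote usage.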

salmele commented 8 years ago

So this seems to point rather to folks using doubles as big G does, rather than knowing the subtle Invenio differences between single and double, non?

As mentioned asynchronously, it would be good to know what % of queries use quotes at all (trying to exclude canned searches...).

kaplun commented 8 years ago

1.2% of the total searches I captured were using quotes (note that in principle canned searches are already well excluded).

The interesting bit is that ~40% of the queries are using the SPIRES syntax, i.e. "find/fin/f + something"
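For reference, a prefix check along these lines could be sketched as follows. The names `SPIRES_PREFIX`, `is_spires_query` and `spires_share` are hypothetical, and the check only looks at the leading keyword, so it says nothing about SPIRES-style queries that omit it.

```python
import re

# Matches a leading "find", "fin" or "f" keyword followed by whitespace
# and at least one more character (case-insensitive).
SPIRES_PREFIX = re.compile(r"^\s*(?:find|fin|f)\s+\S", re.IGNORECASE)


def is_spires_query(query):
    """True if the query starts with a SPIRES-style find keyword."""
    return SPIRES_PREFIX.match(query) is not None


def spires_share(queries):
    """Fraction of queries that start with a SPIRES keyword."""
    if not queries:
        return 0.0
    return sum(is_spires_query(q) for q in queries) / len(queries)
```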

kaplun commented 8 years ago

(unfortunately from my log, which is based on pure Apache logs, I can't easily factor out cataloguers, albeit these are only the unauthenticated users...)

bing13 commented 8 years ago

When I did calculations a few months ago I came out with ~45%, iirc. That excludes users who still issue SPIRES syntax searches without beginning with "find" or "f".

jmartinm commented 8 years ago

So this was in the end not really decided.

Just to add some more information about how other people solve this problem: Elasticsearch (when using its own search syntax) only supports double quotes and treats them as a phrase search.

I think that would be enough for users.

On the other hand, some admins and catalogers might benefit from the exact value query.

chris-asl commented 6 years ago

What we currently support is " for exact searches: e.g. keyword "ALEPH" will return only records with the exact keyword ALEPH (all caps). We also support ' for partial matching of phrases: for example, a query like keyword 'ALEPH' will also return records that have aleph or alephone.
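As a toy model of these two modes (not the actual query code), the behaviour can be pictured like this. Partial matching is assumed here to be a case-insensitive substring test, which is consistent with the aleph / alephone example but may not be how the real analyzer works. The `matches` helper and the sample keywords are made up for illustration.

```python
def matches(keyword, query_value, quote):
    """Toy model of the two quote modes for a single keyword value."""
    if quote == '"':
        # Exact: same characters, same case.
        return keyword == query_value
    if quote == "'":
        # Partial: assumed to be a case-insensitive substring test.
        return query_value.lower() in keyword.lower()
    raise ValueError("quote must be ' or \"")


keywords = ["ALEPH", "aleph", "alephone", "DELPHI"]
exact = [k for k in keywords if matches(k, "ALEPH", '"')]
partial = [k for k in keywords if matches(k, "ALEPH", "'")]
```

Here `exact` keeps only the all-caps ALEPH, while `partial` also picks up aleph and alephone.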