Develop a better method for searching on DOIs and email addresses

qjhart commented 1 year ago

Neither DOI searches nor email searches are pleasant in the current search. If these were keywords instead of text, would that provide a better experience? If that's the case, how would we make them both case insensitive, and what about the DOI: prefix that's often used on DOIs?

2023-12-11

[ ] Remove text based DOI searches

The current search includes some text-based searches on DOIs. Originally we thought this could help find similar matches, but the results are confusing. We need

Expected behavior: DOIs only match the records with that exact DOI. However, multiple formats are acceptable.
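A minimal sketch of that expected behavior, assuming the accepted formats are the bare DOI, a `doi:` prefix, and a doi.org URL (the formats tried elsewhere in this thread). The names `normalize_doi` and `same_doi` are hypothetical and are not part of the Aggie Experts code. The sketch lowercases everything because DOIs are compared case-insensitively.

```python
import re

# Hypothetical helper: reduce the accepted DOI spellings to one lowercase key.
# Handles "https://doi.org/...", "http://dx.doi.org/...", "doi:..." and "doi:/...".
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:/?)", re.IGNORECASE)


def normalize_doi(text: str) -> str:
    """Strip a doi.org URL or doi: prefix and lowercase the remainder."""
    return _DOI_PREFIX.sub("", text.strip()).lower()


def same_doi(a: str, b: str) -> bool:
    """Exact match on the normalized form; no partial or fuzzy matching."""
    return normalize_doi(a) == normalize_doi(b)
```

With this, `https://doi.org/10.1016/j.compag.2018.09.042` and `DOI:10.1016/J.COMPAG.2018.09.042` compare equal, while a DOI that differs by a single character does not.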

QA / QC: Will assign to VE to review when completed.

Vensberg commented 11 months ago

Also, sandbox has our better keyword matches:

https://sandbox.experts.library.ucdavis.edu/search/qjhart@ucdavis.edu
https://sandbox.experts.library.ucdavis.edu/search/mailto:qjhart@ucdavis.edu

or

https://sandbox.experts.library.ucdavis.edu/search/https://doi.org/10.1016/j.compag.2018.09.042
https://sandbox.experts.library.ucdavis.edu/search/10.1016/j.compag.2018.09.042

Vensberg commented 11 months ago

Email works perfectly. I need to understand the DOI results: additional people are returned for some searches. Letters in the last part of the DOI seem to throw the search off. If the DOI is all numbers, the top result is right, and there are few others.

qjhart commented 11 months ago

@Vensberg the problem with the DOIs is a function of the tokenizer, not the index. We are using the standard uax_url_email tokenizer, which is the recommended choice. However, it tokenizes too aggressively. Here is a good test for the Elasticsearch console:

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "doi:10.000/foobar doi:/10.000/foobar 10.000/foobar http://doi.org/10.000/fubar mailto:qjhart@ucdavis.edu qjhart@ucdavis.edu https://orcid.org/0000-0001-9829-8914 orcid:0000-0001-9829-8914"
}

Only the URLs maintain their single token. The others are split on /, -, etc. The tokenizer we use is based on Unicode text segmentation (UAX #29) and is quite sophisticated, but since most queries will be in English, maybe we can use something different. We will have to make it ourselves, though.

There are two paths forward. One, we can be less aggressive about what we tokenize, for example splitting only on whitespace and some punctuation. This is simple, but we might not get the best text searches, and not know why.
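To make the trade-off of this first path concrete, here is a toy comparison in Python. It does not reproduce Elasticsearch's `uax_url_email` behavior; `punctuation_tokens` is a hypothetical stand-in for any tokenizer that also breaks on `/`, `:`, `-`, and `@`.

```python
import re


def whitespace_tokens(s: str) -> list[str]:
    """Split on whitespace only, so identifiers stay in one piece."""
    return s.split()


def punctuation_tokens(s: str) -> list[str]:
    """Toy stand-in for an aggressive tokenizer: also break on / : - and @."""
    return [t for t in re.split(r"[\s/:@-]+", s) if t]


identifiers = "doi:10.3390/ijgi3030929 qjhart@ucdavis.edu orcid:0000-0001-9829-8914"
print(whitespace_tokens(identifiers))   # three tokens, one per identifier
print(punctuation_tokens(identifiers))  # each identifier broken into fragments

# The cost for ordinary text: punctuation stays glued to words.
print(whitespace_tokens("Davis, California"))
```

Whitespace splitting keeps each identifier whole, but a word like `Davis,` keeps its trailing comma, which is the kind of quiet text-search degradation this path risks.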

The other is to write rules for all the expected identifiers. In this case we'd convert them all to a URL equivalent, tokenize, and then convert to the standard form. This would allow us to use the standard tokenizer. It's more work, but we'd have a better idea of what is going on.
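A rough sketch of what the first step of this second path could look like, written in Python for illustration only. The rule table, the regular expressions, and the name `to_url_form` are all assumptions; the actual rules (and the later conversion back to a standard form) are not specified in this thread and are not shown here.

```python
import re

# Hypothetical rule table: (pattern, URL template). Rules are tried in order.
RULES = [
    # doi:10.x/y, doi:/10.x/y, or a bare 10.x/y
    (re.compile(r"^(?:doi:/?)?(10\.\d+/\S+)$", re.IGNORECASE), "https://doi.org/{}"),
    # orcid:0000-0000-0000-0000 or the bare 16-character form
    (re.compile(r"^(?:orcid:)?(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$", re.IGNORECASE), "https://orcid.org/{}"),
    # mailto:user@host or a bare user@host
    (re.compile(r"^(?:mailto:)?([^\s@/]+@[^\s@/]+)$", re.IGNORECASE), "mailto:{}"),
]


def to_url_form(token: str) -> str:
    """Rewrite a recognized identifier as a URL so a URL-aware tokenizer keeps it whole."""
    for pattern, template in RULES:
        match = pattern.match(token)
        if match:
            # Case is folded here purely so that matching is case insensitive.
            return template.format(match.group(1).lower())
    return token  # not a known identifier; leave it for normal text analysis
```

Under these assumed rules, `doi:10.000/foobar`, `doi:/10.000/foobar`, and `10.000/foobar` all become `https://doi.org/10.000/foobar`, while a plain word such as `merz` passes through unchanged.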

Vensberg commented 11 months ago

I've been getting better results with the DOI weight at 20. I like the proposal for converting to URL equivalents, but depending on the effort that requires, that may have to wait for 2.1.

qjhart commented 11 months ago

I have updated this so that DOIs, ORCIDs, ARKs, and URNs are all now single tokens, as in these searches:

emails

https://sandbox.experts.library.ucdavis.edu/search/qjhart@ucdavis.edu
https://sandbox.experts.library.ucdavis.edu/search/mailto:qjhart@ucdavis.edu
https://sandbox.experts.library.ucdavis.edu/search/ucdavis.edu

DOIs

https://sandbox.experts.library.ucdavis.edu/search/https://doi.org/10.3390/ijgi3030929
https://sandbox.experts.library.ucdavis.edu/search/doi:10.3390/ijgi3030929
https://sandbox.experts.library.ucdavis.edu/search/10.3390/ijgi3030929

Combining multiple keywords doesn't work yet, and I'm not sure why,

https://sandbox.experts.library.ucdavis.edu/search/10.3390/ijgi3030929%20%2010.1021/ef400660u

but adding a text search does.

https://sandbox.experts.library.ucdavis.edu/search/10.3390/ijgi3030929%20merz
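For anyone reproducing these two queries, the following Python sketch shows what each search path contains once it is URL-decoded. It does not diagnose the multi-keyword failure, whose cause is unknown in this thread; `split_query` and the DOI pattern are hypothetical helpers for illustration.

```python
import re
from urllib.parse import unquote

# Assumed shape of a bare DOI: "10.", a numeric registrant code, "/", then a suffix.
DOI_RE = re.compile(r"^10\.\d+/\S+$")


def split_query(encoded: str) -> tuple[list[str], list[str]]:
    """Return (DOI terms, free-text terms) from a URL-encoded search path."""
    terms = unquote(encoded).split()  # split() also collapses the doubled space
    dois = [t for t in terms if DOI_RE.match(t)]
    words = [t for t in terms if not DOI_RE.match(t)]
    return dois, words


print(split_query("10.3390/ijgi3030929%20%2010.1021/ef400660u"))
print(split_query("10.3390/ijgi3030929%20merz"))
```

The first query decodes to two DOI terms and no free text; the second decodes to one DOI term plus the word `merz`.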

I'm on the fence about how partial DOIs should match. In some sense a partial DOI matches a publisher, but is that really helpful?

https://sandbox.experts.library.ucdavis.edu/search/doi:10.3390

ORCID

https://sandbox.experts.library.ucdavis.edu/search/0000-0001-9829-8914
https://sandbox.experts.library.ucdavis.edu/search/https://orcid.org/0000-0001-9829-8914
https://sandbox.experts.library.ucdavis.edu/search/orcid:0000-0001-9829-8914

qjhart commented 11 months ago

@Vensberg see tests above. Close if it looks good.

Any new changes, like the multiple keywords, etc., need to be a new issue.

Vensberg commented 11 months ago

Many of the issues have been solved. I wouldn't worry about partial matches; even the DOI resolver doesn't do that. But now there are DOIs that bring no results. I think it may be punctuation in the last part of the DOI (".", "-"). Examples: 10.1007/s00449-012-0743-z and 10.1016/j.earscirev.2022.104247

qjhart commented 11 months ago

@Vensberg I don't see either of these works in the experts profiles, Rebecca's for sure. Do you?

Vensberg commented 11 months ago

@qjhart, I retested with other DOIs (10.1038/s41598-019-48742-9, 10.3389/fevo.2021.604973), and the issue is resolved. Several DOIs I used originally for testing are no longer in those users' publications.

ucd-library / aggie-experts