Closed qjhart closed 11 months ago
Also, sandbox has our better keyword matches: https://sandbox.experts.library.ucdavis.edu/search/qjhart@ucdavis.edu https://sandbox.experts.library.ucdavis.edu/search/mailto:qjhart@ucdavis.edu or https://sandbox.experts.library.ucdavis.edu/search/https://doi.org/10.1016/j.compag.2018.09.042 https://sandbox.experts.library.ucdavis.edu/search/10.1016/j.compag.2018.09.042
Email works perfectly. I need to understand the DOI results: additional people are returned for some searches. Letters in the last part of the DOI seem to throw the search off. If the DOI is all numbers, the top result is right, and there are few others.
@Vensberg the problem with the DOIs is a function of the tokenizer, not the index. We are using the standard uax_url_email tokenizer, which is the recommended choice. However, it tokenizes too aggressively. Here is a good test for the Elasticsearch console.
POST _analyze
{
"tokenizer": "uax_url_email",
"text": "doi:10.000/foobar doi:/10.000/foobar 10.000/foobar http://doi.org/10.000/fubar mailto:qjhart@ucdavis.edu qjhart@ucdavis.edu https://orcid.org/0000-0001-9829-8914 orcid:0000-0001-9829-8914"
}
Only the URLs maintain their single token. The others are split on /, -, etc. The tokenizer we use is based on Unicode text segmentation and is quite sophisticated, but since most queries will be in English, maybe we can use something different. We will have to build it ourselves, though.
There are two paths forward. One, we can be less aggressive about what we tokenize, for example splitting only on whitespace and some punctuation. This is simple, but we might not get the best text searches, and not know why.
The other is to write rules for all the expected identifiers. In this case we'd convert them all to a URL equivalent, tokenize, and then convert back to the standard form. This would allow us to use the standard tokenizer. It's more work, but we'd have a better idea of what is going on.
I've been getting better results with the DOI weight at 20. I like the proposal for converting to URL equivalents, but depending on the effort that requires, that may have to wait for 2.1.
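To make the second option concrete, here is a minimal sketch of the "convert to a URL equivalent" step, written in Python only for illustration. The rule list, the regular expressions, and the name to_url_form are all assumptions, not the project's actual rules; it covers only DOIs and ORCID iDs, and it does not strip trailing punctuation from a DOI.

```python
import re

# Hypothetical patterns for illustration; the real rules may differ.
_DOI = r"10\.\d{4,9}/\S+"
_ORCID = r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]"

# Each rule: (pattern matching a bare or prefixed identifier, URL template).
# The (?<!\S) lookbehind keeps us from touching identifiers that are
# already embedded in a URL such as https://doi.org/10.3390/...
RULES = [
    (re.compile(rf"(?<!\S)(?:doi:/?)?({_DOI})", re.IGNORECASE),
     r"https://doi.org/\1"),
    (re.compile(rf"(?<!\S)(?:orcid:)?({_ORCID})(?!\S)", re.IGNORECASE),
     r"https://orcid.org/\1"),
]


def to_url_form(text: str) -> str:
    """Rewrite known identifiers in a query string to their URL equivalents."""
    for pattern, template in RULES:
        text = pattern.sub(template, text)
    return text


if __name__ == "__main__":
    print(to_url_form("doi:10.3390/ijgi3030929 merz orcid:0000-0001-9829-8914"))
```

Since the uax_url_email tokenizer already keeps URLs as single tokens, running the rewritten text through it should leave each identifier intact; converting back to the standard form afterwards would be a separate step not shown here.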
I have updated this so that DOIs, ORCIDs, ARKs, and URNs are all now single tokens, as in these searches:
https://sandbox.experts.library.ucdavis.edu/search/qjhart@ucdavis.edu https://sandbox.experts.library.ucdavis.edu/search/mailto:qjhart@ucdavis.edu https://sandbox.experts.library.ucdavis.edu/search/ucdavis.edu
https://sandbox.experts.library.ucdavis.edu/search/https://doi.org/10.3390/ijgi3030929 https://sandbox.experts.library.ucdavis.edu/search/doi:10.3390/ijgi3030929 https://sandbox.experts.library.ucdavis.edu/search/10.3390/ijgi3030929
Combining keywords doesn't work yet, and I'm not sure why:
https://sandbox.experts.library.ucdavis.edu/search/10.3390/ijgi3030929%20%2010.1021/ef400660u
but adding a text search does:
https://sandbox.experts.library.ucdavis.edu/search/10.3390/ijgi3030929%20merz
I'm on the fence about how partial DOIs should match. In some sense a partial DOI matches a publisher, but is that really helpful?
https://sandbox.experts.library.ucdavis.edu/search/doi:10.3390
https://sandbox.experts.library.ucdavis.edu/search/0000-0001-9829-8914 https://sandbox.experts.library.ucdavis.edu/search/https://orcid.org/0000-0001-9829-8914 https://sandbox.experts.library.ucdavis.edu/search/orcid:0000-0001-9829-8914
@Vensberg see tests above. Close if looks good.
Any new changes, such as multiple keywords, need to be a new issue.
Many of the issues have been solved. I wouldn't worry about partial matches; even the doi resolver doesn't do that. But now there are DOIs that bring no results. I think it may be punctuation in the last part of the DOI (".", "-"). Examples: 10.1007/s00449-012-0743-z 10.1016/j.earscirev.2022.104247
@Vensberg I don't see either of these works in the Experts profiles, and definitely not in Rebecca's. Do you?
@qjhart, I retested with other DOIs (10.1038/s41598-019-48742-9, 10.3389/fevo.2021.604973), and the issue is resolved. Several DOIs I used originally for testing are no longer in those users' publications.
Neither DOI searches nor email searches are pleasant in the current search. If these were keywords instead of text, would that provide a better experience? If that's the case, how would we make them both case insensitive, and what about the DOI: prefix that's often used on DOIs?
2023-12-11
The current search includes some text-based searches on DOIs. Originally we thought this could help find similar matches, but the results are confusing. We need
Expected behavior: DOIs only match the records with that exact DOI. However, multiple input formats are acceptable.
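As a rough sketch of that expected behavior (in Python, purely for illustration), each accepted format can be reduced to one lowercase key before an exact comparison. The names doi_key and exact_doi_matches, the accepted prefixes, and the record layout with its "id" values are assumptions, not the actual implementation. Lowercasing is safe because DOI names are case insensitive.

```python
import re

# Prefixes we assume users may type in front of a DOI.
_PREFIXES = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:/?)", re.IGNORECASE)


def doi_key(query: str) -> str:
    """Reduce any accepted DOI format to a bare, lowercase DOI."""
    return _PREFIXES.sub("", query.strip()).lower()


def exact_doi_matches(query: str, records: list[dict]) -> list[dict]:
    """Return only the records whose DOI equals the query exactly."""
    key = doi_key(query)
    return [r for r in records if doi_key(r["doi"]) == key]


if __name__ == "__main__":
    # Hypothetical records; the DOIs are ones mentioned in this thread.
    records = [
        {"id": "work-1", "doi": "10.3390/ijgi3030929"},
        {"id": "work-2", "doi": "10.1038/s41598-019-48742-9"},
    ]
    for q in ("doi:10.3390/ijgi3030929",
              "https://doi.org/10.3390/IJGI3030929",
              "doi:10.3390"):
        print(q, [r["id"] for r in exact_doi_matches(q, records)])
```

With this approach a partial DOI such as doi:10.3390 matches nothing, and punctuation in the suffix (".", "-") is compared as-is rather than tokenized.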
QA / QC Will assign to VE to review when completed