gnames / gnverifier

GNverifier verifies scientific names against more than 100 biodiversity databases
https://verifier.globalnames.org

any chances to search datasources by id to get accepted names? #85

Open abubelinha opened 2 years ago

abubelinha commented 2 years ago

I am considering temporarily going back to the resolver.globalnames.org API, as long as it keeps returning better fuzzy/stem matches (at least for the few example names I have tested).

However, doing this creates other problems, because my final objective is to build a regional checklist of accepted names (based on some trusted data sources).

Question 1: I believe data source versions are identical in both resolver and gnverifier. Correct? (I am not sure, because resolver shows a no. 200 which is not in gnverifier.)

Anyway, I have realized that resolver returns a (matched) "name_string" and a (matched) "taxon_id", but it does not provide any information about that name's status according to the data source (accepted, synonym, ...).

verifier, on the other hand, returns a "matchedName" and a "recordId" (equivalent to those of resolver), PLUS a "currentName" and a "currentRecordId", at least when the data source provides them.

So, question 2: is there any way to reprocess resolver's output into a list of "taxon_id" + "data_source_id" pairs (I'll do this myself, of course) and send that list to gnverifier or any other gnames product that can return a list with "currentName" and "currentRecordId"?

If that's not possible, any other suggestion on how to do this? I.e., any chance to download a full given dataset from gnames in order to run this match locally? (By "full dataset" I mean a simple datasourceID.csv with at least these 4 columns: name, recordId, currentName, currentRecordId.)
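For context, this is roughly the kind of reprocessing I mean (a rough, untested sketch; the name_resolvers.json endpoint and JSON field names reflect my understanding of the resolver API and may need adjusting):

```python
# Sketch: collect matched name / taxon_id / data_source_id from resolver output.
# The endpoint and field names are assumptions based on the resolver docs.
import csv
import requests

resp = requests.get("http://resolver.globalnames.org/name_resolvers.json",
                    params={"names": "Ulva lactuca|Quercus robur"})
rows = []
for datum in resp.json()["data"]:
    for result in datum.get("results", []):
        # resolver returns the matched name and record,
        # but no accepted/current name for it
        rows.append((result["name_string"],
                     result["taxon_id"],
                     result["data_source_id"]))

with open("resolver_matches.csv", "w", newline="") as f:
    # the currentName / currentRecordId columns would still be missing
    csv.writer(f).writerows(rows)
```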

I guess there might be other APIs out there (GBIF, COL), but I prefer to avoid the problems that would arise if data source versions differ from those in gnames (see #81): some records and name statuses could be different between the two versions (or even missing in one of them).

Thanks a lot in advance

dimus commented 2 years ago

@abubelinha, first of all, thanks a lot for reporting https://github.com/gnames/gnverifier/issues/83; it would have been very hard to find without you spotting the problem.

Q1: Currently the gnverifier database is rebuilt from the resolver database, so the two are almost always in sync. Right now I am waiting for fixes in the Parasite Tracker dataset, and I will sync the databases when I get an update from them. When gnverifier gets its own harvester and, hopefully, a registry, resolver will be deprecated and put on a path to removal. I would not expect that earlier than 2023-2024.

Q2: I would prefer to solve the problem by fixing gnverifier; would that be a reasonable solution for you? Such manual one-time tasks of reconciling legacy and current tools are not a good use of time, I think. It is better to solve the problems for all users.

Related issues are:

https://github.com/gnames/gnmatcher/issues/45

https://github.com/gnames/gnmatcher/issues/46

abubelinha commented 2 years ago

Spotting #83 was pure chance, but it was a consequence of me comparing results returned by verifier & resolver. When I first discovered the globalnames resolver, I was impressed and became so interested that I requested more data sources to be added, and asked about the differences with gnverifier. You told me about resolver deprecation dates and the speed improvements of gnverifier, so I just moved to verifier.

I did it blindly, assuming that, except for their API differences, the results would be identical (same data sources, algorithms and matching rules, just implemented in a faster programming language). I didn't make any comparisons at the time, and trusted that the same list processed by the two services would keep returning exactly the same matches.

But now I am finding many examples proving that is not the case. And to my surprise, whenever I found differences the winner was always resolver (I mean that resolver's matching results were much closer to my human matching criteria than verifier's. There may be examples where resolver loses, but I haven't found one yet).

I am just guessing, but from your recent answers I understand those differences may not always be "bugs", as I thought, but expected behaviour? (a different design, focused on a speedier matching process).

It is better to solve the problems for all users.

Indeed, I agree. But I am not sure mine is actually a problem for other users. From what I said above, I feel there are discrepancies between my opinion of "good" matching criteria and the current design, and probably some other people would prefer the current verifier behaviour (if it was designed like that on purpose, there must be very strong reasons behind it).

I also love many aspects of verifier compared to resolver (the extra information about accepted/synonym status of names in some data sources, which is crucial for my use case, plus more atomized information about matching scores, which is very good to know). But regarding speed improvements, I really don't care much about those, since my absolute priority is obtaining a "good" result.

It was in this context that I asked about the possibility of sticking to resolver's "goodness" and not bothering with "verifier issues" which might not be issues for most people. I thought one important use of gnames was processing names from OCR texts, where misspellings are likely. If that's the case, it really surprises me that nobody else is raising issues about wrong fuzzy matches in gnverifier: just by removing the last letter of a species epithet, I find plenty of cases where verifier falls back to a "PartialExact" match to the genus (instead of returning the correct species as a "Fuzzy" match, like resolver always does). Try Quercus robur, Pinus pinaster, Zea mays, Oryza sativa ... minus their last letter, in verifier.

Might those (IMHO wrong) "PartialExact" matches sometimes be related to an "overloadDetected" warning? ("Too many variants (possibly strains), some results are truncated", e.g. for Quercus or Pinus in the same queries above.) Resolver does not have this overload problem and just returns the correct species as a "Fuzzy" match. Could this be because resolver processes names grouped by data source, instead of matching all data sources' names together? (I wonder if this is what you meant here with "gnmatcher makes exact and fuzzy matching but without any information about data-sources. This allows to avoid calls to database which significantly speeds things up".)

I would prefer to solve the problem by fixing gnverifier, would it be a reasonable solution for you?

Indeed, but I don't want to be a bother or to go against the majority.

dimus commented 2 years ago

Sometimes the differences are indeed due to speed requirements, but in other cases they are just a matter of tuning parameters. I will go through your examples and analyze their behavior in gnverifier.

For cases where you know that letters are missing, you can try to use search instead of verify:

https://verifier.globalnames.org/api/v0/search/n:Quercus%20robu. (the dot at the end is important)
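The same call from a script looks like this (a minimal sketch; the URL and query syntax are copied from above, and the response is just printed as raw JSON):

```python
# Minimal sketch: the search call above, issued with Python's requests.
import requests
from urllib.parse import quote

query = "n:Quercus robu."  # the trailing dot is important
url = "https://verifier.globalnames.org/api/v0/search/" + quote(query, safe=":")
print(requests.get(url).json())
```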

The goal with resolver was to make it work. The goal with verifier is to make it fast while it still works.

So your feedback is very important, and it helps to find where the algorithms in verifier need to be tweaked.

abubelinha commented 2 years ago

OK understood. Thanks.

For cases where you know that letters are missing you can try to use search instead verify:

Well, those weren't real cases. I was showing a simple example of "easy fuzzy matching tests" where resolver succeeded while verifier failed (to my surprise). It's easy to find other similar examples by making different tiny changes to the epithet string.

Another example of the relevance of catching these single-letter changes is orthographic variants of botanical names, which are not uncommon in literature, herbarium labels and databases. Again, resolver handles them (by fuzzy matching) much better than verifier (e.g. if I try to match Jonopsidium abulense, resolver correctly fuzzy-matches Ionopsidium abulense in all 3 of my preferred data sources, but verifier does not match it in any of them and just returns a bestResult from the data sources which used the "J" orthovariant).

But I wonder why this resolver behaviour is not consistent. For some other names, if I change just the first letter, resolver does not fuzzy-match them either (example: Ulva lactuca matches in data source 195, but when changed to Vlva lactuca there is no resolver fuzzy match). I am pretty confused about that.

Anyway ... IMHO those one-letter changes are cases that should always return a fuzzy match in either resolver or verifier.

I had thought about making a simple comparator between resolver and verifier outputs: a script following these steps (a rough sketch appears after the list):

  1. Take a random list of N names,
  2. force a random change in each of them (choose the 1st letter, the last letter, a mid letter, ... choose to change/remove/add/move 1 character ... or whatever "mistake" you can imagine),
  3. pass this list of erroneous names (which are still human-readable) to both the resolver and verifier APIs,
  4. compare both APIs' bestResult,
  5. do the same, but passing preferred sources in step 3 and comparing preferred results in step 4,
  6. or just run steps 3-5 on a real list coming from somewhere else (no need to create it in steps 1-2).
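Something like this (a rough, untested sketch; the endpoints and response shapes reflect my understanding of the two APIs and may need adjusting):

```python
# Rough sketch of the comparator: mutate names, query both services,
# print the best matches side by side. Endpoints and JSON field names
# are assumptions about the two APIs.
import random
import requests

def mutate(name: str) -> str:
    """Introduce one random single-character "mistake" into a name."""
    i = random.randrange(len(name))
    letter = random.choice("abcdefghijklmnopqrstuvwxyz")
    return random.choice([
        name[:i] + name[i + 1:],           # remove a character
        name[:i] + letter + name[i:],      # add a character
        name[:i] + letter + name[i + 1:],  # change a character
    ])

def resolver_best(name: str):
    r = requests.get("http://resolver.globalnames.org/name_resolvers.json",
                     params={"names": name})
    results = r.json()["data"][0].get("results") or []
    return results[0]["name_string"] if results else None

def verifier_best(name: str):
    r = requests.post("https://verifier.globalnames.org/api/v0/verifications",
                      json={"nameStrings": [name]})
    best = r.json()[0].get("bestResult") or {}
    return best.get("matchedName")

for name in ["Quercus robur", "Pinus pinaster", "Zea mays", "Oryza sativa"]:
    bad = mutate(name)
    print(f"{bad!r}: resolver={resolver_best(bad)!r} verifier={verifier_best(bad)!r}")
```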

But I was already finding differences quite easily by hand, so this script became unnecessary (you might be interested in doing it, though).

dimus commented 2 years ago

Try Quercus robur, Pinus pinaster, Zea mays, Oryza sativa ... minus their last letter, in verifier.

Some of these do not work because fuzzy matching works on stemmed versions. So, for example, Quercus robu has the stem Quercus rob, which is already 2 edit distances away from Quercus robur, while the fuzzy threshold is normally 1.

In general this can be solved by using edit distance 2 as the threshold (about 10 times slower than 1). I would not do it for normal use, because people are most careful about the first and last letters when they enter data. But for cases where people do not care about speed, I can introduce a parameter that changes the edit distance threshold from 1 to 2.

Besides being slow, edit distance 2 also introduces many more false positives. But in cases like yours, where you check the results manually, that would not be a problem either.

Note that the 1 or 2 edit distance applies to stems, so the final edit distance on the full names can actually be 2, 3, or even 4.
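The arithmetic is easy to check (a minimal sketch; the stemmed forms are taken from the example above):

```python
# Minimal sketch illustrating the stem-distance arithmetic described above.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# "Quercus robu" stems to "Quercus rob"; it is compared against "Quercus robur".
print(levenshtein("Quercus rob", "Quercus robur"))  # 2 -> above the default threshold of 1
```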

Probably the parameter should be called ExtraFuzzy=true :)

dimus commented 2 years ago

example: Ulva lactuca matches in data source 195, but when changed to Vlva lactuca there is no resolver fuzzy match

This happens because there is a quota on how many errors are allowed per a given number of letters. I recall that for resolver the quota is 1 error per 6 letters, so Vlva does not get through.

The purpose of the quota is to remove false positives.

For gnverifier the quota is 1 error per 5 letters. I will try to reduce it to 4 for the next release.
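My reading of the quota, as a sketch (whether it applies per word or per whole name string is an assumption on my part):

```python
# Sketch of the per-letter error quota described above (my interpretation).
def allowed_edits(word: str, letters_per_error: int) -> int:
    """How many fuzzy edits the quota permits for a word of this length."""
    return len(word) // letters_per_error

for word in ["Vlva", "Quercus"]:
    print(word,
          "resolver:", allowed_edits(word, 6),    # 1 error per 6 letters
          "gnverifier:", allowed_edits(word, 5))  # 1 error per 5 letters
# "Vlva" (4 letters) gets 0 allowed edits under both quotas, hence no fuzzy match.
```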

dimus commented 2 years ago

Comparing resolver and gnverifier: I do have an idea of how and where they differ, because I wrote the code and reused modified ideas from resolver in gnverifier. But there is something I do not know very well: how a researcher like you uses these two projects, what is important to people, and what is missing.

Feedback from you and others is the most useful thing for me in understanding use cases and in trying to provide functionality that covers as many of them as possible within the existing constraints.

abubelinha commented 2 years ago

I see ... it is quite complicated to tune the tool for every need. First I will need to dive into the API docs about edit distances, thresholds, quotas and the rest. (I am not setting any of those right now ... I didn't even know I could.)

As for understanding my use case: I want to validate a draft list of species known to exist in a particular region:

  1. My starting draft list is constructed from a handful of heterogeneous sources: several museum databases + GBIF occurrences + a previous old checklist for the area (the final purpose is to update this old checklist).

    • The draft list will thus contain different versions of the same names, which must be homogenized.
    • Also, some of them might be synonyms of others.
  2. Then we'd compare this draft list against some other preferred lists, to track synonyms and extract a currentNameId. This is the reason to use gnames. But our main preferred checklist is not in gnames (and can't be, since it is not yet complete and published). Hence gndiff (or a similar tool) becomes very important.

  3. We have a preference order for matching. So if a draft-list name matches a name in our 1st checklist, we take the currentName and currentNameId from it. If there is no match, we look for a match in the next one ... and so on (until we flag the name "unresolved" if no currentNameId was found). A sketch of this cascade follows the list.

  4. As for detecting false positives, my idea is to gnparse the final matchedNames column and the original draft list (if the fullCanonic does not match 100%, that name should be reviewed; and if it matches, the authorships should also be close, checked with a string-distance tool).
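A minimal sketch of the cascade in step 3 (hypothetical: each checklist here is just a dict from a matched name to its accepted name and id; in reality that lookup would be a gndiff or gnverifier call against one preferred source):

```python
# Hypothetical sketch of the preference-order cascade in step 3.
# Each "checklist" is a dict mapping a matched name to
# (currentName, currentNameId); the real lookup would be gndiff/gnverifier.
from typing import Optional, Tuple

Match = Tuple[str, str]  # (currentName, currentNameId)

def resolve(name: str, checklists: list) -> Optional[Match]:
    for checklist in checklists:   # most trusted source first
        match = checklist.get(name)
        if match is not None:
            return match           # take currentName(+Id) from this source
    return None                    # no currentNameId found -> flag "unresolved"

checklist1 = {"Ionopsidium abulense": ("Ionopsidium abulense", "cl1-123")}
checklist2 = {"Ulva lactuca": ("Ulva lactuca", "cl2-456")}
print(resolve("Ulva lactuca", [checklist1, checklist2]))   # found in the 2nd list
print(resolve("Quercus robu", [checklist1, checklist2]))   # None -> "unresolved"
```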

It would be great if gndiff finally had options to fine-tune its fuzziness with all those numerical parameters you mention, plus letting the user select the output detail (i.e. choosing parsed info columns and so on).

I think that would be an easier solution than changing server-side variables, which could improve my results but be wrong for other use cases.

dimus commented 2 years ago

I see ... it is quite complicated to tune the tool for every need.

Indeed. My goal is more to figure out the intersection of existing use cases where a gn tool can help, and to move towards that space.

As for detecting false positives, my idea is to gnparse the final matchedNames column and the original draft list (if the fullCanonic does not match 100%, that name should be reviewed; and if it matches, the authorships should also be close, checked with a string-distance tool).

BTW, gnverifier already checks authors. The field scoreDetails -> authorMatchScore can probably help you.
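For example (a sketch, untested; the endpoint and surrounding JSON shape are assumptions, while the scoreDetails -> authorMatchScore path is as named above):

```python
# Sketch: read authorMatchScore from a gnverifier response.
# Endpoint and response shape are assumptions; the
# scoreDetails -> authorMatchScore path is as described above.
import requests

resp = requests.post("https://verifier.globalnames.org/api/v0/verifications",
                     json={"nameStrings": ["Quercus robur L."]})
for item in resp.json():
    best = item.get("bestResult") or {}
    score = best.get("scoreDetails", {}).get("authorMatchScore")
    print(best.get("matchedName"), score)
```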

dimus commented 2 years ago

It would be great if gndiff finally had options to fine-tune its fuzziness with all those numerical parameters you mention, plus letting the user select the output detail (i.e. choosing parsed info columns and so on).

Would you make an issue at gndiff describing this?

abubelinha commented 2 years ago

I did it two weeks ago ... but perhaps it was toooooo verbose? https://github.com/gnames/gndiff/issues/13

dimus commented 2 years ago

Ah, sorry about that, I thought those were general thoughts and I just had not gotten to it yet :) I'll take a look.

As a rule of thumb, when there is a concrete task it is better to keep it separate from the rest; that allows me to create a focused commit addressing that particular problem.

abubelinha commented 2 years ago

As a rule of thumb, when there is a concrete task it is better to keep it separate from the rest; that allows me to create a focused commit addressing that particular problem.

OK, I'll try to do that. I hope not to get lost among so many open issues.

BTW, gnverifier already checks authors. The field scoreDetails -> authorMatchScore can probably help you.

Thanks! I opened issue #86 about this.

I see ... it is quite complicated to tune the tool for every need.

Indeed. My goal is more to figure out the intersection of existing use cases where a gn tool can help, and to move towards that space.

I have a new concern about the many use cases: #87. I also see gndiff as a great opportunity for letting users tune default parameters without affecting API development (scoring behaviour, required return parameters, bandwidth used ...).