globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

parsing timeout when using gbif-parse #89

Open jhpoelen opened 1 year ago

jhpoelen commented 1 year ago

to reproduce:

$ echo -e "\tReithrodontomys FULVESCENS GRISEOFLAVUS" |  nomer append gbif-parse
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [gbif-parse]
[main] WARN org.gbif.nameparser.NameParserGBIF - Parsing timeout for name: Reithrodontomys FULVESCENS GRISEOFLAVUS
    Reithrodontomys FULVESCENS GRISEOFLAVUS SAME_AS     Reithrodontomys FULVESCENS GRISEOFLAVUS                         
jhpoelen commented 1 year ago

extracted using:

$ preston track "https://ipt.lsa.umich.edu/archive.do?r=ummz_mammals"\
 | preston dwc-stream\
 | preston grep "Reithrodontomys FULVESCENS GRISEOFLAVUS"\
 | head -n1\
 | jq .

resulted in:

{
  "http://www.w3.org/ns/prov#wasDerivedFrom": "line:zip:hash://sha256/673d16b708ce0f94faabfa037b13e515a36144e405ad80d63280e75d5f85cc82!/occurrence.txt!/L130",
  "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "http://rs.tdwg.org/dwc/terms/Occurrence",
  "http://rs.tdwg.org/dwc/text/id": "5127278f-211a-4e99-bbc1-a3ba9ac14c4e",
  "http://rs.tdwg.org/dwc/terms/occurrenceID": "5127278f-211a-4e99-bbc1-a3ba9ac14c4e",
  "http://rs.tdwg.org/dwc/terms/higherGeography": "MEXICO, JALISCO",
  "http://rs.tdwg.org/dwc/terms/stateProvince": "JALISCO",
  "http://rs.tdwg.org/dwc/terms/taxonomicStatus": "Current",
  "http://rs.tdwg.org/dwc/terms/institutionCode": "UMMZ",
  "http://rs.tdwg.org/dwc/terms/decimalLongitude": "-103.0245847000",
  "http://rs.tdwg.org/dwc/terms/georeferenceVerificationStatus": "unverified",
  "http://rs.tdwg.org/dwc/terms/basisOfRecord": "PreservedSpecimen",
  "http://rs.tdwg.org/dwc/terms/occurrenceRemarks": null,
  "http://rs.tdwg.org/dwc/terms/recordNumber": "3389",
  "http://rs.tdwg.org/dwc/terms/verbatimLocality": "MEXICO: JALISCO:  CO.; 0.5MI NW MAZAMITLA; 19.9207047, -103.0245847;",
  "http://rs.tdwg.org/dwc/terms/verbatimLongitude": "-103.02458470000001",
  "http://rs.tdwg.org/dwc/terms/catalogNumber": "100128",
  "http://rs.tdwg.org/dwc/terms/scientificName": "Reithrodontomys FULVESCENS GRISEOFLAVUS",
  "http://rs.tdwg.org/dwc/terms/kingdom": "Animalia",
  "http://rs.tdwg.org/dwc/terms/otherCatalogNumbers": null,
  "http://rs.tdwg.org/dwc/terms/country": "MEXICO",
  "http://rs.tdwg.org/dwc/terms/georeferencedDate": null,
  "http://rs.tdwg.org/dwc/terms/family": "Cricetidae",
  "http://rs.tdwg.org/dwc/terms/nomenclaturalCode": "ICZN",
  "http://rs.tdwg.org/dwc/terms/recordedBy": "Hooper, E.",
  "http://rs.tdwg.org/dwc/terms/georeferencedBy": "Lucy Tran",
  "http://rs.tdwg.org/dwc/terms/verbatimLatitude": "19.920704700000002",
  "http://rs.tdwg.org/dwc/terms/collectionCode": "mammals",
  "http://rs.tdwg.org/dwc/terms/eventDate": "02-07-1953",
  "http://rs.tdwg.org/dwc/terms/georeferenceRemarks": null,
  "http://rs.tdwg.org/dwc/terms/month": "02",
  "http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters": "0.9940000000",
  "http://rs.tdwg.org/dwc/terms/georeferenceProtocol": "MaNIS georeferencing guidelines",
  "http://rs.tdwg.org/dwc/terms/specificEpithet": "FULVESCENS",
  "http://rs.tdwg.org/dwc/terms/verbatimElevation": null,
  "http://rs.tdwg.org/dwc/terms/geodeticDatum": "NAD27",
  "http://rs.tdwg.org/dwc/terms/phylum": "Chordata",
  "http://rs.tdwg.org/dwc/terms/preparations": "SKIN - 1|SKULL - 1",
  "http://purl.org/dc/terms/modified": "2015-10-13",
  "http://rs.tdwg.org/dwc/terms/order": "Rodentia",
  "http://rs.tdwg.org/dwc/terms/year": "1953",
  "http://rs.tdwg.org/dwc/terms/day": "07",
  "http://rs.tdwg.org/dwc/terms/continent": "NORTH AMERICA",
  "http://rs.tdwg.org/dwc/terms/establishmentMeans": "Native",
  "http://rs.tdwg.org/dwc/terms/genus": "Reithrodontomys",
  "http://rs.tdwg.org/dwc/terms/lifeStage": null,
  "http://rs.tdwg.org/dwc/terms/county": null,
  "http://rs.tdwg.org/dwc/terms/decimalLatitude": "19.9207047000",
  "http://rs.tdwg.org/dwc/terms/reproductiveCondition": null,
  "http://rs.tdwg.org/dwc/terms/georeferenceSources": "Localidades 2000, INEGI",
  "http://rs.tdwg.org/dwc/terms/occurrenceStatus": "Present",
  "http://rs.tdwg.org/dwc/terms/typeStatus": null,
  "http://rs.tdwg.org/dwc/terms/higherClassification": null,
  "http://rs.tdwg.org/dwc/terms/class": "Mammalia",
  "http://rs.tdwg.org/dwc/terms/taxonRank": "subspecies",
  "http://rs.tdwg.org/dwc/terms/sex": "FEMALE",
  "http://rs.tdwg.org/dwc/terms/verbatimCoordinateSystem": "decimal degrees",
  "http://rs.tdwg.org/dwc/terms/verbatimEventDate": "7 FEBRUARY 1953",
  "http://rs.tdwg.org/dwc/terms/fieldNumber": null,
  "http://rs.tdwg.org/dwc/terms/locality": "0.5MI NW MAZAMITLA",
  "http://rs.tdwg.org/dwc/terms/infraspecificEpithet": "GRISEOFLAVUS"
}
jhpoelen commented 1 year ago

Note that the UMMZ record mentioned above is interpreted and indexed by GBIF only by its genus, Reithrodontomys, instead of by what I assume is its full subspecific scientific name:

Reithrodontomys FULVESCENS GRISEOFLAVUS

(from https://www.gbif.org/occurrence/1987272568, accessed on 2022-07-07)

Screenshot from 2022-07-07 16-18-01

Screenshot from 2022-07-07 16-25-28

jhpoelen commented 1 year ago

A rough estimate of how many distinct names in the UMMZ Mammals records are left only crudely indexed by GBIF:

$ preston ls | preston dwc-stream | jq --raw-output '.["http://rs.tdwg.org/dwc/terms/scientificName"]' | grep -P "[A-Z]{2,}" | sort | uniq -c | wc -l
1942

These 1942 names are associated with no more than

$ preston ls | preston dwc-stream | jq --raw-output '. | select(.["http://www.w3.org/1999/02/22-rdf-syntax-ns#type"] =="http://rs.tdwg.org/dwc/terms/Occurrence") | .["http://rs.tdwg.org/dwc/terms/scientificName"]' | grep "[A-Z][A-Z]" | wc -l
91392

91k out of

 $ preston ls | preston dwc-stream | jq --raw-output '. | select(.["http://www.w3.org/1999/02/22-rdf-syntax-ns#type"] =="http://rs.tdwg.org/dwc/terms/Occurrence") | .["http://rs.tdwg.org/dwc/terms/scientificName"]' | grep "[A-Za-z]" | wc -l
128087

128k specimen records, leaving potentially more than half of the UMMZ Mammals records (up to 91392 of 128087, about 71%) indexed by GBIF only to the genus level.

jhpoelen commented 1 year ago

@mdoering I am noticing intermittent parsing failures on long names written in all caps, and I found related issues (linked below). Given that this would happen often across millions of names, I wonder whether you are planning to resolve this issue?

Thanks for all your hard work in providing this neat, modular Java taxonomic name parser library.

Related issues: https://github.com/gbif/name-parser/issues/51 and https://github.com/gbif/name-parser/issues/12

mdoering commented 1 year ago

Do you have examples of such problem cases? The issue you linked to is about long authorships, not all-capital names.

jhpoelen commented 1 year ago

@mdoering thanks for asking for examples:

You'll find one here:

https://github.com/globalbioticinteractions/nomer/blob/0170ecc987c8eddb4bc44177ce962718ccf88cf0/nomer-name-correct/src/test/java/org/eol/globi/taxon/NameParserTestBase.java#L24

This name appeared, as stated earlier in https://github.com/globalbioticinteractions/nomer/issues/89#issue-1297909793, in UMMZ Mammals

found via:

preston track "https://ipt.lsa.umich.edu/archive.do?r=ummz_mammals"\
 | preston dwc-stream\
 | preston grep "Reithrodontomys FULVESCENS GRISEOFLAVUS"\
 | head -n1\
 | jq .

The one-second timeout in the GBIF code (which spins up a new timed job for each parse event) explains the intermittent nature of the crashes, especially when running on virtualized hardware like that used by GitHub Actions.

I hope this example helps, and I am curious to hear your thoughts on this.

mdoering commented 1 year ago

That is a tricky one, as the genus is not in all capitals while the epithets are. The all-capital epithets could just as well be authors. Not sure if I can find a fix quickly.

The separate threads and timeouts are unfortunately needed and part of the design, as regexes cannot be guaranteed to finish and there is no other way to stop a runaway regex than to kill its thread. You can increase the timeout in your code when using the parser, also to reflect your hardware. There will always be unparsable names though, something to code for, I am afraid.

jhpoelen commented 1 year ago

@mdoering I can see how hard it is to align millions of names with varying formatting due to local custom, existing data exports, etc., and why you had to introduce the timeout for the parsing. To me, this is one of those seemingly hack-y solutions that actually conveys a ton of experience and pragmatism.

Thanks for taking the time to review.

Perhaps one thing to do here is to expect intermittent failures somehow, and to pre-process known suspect names. These names are in the record forever, so we'll have to find a way to make do.
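One way such pre-processing might look is sketched below. This is only a heuristic under a stated assumption: that all-caps words following a normally capitalized genus are epithets rather than authors, which, as noted above, is not always certain. The function name `normalize_caps_epithets` is hypothetical and not part of nomer or the GBIF parser.

```shell
# Hypothetical helper: lowercase all-caps epithets that follow a
# normally capitalized genus (e.g. "Genus EPITHET SUBEPITHET").
# Reads one name per line on stdin; any other name passes through unchanged.
# Assumption: all-caps words after the genus are epithets, not authors.
normalize_caps_epithets() {
  awk '{
    n = split($0, w, " ")
    # first word must look like a genus: one capital, then lowercase letters
    ok = (n >= 2 && w[1] ~ /^[A-Z][a-z]+$/)
    # every remaining word must consist of two or more capitals
    for (i = 2; ok && i <= n; i++) if (w[i] !~ /^[A-Z][A-Z]+$/) ok = 0
    if (ok) {
      out = w[1]
      for (i = 2; i <= n; i++) out = out " " tolower(w[i])
      print out
    } else print
  }'
}
```

The normalized names could then be prefixed with a tab and piped to `nomer append gbif-parse`, as in the reproduction at the top of this issue. Keeping the verbatim name alongside the normalized one would preserve what is in the record.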

mdoering commented 1 year ago

Yes. And I'll try my best to improve the parser (which has that preprocessing discovery already for various problem cases). This case is just tricky because even as a human it is not entirely obvious what it is. I guess the Latin endings convince us that these all-caps words are epithets, not authors?

But I wanted to point out that the UnparsableException thrown due to a timeout or other reason is part of the parser's interface contract. This is mostly because of unpredictable regex performance: https://bugs.openjdk.org/browse/JDK-8260688

jhpoelen commented 1 week ago

$ echo -e "\tReithrodontomys FULVESCENS GRISEOFLAVUS"\
 |  nomer append gbif-parse

now produces:

[main] WARN org.gbif.nameparser.NameParserGBIF - Parsing timeout for: Reithrodontomys FULVESCENS GRISEOFLAVUS
    Reithrodontomys FULVESCENS GRISEOFLAVUS SAME_AS     Reithrodontomys FULVESCENS GRISEOFLAVUS                             

With the current version of Nomer/gbif-parser, a name that causes a timeout is returned as-is. Neat to see that there's a specific timeout exception thrown, informing the API caller of the issue at hand.
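Since timed-out names are passed through rather than dropped, it may help to collect them from the log for later review. The sketch below extracts names from the two "Parsing timeout" message variants quoted in this issue. The function name `timed_out_names` is hypothetical, and the message wording may differ in other parser versions.

```shell
# Hypothetical helper: print the names reported in "Parsing timeout"
# log lines read from stdin. Handles both message variants quoted in
# this issue ("... for name: X" and "... for: X"); other lines are ignored.
timed_out_names() {
  sed -n \
    -e 's/^.*Parsing timeout for name: //p' \
    -e 's/^.*Parsing timeout for: //p'
}
```

Assuming the log messages go to stderr (not verified here), a run might capture them with something like `nomer append gbif-parse < names.tsv 2> parse.log` and then list the affected names with `timed_out_names < parse.log | sort -u`, where `names.tsv` and `parse.log` are placeholder file names.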