Closed jhpoelen closed 4 months ago
extracted using:
$ preston track "https://ipt.lsa.umich.edu/archive.do?r=ummz_mammals" \
  | preston dwc-stream \
  | preston grep "Reithrodontomys FULVESCENS GRISEOFLAVUS" \
  | head -n1 \
  | jq .
resulted in:
{
"http://www.w3.org/ns/prov#wasDerivedFrom": "line:zip:hash://sha256/673d16b708ce0f94faabfa037b13e515a36144e405ad80d63280e75d5f85cc82!/occurrence.txt!/L130",
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "http://rs.tdwg.org/dwc/terms/Occurrence",
"http://rs.tdwg.org/dwc/text/id": "5127278f-211a-4e99-bbc1-a3ba9ac14c4e",
"http://rs.tdwg.org/dwc/terms/occurrenceID": "5127278f-211a-4e99-bbc1-a3ba9ac14c4e",
"http://rs.tdwg.org/dwc/terms/higherGeography": "MEXICO, JALISCO",
"http://rs.tdwg.org/dwc/terms/stateProvince": "JALISCO",
"http://rs.tdwg.org/dwc/terms/taxonomicStatus": "Current",
"http://rs.tdwg.org/dwc/terms/institutionCode": "UMMZ",
"http://rs.tdwg.org/dwc/terms/decimalLongitude": "-103.0245847000",
"http://rs.tdwg.org/dwc/terms/georeferenceVerificationStatus": "unverified",
"http://rs.tdwg.org/dwc/terms/basisOfRecord": "PreservedSpecimen",
"http://rs.tdwg.org/dwc/terms/occurrenceRemarks": null,
"http://rs.tdwg.org/dwc/terms/recordNumber": "3389",
"http://rs.tdwg.org/dwc/terms/verbatimLocality": "MEXICO: JALISCO: CO.; 0.5MI NW MAZAMITLA; 19.9207047, -103.0245847;",
"http://rs.tdwg.org/dwc/terms/verbatimLongitude": "-103.02458470000001",
"http://rs.tdwg.org/dwc/terms/catalogNumber": "100128",
"http://rs.tdwg.org/dwc/terms/scientificName": "Reithrodontomys FULVESCENS GRISEOFLAVUS",
"http://rs.tdwg.org/dwc/terms/kingdom": "Animalia",
"http://rs.tdwg.org/dwc/terms/otherCatalogNumbers": null,
"http://rs.tdwg.org/dwc/terms/country": "MEXICO",
"http://rs.tdwg.org/dwc/terms/georeferencedDate": null,
"http://rs.tdwg.org/dwc/terms/family": "Cricetidae",
"http://rs.tdwg.org/dwc/terms/nomenclaturalCode": "ICZN",
"http://rs.tdwg.org/dwc/terms/recordedBy": "Hooper, E.",
"http://rs.tdwg.org/dwc/terms/georeferencedBy": "Lucy Tran",
"http://rs.tdwg.org/dwc/terms/verbatimLatitude": "19.920704700000002",
"http://rs.tdwg.org/dwc/terms/collectionCode": "mammals",
"http://rs.tdwg.org/dwc/terms/eventDate": "02-07-1953",
"http://rs.tdwg.org/dwc/terms/georeferenceRemarks": null,
"http://rs.tdwg.org/dwc/terms/month": "02",
"http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters": "0.9940000000",
"http://rs.tdwg.org/dwc/terms/georeferenceProtocol": "MaNIS georeferencing guidelines",
"http://rs.tdwg.org/dwc/terms/specificEpithet": "FULVESCENS",
"http://rs.tdwg.org/dwc/terms/verbatimElevation": null,
"http://rs.tdwg.org/dwc/terms/geodeticDatum": "NAD27",
"http://rs.tdwg.org/dwc/terms/phylum": "Chordata",
"http://rs.tdwg.org/dwc/terms/preparations": "SKIN - 1|SKULL - 1",
"http://purl.org/dc/terms/modified": "2015-10-13",
"http://rs.tdwg.org/dwc/terms/order": "Rodentia",
"http://rs.tdwg.org/dwc/terms/year": "1953",
"http://rs.tdwg.org/dwc/terms/day": "07",
"http://rs.tdwg.org/dwc/terms/continent": "NORTH AMERICA",
"http://rs.tdwg.org/dwc/terms/establishmentMeans": "Native",
"http://rs.tdwg.org/dwc/terms/genus": "Reithrodontomys",
"http://rs.tdwg.org/dwc/terms/lifeStage": null,
"http://rs.tdwg.org/dwc/terms/county": null,
"http://rs.tdwg.org/dwc/terms/decimalLatitude": "19.9207047000",
"http://rs.tdwg.org/dwc/terms/reproductiveCondition": null,
"http://rs.tdwg.org/dwc/terms/georeferenceSources": "Localidades 2000, INEGI",
"http://rs.tdwg.org/dwc/terms/occurrenceStatus": "Present",
"http://rs.tdwg.org/dwc/terms/typeStatus": null,
"http://rs.tdwg.org/dwc/terms/higherClassification": null,
"http://rs.tdwg.org/dwc/terms/class": "Mammalia",
"http://rs.tdwg.org/dwc/terms/taxonRank": "subspecies",
"http://rs.tdwg.org/dwc/terms/sex": "FEMALE",
"http://rs.tdwg.org/dwc/terms/verbatimCoordinateSystem": "decimal degrees",
"http://rs.tdwg.org/dwc/terms/verbatimEventDate": "7 FEBRUARY 1953",
"http://rs.tdwg.org/dwc/terms/fieldNumber": null,
"http://rs.tdwg.org/dwc/terms/locality": "0.5MI NW MAZAMITLA",
"http://rs.tdwg.org/dwc/terms/infraspecificEpithet": "GRISEOFLAVUS"
}
Note that the UMMZ record mentioned above is interpreted and indexed only by its genus, Reithrodontomys, instead of by what I assume is its full subspecific scientific name:
Reithrodontomys FULVESCENS GRISEOFLAVUS
from https://www.gbif.org/occurrence/1987272568, accessed on 2022-07-07
A rough estimate of the UMMZ Mammals records left only crudely indexed by GBIF would be:
$ preston ls | preston dwc-stream | jq --raw-output '.["http://rs.tdwg.org/dwc/terms/scientificName"]' | grep -P "[A-Z]{2,}" | sort | uniq -c | wc -l
1942
affected names, associated with no more than
$ preston ls | preston dwc-stream | jq --raw-output '. | select(.["http://www.w3.org/1999/02/22-rdf-syntax-ns#type"] =="http://rs.tdwg.org/dwc/terms/Occurrence") | .["http://rs.tdwg.org/dwc/terms/scientificName"]' | grep "[A-Z][A-Z]" | wc -l
91392
91k out of
$ preston ls | preston dwc-stream | jq --raw-output '. | select(.["http://www.w3.org/1999/02/22-rdf-syntax-ns#type"] =="http://rs.tdwg.org/dwc/terms/Occurrence") | .["http://rs.tdwg.org/dwc/terms/scientificName"]' | grep "[A-Za-z]" | wc -l
128087
128k specimen records, leaving potentially more than half of UMMZ Mammals records indexed by GBIF only to the genus level.
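The same tally can be sketched in Python, reading the line-delimited JSON that `preston dwc-stream` emits. This is only a rough sketch and not part of preston: the function name `count_all_caps_names` is made up for illustration, and the regex mirrors the `[A-Z]{2,}` pattern used above. Unlike the first command above, it restricts all three counts to Occurrence records, and it treats any non-empty scientificName as a named record, so the numbers may differ slightly from the shell pipelines.

```python
import json
import re
from collections import Counter

SCIENTIFIC_NAME = "http://rs.tdwg.org/dwc/terms/scientificName"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OCCURRENCE = "http://rs.tdwg.org/dwc/terms/Occurrence"

# Two or more consecutive capitals, as in: grep -P "[A-Z]{2,}"
ALL_CAPS_RUN = re.compile(r"[A-Z]{2,}")


def count_all_caps_names(lines):
    """Return (distinct suspect names, suspect records, total named records).

    `lines` is any iterable of JSON strings, one record per line.
    """
    suspect = Counter()
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get(RDF_TYPE) != OCCURRENCE:
            continue  # only count occurrence records
        name = record.get(SCIENTIFIC_NAME)
        if not name:
            continue  # null or empty scientificName
        total += 1
        if ALL_CAPS_RUN.search(name):
            suspect[name] += 1
    return len(suspect), sum(suspect.values()), total
```

Calling `count_all_caps_names(sys.stdin)` at the end of a `preston ls | preston dwc-stream | python3 ...` pipeline would give all three numbers in one pass over the data.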
@mdoering I am noticing intermittent parsing failures on long names with all caps. I noticed related issues. Given that this would happen a bunch with millions of names, I wonder whether you are planning to resolve this issue?
Thanks for all your hard work in providing this neat modular Java taxonomic name parser library.
Related issues: https://github.com/gbif/name-parser/issues/51 https://github.com/gbif/name-parser/issues/12
Do you have examples of such problem cases? The issue you linked to is about long authorships, not all-capital names.
@mdoering thanks for asking for examples:
You'll find one here:
This name appeared, as stated earlier in https://github.com/globalbioticinteractions/nomer/issues/89#issue-1297909793 , in UMMZ mammals
found via:
preston track "https://ipt.lsa.umich.edu/archive.do?r=ummz_mammals"\
| preston dwc-stream\
| preston grep "Reithrodontomys FULVESCENS GRISEOFLAVUS"\
| head -n1\
| jq .
The one-second timeout in the GBIF code (which spins up a new timed job for each parse event) explains the intermittent nature of the crashes, especially when running on virtualized hardware like that used by GitHub Actions.
I hope this example helps, and I am curious to hear your thoughts on this.
That is a tricky one as the genus is not capitalized. The capital epithets could as well be authors. Not sure if I can find a fix quickly.
The separate threads and timeouts are unfortunately needed and part of the design, as regexes cannot be guaranteed to finish and there is no other way to stop a runaway regex than to kill its thread. You can increase the timeout in your code when using the parser, also to reflect your hardware. There will always be unparsable names though, something to code for, I am afraid.
@mdoering Given that you have to align millions of names with varying formatting due to local custom, existing data exports, etc., I can see why you had to introduce the timeout for the parsing. To me this is one of those seemingly hack-y solutions that actually conveys a ton of experience and pragmatism.
Thanks for taking the time to review.
Perhaps one thing to do here is to expect intermittent failures somehow, and to pre-process known suspect names. These names are in the record forever, so we'll have to find a way to make do.
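As a rough illustration of such pre-processing, the sketch below lowercases all-caps words that follow a normally capitalized genus before the name is handed to a parser. The function name `lowercase_all_caps_epithets` is made up for this example, and the heuristic is an assumption: it would also lowercase an all-caps author name (the ambiguity raised above), so it only makes sense for sources, like UMMZ mammals, where the all-caps words are known to be epithets.

```python
import re

# A capitalized genus followed only by one or more ALL-CAPS words.
_GENUS_THEN_CAPS = re.compile(r"^([A-Z][a-z]+)((?:\s+[A-Z]{2,})+)$")


def lowercase_all_caps_epithets(name):
    """Lowercase all-caps words that follow a normally capitalized genus.

    Names that do not fit the pattern are returned unchanged.
    """
    match = _GENUS_THEN_CAPS.match(name.strip())
    if match is None:
        return name
    genus, epithets = match.groups()
    return genus + " " + " ".join(word.lower() for word in epithets.split())
```

With this heuristic, "Reithrodontomys FULVESCENS GRISEOFLAVUS" becomes "Reithrodontomys fulvescens griseoflavus", while names with ordinary casing pass through untouched.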
Yes. And I'll try my best to improve the parser (which has that preprocessing discovery already for various problem cases). This case is just tricky because even as a human it is not entirely obvious what it is. I guess the Latin endings convince us that these all-caps words are epithets, not authors?
But I wanted to point out that the UnparsableException thrown due to a timeout or other reason is part of the parser's interface contract. This is mostly because of unpredictable regex performance: https://bugs.openjdk.org/browse/JDK-8260688
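On the calling side, coding for that contract could look roughly like the sketch below: a parse failure is treated as an expected outcome and the verbatim name is kept. Everything here is hypothetical and for illustration only; `UnparsableName` is a Python stand-in for the parser's UnparsableException, and `parse` is whatever parsing function the caller supplies, not an actual GBIF or Nomer API.

```python
class UnparsableName(Exception):
    """Hypothetical stand-in for the parser's UnparsableException."""


def parse_or_keep(name, parse):
    """Return (result, parsed_ok), falling back to the verbatim name.

    `parse` is any callable that returns a parsed name or raises
    UnparsableName (for example after a timeout).
    """
    try:
        return parse(name), True
    except UnparsableName:
        # Expected outcome, not a crash: keep the name as provided.
        return name, False
```

Keeping the `parsed_ok` flag next to the result lets a caller log or collect the names that fell through, so they can be reviewed or pre-processed later.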
echo -e "\tReithrodontomys FULVESCENS GRISEOFLAVUS"\
| nomer append gbif-parse
now produces:
[main] WARN org.gbif.nameparser.NameParserGBIF - Parsing timeout for: Reithrodontomys FULVESCENS GRISEOFLAVUS
Reithrodontomys FULVESCENS GRISEOFLAVUS SAME_AS Reithrodontomys FULVESCENS GRISEOFLAVUS
With the current version of Nomer/gbif-parser, a name that causes a timeout is returned unchanged. Neat to see that there's a specific timeout exception thrown, informing the API caller of the issue at hand.
| nomer append gbif-parse
providedExternalId | providedName | relationName | resolvedExternalId | resolvedName | resolvedAuthorship | resolvedRank | resolvedCommonNames | resolvedPath | resolvedPathIds | resolvedPathNames | resolvedPathAuthorships | resolvedExternalUrl |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Reithrodontomys FULVESCENS GRISEOFLAVUS | SAME_AS | Reithrodontomys | Fulvescens Griseoflavus | unranked |
to reproduce: