plazi / treatmentBank

Repository devoted to house keeping of treatmentBank

taxonomic names in latest treatments #39

Closed millerjeremya closed 2 years ago

millerjeremya commented 2 years ago

I was showing off the latest treatments on Plazi today (2022-03-08 (679)) to some colleagues (https://tb.plazi.org/GgServer/static/newToday.html). I noticed that several species are missing their epithets. A couple lacked higher taxon information. Several Drosophila were misclassified as fungi, and there were problems with species epithets missing their genera.

myrmoteras commented 2 years ago

This is the automated step - POA will do the QC and fix it.

The adiostola etc. are species groups that we have not yet properly figured out how to deal with - but we need to.

flsimoes commented 2 years ago

We're on to it.

I don't know why, but the authors added a full stop in between the name parts...

millerjeremya commented 2 years ago

What a strange formatting choice. Thanks everyone!

gsautter commented 2 years ago

The "formatting choice" with the period between genus and species got me curious, so I just looked at the source PDF, and it doesn't seem to have this period ... might be a highly pathological font decoding anomaly ... will take a look at the PDF after my current Skype.

flsimoes commented 2 years ago

Thanks for looking at that Guido

gsautter commented 2 years ago

Reproduced the font decoding glitch ... most pathological case so far, I guess ... digging in.

gsautter commented 2 years ago

Looks like an issue in the Unicode mapping ... something weird is going on there, most likely with the effect of conflating two characters.

gsautter commented 2 years ago

IMF UUID (to have it handy): FFB269399C4D702CD44FFFD9E928FFB2

gsautter commented 2 years ago

Turns out the Unicode mapping does actually map 0x20 (space) to 0x002E (period) ... need to figure out what to do about this.

gsautter commented 2 years ago

Managed to filter out the faulty mapping now (with a one-off catch in this freak instance), re-decoded the source, uploaded the corrected IMF to the server, and re-ran the batch ... free to proceed with QC as normal now.
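
For readers following along, here is a minimal sketch of what filtering out such a faulty mapping could look like. It is not the actual decoder code: the function names (`filter_faulty_mappings`, `decode`), the plain-dict representation of the font's Unicode mapping, and the rule "reject any entry that maps the space code 0x20 to a non-whitespace string" are all assumptions made for illustration. "Aus bus" is the usual placeholder binomial, not a name from the source PDF.

```python
# Hypothetical sketch; names and data layout are assumptions, not the real decoder.
SPACE_CODE = 0x20


def filter_faulty_mappings(to_unicode):
    """Split a {char code: Unicode string} mapping into kept and rejected entries.

    An entry is rejected when the space code (0x20) is mapped to anything
    that is not whitespace, e.g. to U+002E (period).
    """
    kept, rejected = {}, {}
    for code, text in to_unicode.items():
        if code == SPACE_CODE and text.strip() != "":
            rejected[code] = text
        else:
            kept[code] = text
    return kept, rejected


def decode(codes, to_unicode):
    """Decode char codes, falling back to chr(code) when no mapping exists."""
    return "".join(to_unicode.get(code, chr(code)) for code in codes)


# A font mapping with the anomaly described above: 0x20 -> U+002E.
faulty_mapping = {0x20: "\u002E"}
codes = [ord(ch) for ch in "Aus bus"]

kept, rejected = filter_faulty_mappings(faulty_mapping)
print(decode(codes, faulty_mapping))  # Aus.bus
print(decode(codes, kept))            # Aus bus
```

A rule this narrow only catches the one anomaly seen here; a general check would need to decide which other code-to-character pairs count as implausible.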