Closed rdmpage closed 1 year ago
Hello, Dr. Page!
This article is part of our daily processing, but it had not yet gone through the data quality analysis. I already fixed the paper. Thank you for the warning! Let me know if there is any other issue regarding this one.
Sincerely, Jonas.
Hi @JonasBlanco thanks for the speedy response. Does this mean that anything older than a day has been through quality control? I'm asking because I can see errors in records from, say, 2022-05-17.
No, there are bound to be errors here and there. But we are making an effort to run as much QC as we possibly can on a daily basis.
I recently tweeted about this example: https://twitter.com/rdmpage/status/1528776518700347393. For the paper "Quararibea alversonii (Malvaceae; Malvoideae), a new species from the Brazilian Atlantic Forest" (https://doi.org/10.11646/phytotaxa.547.2.9), Plazi manages to extract four different occurrences from the same record. Based on Plazi's extract I think the original record is:
This record is for a holotype and two isotypes, but Plazi breaks this into four records, some with incomplete data, and all have either imaginary or composite collection codes, based on author initials, etc.
You can see a visualisation of this here 03CE87C36C78C034BDB4FDCD03B113CF (screenshot below).
To make things worse, the actual specimens already exist in GBIF:
So the end result of this is that GBIF now contains at least four bogus records for this species (I haven't checked the rest extracted from this paper), when many of these specimens already exist.
I feel like a bit of a broken record (e.g. https://github.com/plazi/community/issues/94) but Plazi's parsing of specimens, especially plant specimens, is often horribly inaccurate. This example is not a one-off; it happens regularly, to the point where I routinely ignore Plazi-derived data in GBIF searches because it is likely to be error-strewn. There is so much potential here if the specimen parsing were improved to avoid generating bogus data.