Open jhpoelen opened 1 month ago
For demonstration, processing the suspicious data with mlr, a well-known CSV processing tool,
cat suspicious.csv | mlr --icsv --oxtab cat
produced:
mlr: mlr: CSV header/data length mismatch 95 != 48 at filename (stdin) row 2.
However,
cat suspicious-updated.csv | mlr --icsv --oxtab cat
produced:
id 4818286
institutionCode PH
collectionCode
ownerInstitutionCode
collectionID cb974399-6966-4d9a-9755-515c9c5d0929
basisOfRecord PreservedSpecimen
occurrenceID 3352d191-35a0-4d8f-8c39-98dc57fcf24b
catalogNumber PH00792055
otherCatalogNumbers
higherClassification Fungi|Dikarya|Ascomycota|Pezizomycotina|Lecanoromycetes|Lecanoromycetidae|Peltigerales|Collematineae|Pannariaceae|Degelia
kingdom Fungi
phylum Ascomycota
class Lecanoromycetes
order Peltigerales
family Pannariaceae
scientificName Degelia plumbea
taxonID 54246
scientificNameAuthorship (Lightf.) P.M. Jørg. & P. James
genus Degelia
subgenus
specificEpithet plumbea
verbatimTaxonRank
infraspecificEpithet
taxonRank Species
identifiedBy
dateIdentified
identificationReferences
identificationRemarks
taxonRemarks
identificationQualifier
typeStatus
recordedBy M. Anzi
recordNumber s.n.
eventDate
year
month
day
startDayOfYear
endDayOfYear
verbatimEventDate s.d.
occurrenceRemarks Ex Herbarium of the Swedish Museum of Natural History (S); Lichen Herbarium of James C. Lendemer.; Lichenes Rariores Veneti "additis nonnullis speciebus ex vicinis regionibus" quos ex herbario Massalongiano in continuationem lichenum Italiae exsiccatorum excerpsit evulgavitque presb. Martinus Anzi eq. Mauritianus in seminario Novo-Comensi professor [Anzi, Lich. Rar. Veneti [Novi-Comi]], 1-175, Martino Anzi, exs #: 11
habitat in collibus vallibusque Etruriae [in the hills and valleys of Etruria]; ad arborum [to the trees], praesertim Castanearum, et Olearum, truncos [especially the trunks of Chestnuts and Olives]
behavior
vitality
fieldNumber
eventID
informationWithheld
dataGeneralizations
dynamicProperties exsiccatae: {"exsTitle":"Lichenes Rariores Veneti \"additis nonnullis speciebus ex vicinis regionibus\" quos ex herbario Massalongiano in continuationem lichenum Italiae exsiccatorum excerpsit evulgavitque presb. Martinus Anzi eq. Mauritianus in seminario Novo-Comensi professor","exsAbbreviation":"Anzi, Lich. Rar. Veneti [Novi-Comi]","exsRange":"1-175","exsEditor":"Martino Anzi","exsNumber":"11"}
associatedOccurrences
associatedSequences
associatedTaxa
reproductiveCondition
establishmentMeans
lifeStage
sex
individualCount
samplingProtocol
preparations
locationID
continent
waterBody
islandGroup
island
country Italy
stateProvince
county
municipality
locality Monte Pisano, valli sopra Pistoja, Valle del Mugnone, Etruria
locationRemarks
decimalLatitude 43.750016
decimalLongitude 10.550023
geodeticDatum WGS84
coordinateUncertaintyInMeters 6943
verbatimCoordinates
georeferencedBy aurbanz
georeferenceProtocol
georeferenceSources GeoLocate (CoGe)
georeferenceVerificationStatus
georeferenceRemarks Mounte Pisano, Tuscany, Italy. unc. covers the mountain
minimumElevationInMeters
maximumElevationInMeters
minimumDepthInMeters
maximumDepthInMeters
verbatimDepth
verbatimElevation
disposition
language
recordEnteredBy ebenamy
modified 2024-03-23 13:26:16
rights http://creativecommons.org/licenses/by-nc/3.0/
rightsHolder
accessRights
recordID 3352d191-35a0-4d8f-8c39-98dc57fcf24b
references https://lichenportal.org/portal/collections/individual/index.php?occid=4818286
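The header/data length check that mlr reports above can be approximated with Python's standard csv module. This is a minimal sketch, not Miller's implementation: `find_length_mismatches` and the three-column sample are hypothetical, and Python's parser may count fields differently from mlr on malformed input.

```python
import csv
import io


def find_length_mismatches(text):
    """Return (line number, header length, row length) for each row
    whose field count differs from the header's."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return [
        (reader.line_num, len(header), len(row))
        for row in reader
        if len(row) != len(header)
    ]


# Hypothetical sample: the second data row is missing a field.
sample = "id,country,locality\n1,Italy,Monte Pisano\n2,Italy\n"
mismatches = find_length_mismatches(sample)
print(mismatches)
```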
See a similar issue at https://github.com/BioKIC/symbiota-docs/issues/441 , reported on Dec 7, 2023.
Thanks, Jorrit! With this clear description, I have added this problem to our list of bugs to fix: https://github.com/BioKIC/Symbiota/issues/1777
As I was chasing down bunnies (Sylvilagus floridanus) as part of an ongoing discussion at https://discourse.gbif.org/t/when-does-evidence-of-impact-become-too-onerous-to-track/4639/11, I stumbled across a Darwin Core archive with content id
hash://sha256/f65803cec348a1ecdb968be103d6e77e101217793339f4412377e1d3a0f36f38
that was found in an Oct 2024 copy of GBIF/iDigBio registered datasets with id hash://sha256/fc6a81d2968272005c947df326ac3e99fd4db9b585198d64208c52475e9c1dc8 .
The content at https://linker.bio/line:zip:hash://sha256/f65803cec348a1ecdb968be103d6e77e101217793339f4412377e1d3a0f36f38!/occurrences.csv!/L1,L19216
, as associated with https://lichenportal.org/portal/collections/individual/index.php?occid=4818286 , includes
"exsiccatae: {""exsTitle"":""Lichenes Rariores Veneti \\"additis nonnullis speciebus ex vicinis regionibus\\" quos ex herbario Massalongiano in continuationem lichenum Italiae exsiccatorum excerpsit evulgavitque presb. Martinus Anzi eq. Mauritianus in seminario Novo-Comensi professor"",""exsAbbreviation"":""Anzi, Lich. Rar. Veneti [Novi-Comi]"",""exsRange"":""1-175"",""exsEditor"":""Martino Anzi"",""exsNumber"":""11""}"
For some reason, the quote in
\\"additis
was not escaped as expected with a second double quote. See the included suspicious.csv for the relevant section, including the header.
When it is replaced with
\""additis
(and likewise for similar occurrences), a valid CSV appears to be produced. See the included suspicious-updated.csv for an edited record with escaping that is a little friendlier to CSV parsers.
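To illustrate the escaping difference, here is a minimal sketch using Python's standard csv module. The `value` string is a hypothetical, shortened stand-in for the dynamicProperties field, and `to_csv_line` is a helper introduced only for this example. It is not the portal's export code. It only shows that a conventional CSV writer doubles every quote, including the one after the backslash, and that the un-doubled form trips up a parser.

```python
import csv
import io

# Hypothetical, shortened stand-in for the dynamicProperties value:
# JSON text whose title contains backslash-escaped inner quotes.
value = 'exsiccatae: {"exsTitle":"Lichenes \\"additis\\" quos","exsNumber":"11"}'


def to_csv_line(fields):
    """Serialize one row using the csv module's default quote doubling."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue()


# Every double quote in the value is doubled, giving \""additis.
good_line = to_csv_line(["4818286", value, "Italy"])
round_trip = next(csv.reader(io.StringIO(good_line)))

# Recreate the suspicious form: backslash doubled, quote left single.
bad_line = good_line.replace('\\""', '\\\\"')

# A strict parser rejects the stray quote inside the quoted field.
strict_error = None
try:
    next(csv.reader(io.StringIO(bad_line), strict=True))
except csv.Error as exc:
    strict_error = exc

# A lenient parser carries on, but the field count no longer matches.
lenient = next(csv.reader(io.StringIO(bad_line)))

print(good_line, end="")
print(round_trip == ["4818286", value, "Italy"])
print(strict_error)
print(len(lenient))
```

The round trip preserves the JSON text exactly, backslashes included, so a consumer can still pass the field to a JSON parser after the `exsiccatae: ` prefix is stripped.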
The related dataset is described by the EML defined by:
https://linker.bio/zip:hash://sha256/f65803cec348a1ecdb968be103d6e77e101217793339f4412377e1d3a0f36f38!/eml.xml
being
fyi @nicofranz @themerekat - an example of how data reviews on data with known provenance can help pinpoint very specific issues.