jhpoelen / cite-the-bunnies

proof of concept for citing bunny records found in a version of GBIF/iDigBio

suspicious escaping found in dynamic properties field collection Academy of Natural Sciences of Drexel University - Lichens also identified by https://lichenportal.org/portal/collections/misc/collprofiles.php?collid=33 #1

Open jhpoelen opened 1 month ago

jhpoelen commented 1 month ago

As I was chasing down bunnies (Sylvilagus floridanus) as part of an ongoing discussion at https://discourse.gbif.org/t/when-does-evidence-of-impact-become-too-onerous-to-track/4639/11, I stumbled across a Darwin Core archive with content id hash://sha256/f65803cec348a1ecdb968be103d6e77e101217793339f4412377e1d3a0f36f38 that was found in an Oct 2024 copy of GBIF/iDigBio registered datasets with id hash://sha256/fc6a81d2968272005c947df326ac3e99fd4db9b585198d64208c52475e9c1dc8 .

https://linker.bio/line:zip:hash://sha256/f65803cec348a1ecdb968be103d6e77e101217793339f4412377e1d3a0f36f38!/occurrences.csv!/L1,L19216

This record, associated with https://lichenportal.org/portal/collections/individual/index.php?occid=4818286 , includes:

"exsiccatae: {""exsTitle"":""Lichenes Rariores Veneti \\"additis nonnullis speciebus ex vicinis regionibus\\" quos ex herbario Massalongiano in continuationem lichenum Italiae exsiccatorum excerpsit evulgavitque presb. Martinus Anzi eq. Mauritianus in seminario Novo-Comensi professor"",""exsAbbreviation"":""Anzi, Lich. Rar. Veneti [Novi-Comi]"",""exsRange"":""1-175"",""exsEditor"":""Martino Anzi"",""exsNumber"":""11""}"

For some reason, the quote in \\"additis was not escaped by doubling it, as would be expected for a double quote inside a quoted CSV field.
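A minimal sketch of why the undoubled quote matters, using Python's `csv` module in strict mode as a stand-in parser (an assumption; it is not the tool used in this issue). The two rows are hypothetical, shortened versions of the record, and only the escaping pattern is taken from the text above.

```python
import csv
import io

# Hypothetical, shortened rows; only the escaping pattern follows the issue.
HEADER = "id,dynamicProperties\n"
AS_FOUND = HEADER + r'''4818286,"exsiccatae: {""exsTitle"":""Lichenes \\"additis\\" quos""}"''' + "\n"
AS_EXPECTED = HEADER + r'''4818286,"exsiccatae: {""exsTitle"":""Lichenes \""additis\"" quos""}"''' + "\n"


def parses_strictly(text):
    """Return True if every row parses under Python's strict CSV rules."""
    try:
        for _ in csv.reader(io.StringIO(text), strict=True):
            pass
        return True
    except csv.Error:
        # A lone quote inside a quoted field ends the field early, and the
        # strict parser then rejects the character that follows it.
        return False


print(parses_strictly(AS_FOUND))
print(parses_strictly(AS_EXPECTED))
```

In this sketch the backslashes are ordinary characters to the CSV layer; only the doubling of the quote decides whether the field stays open.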

See included suspicious.csv for the relevant section including the header.

When this is replaced with \""additis, and the similar occurrence at regionibus\\" is changed likewise, a valid CSV appears to be produced.

See included suspicious-updated.csv for an edited record with escaping that would be a little friendlier to csv parsers.
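The friendlier escaping can be seen as two independent layers: JSON escapes the inner quotes with a backslash, and CSV then doubles every double quote in the field. The sketch below illustrates that layering with Python's standard `json` and `csv` modules. It is not how the portal generates its export, and the shortened title is a hypothetical stand-in.

```python
import csv
import io
import json

# Hypothetical, shortened exsiccatae entry with literal double quotes in the title.
exsiccatae = {"exsTitle": 'Lichenes "additis" quos', "exsNumber": "11"}

# Layer 1: JSON escapes each inner quote with a backslash.
payload = "exsiccatae: " + json.dumps(exsiccatae, separators=(",", ":"))

# Layer 2: the CSV writer quotes the field and doubles every double quote.
buffer = io.StringIO()
csv.writer(buffer).writerow(["4818286", payload])
line = buffer.getvalue()

# Reading the line back recovers the payload, and the JSON still decodes.
row = next(csv.reader(io.StringIO(line), strict=True))
decoded = json.loads(row[1].split(": ", 1)[1])

print(line.strip())
print(decoded["exsTitle"])
```

The written line contains the same `\""` sequence that appears in the edited record, which suggests the edit matches what a standard CSV writer would emit for this value.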

The related dataset is described by the EML at:

https://linker.bio/zip:hash://sha256/f65803cec348a1ecdb968be103d6e77e101217793339f4412377e1d3a0f36f38!/eml.xml

which reads:

<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:dc="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://rs.gbif.org/schema/eml-gbif-profile/1.0.1/eml.xsd" packageId="4228ac8d-7b93-4fd6-b623-9b4d62800885" system="https://symbiota.org" scope="system" xml:lang="eng">
  <dataset>
    <alternateIdentifier>https://lichenportal.org/portal/collections/misc/collprofiles.php?collid=33</alternateIdentifier>
    <title xml:lang="eng">Academy of Natural Sciences of Drexel University - Lichens</title>
    <creator id="5fe21305-e0fb-4af6-b838-44ed2b95378e">
      <organizationName>Consortium of Lichen Herbaria</organizationName>
      <electronicMailAddress>CNALH.help@gmail.com</electronicMailAddress>
      <onlineUrl>https://lichenportal.org/portal/index.php</onlineUrl>
    </creator>
    <metadataProvider>
      <organizationName>Consortium of Lichen Herbaria</organizationName>
      <electronicMailAddress>CNALH.help@gmail.com</electronicMailAddress>
      <onlineUrl>https://lichenportal.org/portal/index.php</onlineUrl>
    </metadataProvider>
    <pubDate>2024-09-08</pubDate>
    <language>eng</language>
    <abstract>
      <para>PH (the botanical herbarium of the Academy of Natural Sciences) is the oldest institutional herbarium in the United States. It is a national resource for material from 1750-1850. The diatom herbarium (ANSP) is managed separately.</para>
    </abstract>
    <contact>
      <organizationName>Academy of Natural Sciences of Drexel University - Lichens</organizationName>
      <phone/>
      <electronicMailAddress>ans_ph_herbarium@drexel.edu</electronicMailAddress>
      <onlineUrl>http://www.ansp.org/research/systematics-evolution/collections/botany/</onlineUrl>
      <addr>
        <deliveryPoint>Botany Department, 1900 Benjamin Franklin Parkway</deliveryPoint>
        <city>Philadelphia</city>
        <administrativeArea>PA</administrativeArea>
        <postalCode>19103</postalCode>
        <country>United States</country>
      </addr>
    </contact>
    <associatedParty>
      <individualName>
        <surName>Smith</surName>
        <givenName>Chelsea</givenName>
      </individualName>
      <electronicMailAddress>ans_ph_herbarium@drexel.edu</electronicMailAddress>
      <positionName>Collection Manager</positionName>
      <role>contentProvider</role>
    </associatedParty>
    <intellectualRights>
      <para>To the extent possible under law, the publisher has waived all rights to these data and has dedicated them to the <ulink url="http://creativecommons.org/licenses/by-nc/3.0/"><citetitle/></ulink></para>
    </intellectualRights>
  </dataset>
  <additionalMetadata>
    <metadata>
      <symbiota id="5fe21305-e0fb-4af6-b838-44ed2b95378e">
        <dateStamp>2024-09-08T07:46:16-07:00</dateStamp>
        <citation identifier="079a6ec0-d004-4839-ba8a-476337e4b42b">Consortium of Lichen Herbaria - 079a6ec0-d004-4839-ba8a-476337e4b42b</citation>
        <physical>
          <characterEncoding>UTF-8</characterEncoding>
          <dataFormat>
            <externallyDefinedFormat>
              <formatName>Darwin Core Archive</formatName>
            </externallyDefinedFormat>
          </dataFormat>
        </physical>
        <collection identifier="cb974399-6966-4d9a-9755-515c9c5d0929" id="33">
          <alternateIdentifier>https://lichenportal.org/portal/collections/misc/collprofiles.php?collid=33</alternateIdentifier>
          <parentCollectionIdentifier>PH</parentCollectionIdentifier>
          <collectionIdentifier/>
          <collectionName>Academy of Natural Sciences of Drexel University - Lichens</collectionName>
          <resourceLogoUrl>https://lichenportal.org/cnalh/content/collicon/ph.jpg</resourceLogoUrl>
          <onlineUrl>http://www.ansp.org/research/systematics-evolution/collections/botany/</onlineUrl>
          <intellectualRights>http://creativecommons.org/licenses/by-nc/3.0/</intellectualRights>
          <associatedParty>
            <individualName>
              <surName>Smith</surName>
              <givenName>Chelsea</givenName>
            </individualName>
            <electronicMailAddress>ans_ph_herbarium@drexel.edu</electronicMailAddress>
            <positionName>Collection Manager</positionName>
          </associatedParty>
          <abstract>
            <para>&lt;p&gt;PH (the botanical herbarium of the Academy of Natural Sciences) is the oldest institutional herbarium in the United States. It is a national resource for material from 1750-1850. The diatom herbarium (ANSP) is managed separately.&lt;/p&gt;</para>
          </abstract>
        </collection>
      </symbiota>
    </metadata>
  </additionalMetadata>
</eml:eml>

fyi @nicofranz @themerekat - an example of how data reviews on data with known provenance can help pinpoint very specific issues.

jhpoelen commented 1 month ago

For demonstration, processing the suspicious data with mlr (Miller), a well-known CSV processing tool,

cat suspicious.csv | mlr --icsv --oxtab cat 

produced:

mlr: mlr: CSV header/data length mismatch 95 != 48 at filename (stdin) row 2.
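A comparable header/data width check can be sketched in Python. This is an illustration only: Python's lenient `csv` parsing is assumed here and need not split the broken row the same way mlr does, so the counts may differ from the 95 and 48 reported above. The rows are hypothetical, shortened stand-ins.

```python
import csv
import io


def width_mismatches(text):
    """List (row_number, header_width, row_width) for rows that differ from the header.

    Row numbers count the header as row 1, a convention chosen for this sketch.
    """
    reader = csv.reader(io.StringIO(text))  # lenient (non-strict) parsing
    header = next(reader)
    return [
        (number, len(header), len(row))
        for number, row in enumerate(reader, start=2)
        if len(row) != len(header)
    ]


# Hypothetical, shortened rows; only the escaping pattern follows the issue.
HEADER = "id,dynamicProperties,country\n"
AS_FOUND = HEADER + r'''4818286,"{""exsTitle"":""Lichenes \\"additis\\" quos"",""exsNumber"":""11""}",Italy''' + "\n"
AS_EXPECTED = HEADER + r'''4818286,"{""exsTitle"":""Lichenes \""additis\"" quos"",""exsNumber"":""11""}",Italy''' + "\n"

print(width_mismatches(AS_FOUND))
print(width_mismatches(AS_EXPECTED))
```

Once the lone quote closes the field early, the commas inside the JSON are read as field separators, which is one plausible source of the width mismatch.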

however,

cat suspicious-updated.csv | mlr --icsv --oxtab cat

produced:

id                             4818286
institutionCode                PH
collectionCode                 
ownerInstitutionCode           
collectionID                   cb974399-6966-4d9a-9755-515c9c5d0929
basisOfRecord                  PreservedSpecimen
occurrenceID                   3352d191-35a0-4d8f-8c39-98dc57fcf24b
catalogNumber                  PH00792055
otherCatalogNumbers            
higherClassification           Fungi|Dikarya|Ascomycota|Pezizomycotina|Lecanoromycetes|Lecanoromycetidae|Peltigerales|Collematineae|Pannariaceae|Degelia
kingdom                        Fungi
phylum                         Ascomycota
class                          Lecanoromycetes
order                          Peltigerales
family                         Pannariaceae
scientificName                 Degelia plumbea
taxonID                        54246
scientificNameAuthorship       (Lightf.) P.M. Jørg. & P. James
genus                          Degelia
subgenus                       
specificEpithet                plumbea
verbatimTaxonRank              
infraspecificEpithet           
taxonRank                      Species
identifiedBy                   
dateIdentified                 
identificationReferences       
identificationRemarks          
taxonRemarks                   
identificationQualifier        
typeStatus                     
recordedBy                     M. Anzi
recordNumber                   s.n.
eventDate                      
year                           
month                          
day                            
startDayOfYear                 
endDayOfYear                   
verbatimEventDate              s.d.
occurrenceRemarks              Ex Herbarium of the Swedish Museum of Natural History (S); Lichen Herbarium of James C. Lendemer.; Lichenes Rariores Veneti "additis nonnullis speciebus ex vicinis regionibus" quos ex herbario Massalongiano in continuationem lichenum Italiae exsiccatorum excerpsit evulgavitque presb. Martinus Anzi eq. Mauritianus in seminario Novo-Comensi professor [Anzi, Lich. Rar. Veneti [Novi-Comi]], 1-175, Martino Anzi, exs #: 11
habitat                        in collibus vallibusque Etruriae [in the hills and valleys of Etruria]; ad arborum [to the trees], praesertim Castanearum, et Olearum, truncos [especially the trunks of Chestnuts and Olives]
behavior                       
vitality                       
fieldNumber                    
eventID                        
informationWithheld            
dataGeneralizations            
dynamicProperties              exsiccatae: {"exsTitle":"Lichenes Rariores Veneti \"additis nonnullis speciebus ex vicinis regionibus\" quos ex herbario Massalongiano in continuationem lichenum Italiae exsiccatorum excerpsit evulgavitque presb. Martinus Anzi eq. Mauritianus in seminario Novo-Comensi professor","exsAbbreviation":"Anzi, Lich. Rar. Veneti [Novi-Comi]","exsRange":"1-175","exsEditor":"Martino Anzi","exsNumber":"11"}
associatedOccurrences          
associatedSequences            
associatedTaxa                 
reproductiveCondition          
establishmentMeans             
lifeStage                      
sex                            
individualCount                
samplingProtocol               
preparations                   
locationID                     
continent                      
waterBody                      
islandGroup                    
island                         
country                        Italy
stateProvince                  
county                         
municipality                   
locality                       Monte Pisano, valli sopra Pistoja, Valle del Mugnone, Etruria
locationRemarks                
decimalLatitude                43.750016
decimalLongitude               10.550023
geodeticDatum                  WGS84
coordinateUncertaintyInMeters  6943
verbatimCoordinates            
georeferencedBy                aurbanz
georeferenceProtocol           
georeferenceSources            GeoLocate (CoGe)
georeferenceVerificationStatus 
georeferenceRemarks            Mounte Pisano, Tuscany, Italy. unc. covers the mountain
minimumElevationInMeters       
maximumElevationInMeters       
minimumDepthInMeters           
maximumDepthInMeters           
verbatimDepth                  
verbatimElevation              
disposition                    
language                       
recordEnteredBy                ebenamy
modified                       2024-03-23 13:26:16
rights                         http://creativecommons.org/licenses/by-nc/3.0/
rightsHolder                   
accessRights                   
recordID                       3352d191-35a0-4d8f-8c39-98dc57fcf24b
references                     https://lichenportal.org/portal/collections/individual/index.php?occid=4818286

jhpoelen commented 1 month ago

See a similar issue at https://github.com/BioKIC/symbiota-docs/issues/441 , reported on Dec 7, 2023.

themerekat commented 1 month ago

Thanks, Jorrit! With this clear description, I have added this problem to our list of bugs to fix: https://github.com/BioKIC/Symbiota/issues/1777