njdowdy / tpt-taxonomy

Foundational taxonomic resources for the TPT project
GNU General Public License v3.0

suggest to make headers for -standardized-v2.csv files consistent #13

Closed jhpoelen closed 1 year ago

jhpoelen commented 1 year ago

@njdowdy @EMTuckerLab et al. Thanks for providing tpt taxonomy in similar file formats.

When looking at the unique headers of the -standardized-v2.csv files, I'd expect identical headers, but instead, I found:

$ find . -type f | grep standardized-v2.csv | xargs -L1 head -n1 | sort | uniq 
source,taxonID,scientificNameID,acceptedNameUsageID,parentNameUsageID,originalNameUsageID,nameAccordingToID,namePublishedInID,taxonConceptID,scientificName,acceptedNameUsage,parentNameUsage,originalNameUsage,nameAccordingTo,namePublishedIn,namePublishedInYear,higherClassification,kingdom,phylum,class,subclass,superorder,order,suborder,infraorder,parvorder,nanorder,superfamily,family,subfamily,tribe,subtribe,genus,infragenericEpithet,specificEpithet,infraspecificEpithet,taxonRank,verbatimTaxonRank,scientificNameAuthorship,vernacularName,nomenclaturalCode,taxonomicStatus,nomenclaturalStatus,taxonRemarks,canonical
source,taxonID,scientificNameID,acceptedNameUsageID,parentNameUsageID,originalNameUsageID,nameAccordingToID,namePublishedInID,taxonConceptID,scientificName,acceptedNameUsage,parentNameUsage,originalNameUsage,nameAccordingTo,namePublishedIn,namePublishedInYear,higherClassification,kingdom,phylum,class,subclass,superorder,order,suborder,infraorder,parvorder,nanorder,superfamily,family,subfamily,tribe,subtribe,genus,infragenericEpithet,specificEpithet,infraspecificEpithet,taxonRank,verbatimTaxonRank,scientificNameAuthorship,vernacularName,nomenclaturalCode,taxonomicStatus,nomenclaturalStatus,taxonRemarks,canonicalName
source,taxonID,scientificNameID,acceptedNameUsageID,parentNameUsageID,originalNameUsageID,nameAccordingToID,namePublishedInID,taxonConceptID,scientificName,acceptedNameUsage,parentNameUsage,originalNameUsage,nameAccordingTo,namePublishedIn,namePublishedInYear,higherClassification,kingdom,phylum,class,subclass,superorder,order,suborder,infraorder,parvorder,nanorder,superfamily,family,subfamily,tribe,subtribe,genus,infragenericEpithet,specificEpithet,infraspecificEpithet,taxonRank,verbatimTaxonRank,scientificNameAuthorship,vernacularName,nomenclaturalCode,taxonomicStatus,nomenclaturalStatus,taxonRemarks,canonicalName

meaning that there are three different flavors of standardized v2 headers. Is this expected?

jhpoelen commented 1 year ago

The root causes of the header differences appear to be:

  1. host_files/Mammalia_standardized-v2.csv is using DOS line endings (i.e. \r\n) whereas all others use UNIX line endings (i.e. \n)
  2. Ixodida/Ixodida-standardized-v2.csv and Phthiraptera/Phthiraptera-standardized-v2.csv use canonical instead of canonicalName
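
Both causes can be checked per file. Here is a minimal sketch, assuming a POSIX shell with find, head, tr and grep available; the function names (has_crlf_header, uses_short_canonical, check_headers) are hypothetical and not part of the repository:

```shell
# Sketch only: helper names are made up for illustration.

# Succeeds if the header line of file $1 ends in a carriage return,
# i.e. the file uses DOS (\r\n) line endings.
has_crlf_header() {
  cr=$(printf '\r')
  head -n 1 "$1" | grep -q "${cr}\$"
}

# Succeeds if the last header column of file $1 is "canonical"
# rather than "canonicalName" (any carriage return is ignored).
uses_short_canonical() {
  head -n 1 "$1" | tr -d '\r' | grep -q ',canonical$'
}

# Report every *standardized-v2.csv file under directory $1
# (default: current directory) that shows either problem.
check_headers() {
  find "${1:-.}" -type f -name '*standardized-v2.csv' | sort |
    while IFS= read -r f; do
      if has_crlf_header "$f"; then
        echo "$f: DOS line endings"
      fi
      if uses_short_canonical "$f"; then
        echo "$f: canonical instead of canonicalName"
      fi
    done
}
```

Running `check_headers .` from the repository root should print one line per affected file and nothing once all headers agree.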

jhpoelen commented 1 year ago

@EMTuckerLab Thanks for merging my pull request to resolve this issue.