njdowdy / tpt-taxonomy

Foundational taxonomic resources for the TPT project
GNU General Public License v3.0

taxonIDs are reused across names for mammal host taxonomy #18

Closed jhpoelen closed 1 year ago

jhpoelen commented 1 year ago

As far as I understand, taxonIDs are meant to identify specific taxa.

However, in the TPT mammal host taxonomy, it appears that a few taxonIDs are reused across many taxon names.

When generating a frequency table of taxonID values, you'd expect each taxonID to appear only once.

However, when running

curl --silent -L https://raw.githubusercontent.com/njdowdy/tpt-taxonomy/main/host_files/Mammalia-standardized-v2.csv \
 | mlr --csv cut -f taxonID \
 | sort \
 | uniq -c \
 | sort -nr

the results (shown below) indicate that taxonID 220 is used 6,369 times, and that 180, 140, and 100 are also reused (1,328, 170, and 27 times respectively).

   6369 220
   1328 180
    170 140
     27 100
      1 taxonID
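
To see which names share a taxonID, rather than only how often each ID occurs, something like the following might help. It is a minimal sketch using only the Python standard library. The `taxonID` column comes from the file above, but the `scientificName` column name, the helper function names, and the sample rows are assumptions made for illustration and are not taken from the real file.

```python
import csv
import io
from collections import defaultdict


def names_by_taxon_id(csv_text, id_col="taxonID", name_col="scientificName"):
    """Map each taxonID to the list of names that use it.

    name_col is an assumed column name; adjust it to match the actual file.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[id_col]].append(row[name_col])
    return dict(groups)


def reused_ids(groups):
    """Keep only the taxonIDs that are attached to more than one name."""
    return {tid: names for tid, names in groups.items() if len(names) > 1}


# Hypothetical sample rows for illustration only.
SAMPLE = """taxonID,scientificName
220,Mus musculus
220,Rattus rattus
180,Mus
301,Felis catus
"""

if __name__ == "__main__":
    for tid, names in reused_ids(names_by_taxon_id(SAMPLE)).items():
        print(tid, len(names), names)
```

If taxonIDs were unique per taxon, `reused_ids` would return an empty dict.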

@njdowdy @EMTuckerLab curious to hear your ideas on the taxonID assignment of the TPT mammal host taxonomy.

jhpoelen commented 1 year ago

I've created a patch (https://github.com/njdowdy/tpt-taxonomy/issues/18). Please review and accept it if you agree with the changes.

jhpoelen commented 1 year ago

resolved via https://github.com/njdowdy/tpt-taxonomy/pull/20