shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

Sequence names starting with "Sp_" #65

Closed: snayfach closed this issue 2 years ago

snayfach commented 2 years ago

I built a taxdump using a custom taxonomy with the command: taxonkit create-taxdump genome_taxonomy.tsv -A 1 -O out --force

A few of the accessions in genome_taxonomy.tsv start with "Sp_", and I noticed that this prefix was removed in the taxid.map output file, which caused some issues.

I'll find a workaround, but thought you might want to know.

shenwei356 commented 2 years ago

Thank you, Stephen.

There's a bug triggered by the command you used: the column name of the accession column is treated as one of the ranks, which messes up all the ranks. I've fixed it but haven't released it yet. Please use the binary here:

- fix bug of handling non-GTDB data when using `-A/--field-accession` and no rank names given.

But that does not seem to be the issue you ran into. Can you paste some data to reproduce it?

shenwei356 commented 2 years ago

I figured out what happened. Please wait a few minutes.

shenwei356 commented 2 years ago

Fixed. The old default regular expression `^\w\w_(.+)$` was meant to remove the GB_ or RS_ prefix from accessions such as GB_GCA_001941065.1 in GTDB taxonomy data, but it also wrongly removed the Sp_ prefix. The default has now been changed:

--field-accession-re string       regular expression to extract assembly accession (default "^(.+)$")

I also fixed the command to create a taxdump from MGV data.
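The difference between the two default patterns quoted above can be checked with a small Python sketch. It is not taxonkit's own code: the helper `extract_accession`, the sample name `Sp_genome1`, and the fallback of returning the name unchanged when the pattern does not match are all assumptions made for illustration.

```python
import re

# Old and new defaults for --field-accession-re, as quoted in this thread.
OLD_DEFAULT = r"^\w\w_(.+)$"
NEW_DEFAULT = r"^(.+)$"


def extract_accession(pattern: str, name: str) -> str:
    """Return capture group 1 if the pattern matches.

    Assumption for this sketch: a non-matching name is returned unchanged.
    """
    match = re.search(pattern, name)
    return match.group(1) if match else name


# "Sp_genome1" is a hypothetical accession; the other two appear in this thread.
names = ["GB_GCA_001941065.1", "Sp_genome1", "GCF_001027105.1"]

for name in names:
    old = extract_accession(OLD_DEFAULT, name)
    new = extract_accession(NEW_DEFAULT, name)
    print(name, old, new, sep="\t")
```

Any two word characters followed by an underscore satisfy the old pattern, so it cannot tell a GTDB source prefix from a prefix that belongs to the accession itself. If GTDB-style prefixes still need to be stripped, passing the old pattern explicitly through `--field-accession-re` should presumably restore that behaviour; `taxonkit create-taxdump --help` is the place to confirm.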

snayfach commented 2 years ago

Fixed! ... An unrelated question I was hoping you could answer: how should I format the input file for sequences that are unclassified at a given rank? Can I use "unclassified" or an empty string "" or do I need to include the parent taxon e.g. "unclassified_proteobacteria"?

shenwei356 commented 2 years ago

Just leave it blank (empty string ""). In taxid.map, the accession will then point to the closest node above the missing rank:

$ cat taxonomy.tsv  | csvtk pretty -t
id                superkingdom   phylum       class     order        family              genus            species
---------------   ------------   ----------   -------   ----------   -----------------   --------------   ---------------------
GCF_001027105.1   Bacteria       Firmicutes   Bacilli   Bacillales   Staphylococcaceae   Staphylococcus   Staphylococcus aureus
test              Bacteria       Firmicutes   Bacilli   Bacillales   Staphylococcaceae   Staphylococcus   

$ taxonkit create-taxdump -A 1 taxonomy.tsv -O t --force

$ cat t/taxid.map  | taxonkit lineage --data-dir t/ -i 2 -t  | csvtk pretty -Ht
GCF_001027105.1   1569132721   Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus   609216830;3642462009;1845768359;813944714;1997712377;1824050977;1569132721
test              1824050977   Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus                         609216830;3642462009;1845768359;813944714;1997712377;1824050977
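The rule shown in the output above (an accession with a blank rank maps to its closest assigned ancestor) can be sketched in Python. This only illustrates the described behaviour and is not how taxonkit implements it; the function `deepest_assigned` and the dictionary layout are hypothetical, while the rank names and rows are copied from the example table.

```python
from typing import Dict, Optional, Tuple

# Rank columns in the order used by the example taxonomy.tsv.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def deepest_assigned(row: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (rank, name) for the lowest rank with a non-empty name, or None."""
    for rank in reversed(RANKS):
        name = row.get(rank, "").strip()
        if name:
            return rank, name
    return None


# The two rows from the example above, keyed by rank.
full_row = {
    "superkingdom": "Bacteria", "phylum": "Firmicutes", "class": "Bacilli",
    "order": "Bacillales", "family": "Staphylococcaceae",
    "genus": "Staphylococcus", "species": "Staphylococcus aureus",
}
test_row = dict(full_row, species="")  # species left blank, as for "test"

print(deepest_assigned(full_row))
print(deepest_assigned(test_row))
```

For the `test` row the sketch lands on the genus Staphylococcus, which matches the lineage reported for `test` in the taxid.map output above.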