snayfach closed this issue 2 years ago
Thank you, Stephen.
There's a bug with the command you used: the column name of the accession column was treated as one of the ranks, which messed up all the ranks. I've fixed it but haven't released it yet. Please use the binary here:
- fix a bug in handling non-GTDB data when using `-A/--field-accession` with no rank names given.
But that doesn't seem to be the issue you ran into. Can you paste some data to reproduce it?
I figured out what happened. Please wait a few minutes.
Fixed. The old default regular expression `^\w\w_(.+)$` is meant to remove the `GB_` or `RS_` prefix from accessions such as `GB_GCA_001941065.1` in GTDB taxonomy data, but it also wrongly removed the `Sp_` prefix. Now it's changed:
--field-accession-re string regular expression to extract assembly accession (default "^(.+)$")
I also fixed the command to create a taxdump from MGV data.
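To make the difference between the two defaults concrete, here is a small sketch in Python rather than taxonkit's own code. It assumes the tool keeps capture group 1 when the pattern matches and leaves the accession untouched otherwise; that fallback and the helper name `extract_accession` are assumptions, and `Sp_genome1` is a made-up accession. Python's `re` module is not the same engine as Go's `regexp`, but for these two patterns and ASCII input they should behave the same.

```python
import re

OLD_DEFAULT = r"^\w\w_(.+)$"  # old default quoted in this thread
NEW_DEFAULT = r"^(.+)$"       # new default quoted in this thread


def extract_accession(raw: str, pattern: str) -> str:
    """Return capture group 1 if the pattern matches, else the input unchanged.

    The "unchanged on no match" fallback is an assumption for illustration.
    """
    match = re.match(pattern, raw)
    return match.group(1) if match else raw


if __name__ == "__main__":
    # "Sp_genome1" is a hypothetical accession; the other two appear in this thread.
    for accession in ["GB_GCA_001941065.1", "Sp_genome1", "GCF_001027105.1"]:
        old = extract_accession(accession, OLD_DEFAULT)
        new = extract_accession(accession, NEW_DEFAULT)
        print(f"{accession}\told={old}\tnew={new}")
```

Any accession that starts with two word characters followed by an underscore is shortened by the old pattern, which is why `Sp_` was stripped along with `GB_` and `RS_`. The new pattern keeps the whole string.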
Fixed! ... An unrelated question I was hoping you could answer: how should I format the input file for sequences that are unclassified at a given rank? Can I use "unclassified" or an empty string "" or do I need to include the parent taxon e.g. "unclassified_proteobacteria"?
Just leave it blank (empty string ""). The accession will then point to the closest node above the blank rank in taxid.map:
$ cat taxonomy.tsv | csvtk pretty -t
id                superkingdom   phylum       class     order        family              genus            species
---------------   ------------   ----------   -------   ----------   -----------------   --------------   ---------------------
GCF_001027105.1   Bacteria       Firmicutes   Bacilli   Bacillales   Staphylococcaceae   Staphylococcus   Staphylococcus aureus
test              Bacteria       Firmicutes   Bacilli   Bacillales   Staphylococcaceae   Staphylococcus
$ taxonkit create-taxdump -A 1 taxonomy.tsv -O t --force
$ cat t/taxid.map | taxonkit lineage --data-dir t/ -i 2 -t | csvtk pretty -Ht
GCF_001027105.1 1569132721 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus 609216830;3642462009;1845768359;813944714;1997712377;1824050977;1569132721
test 1824050977 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus 609216830;3642462009;1845768359;813944714;1997712377;1824050977
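The rule shown by the output above (a blank rank makes the accession land on the nearest filled rank above it) can be sketched in a few lines of Python. This is only an illustration of the idea, not how taxonkit implements it: the function name `deepest_assigned` is hypothetical, the rank list is copied from the example header, and the sketch only covers blanks at the lowest ranks, not gaps in the middle of a lineage.

```python
from typing import Optional, Tuple

# Rank columns in the order used by the example taxonomy.tsv above.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def deepest_assigned(row: dict) -> Optional[Tuple[str, str]]:
    """Return (rank, name) for the lowest rank that has a non-empty value."""
    for rank in reversed(RANKS):
        name = row.get(rank, "").strip()
        if name:
            return rank, name
    return None  # every rank was blank


if __name__ == "__main__":
    # Same lineage as the "test" row above: species left as an empty string.
    test_row = {
        "superkingdom": "Bacteria",
        "phylum": "Firmicutes",
        "class": "Bacilli",
        "order": "Bacillales",
        "family": "Staphylococcaceae",
        "genus": "Staphylococcus",
        "species": "",
    }
    print(deepest_assigned(test_row))
```

For the `test` row this picks the genus, which matches the taxid.map output where `test` shares its TaxId with the last entry of its lineage.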
I built a taxdump using a custom taxonomy with the command:
taxonkit create-taxdump genome_taxonomy.tsv -A 1 -O out --force
A few of the accessions in `genome_taxonomy.tsv` start with "Sp_", and I noticed this prefix was removed in the `taxid.map` output file, causing some issues. I'll find a workaround, but thought you might want to know.