Open davidvilanova opened 1 year ago
Great,
I have tried so far, see shot below.
The taxonomy is two columns , fist column is accession and seconds columns are taxonomy as follows:
129138 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali
What is wrong ? there are two columns in the taxonomy file.
Taxa should be tab-separated, as shown in Example 4.
Prepare lineages
$ echo -e "129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali" \
| sed 's/;/\t/g' \
| cut -f 2-8
Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas amygdali
Create taxdump
$ echo -e "129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali" \
| sed 's/;/\t/g' \
| cut -f 2-8 \
| taxonkit create-taxdump -R "superkingdom,phylum,class,order,famil,genus,species" -O taxdump
Test
$ taxonkit list -n -r --ids 1 --data-dir taxdump/
1 [no rank] root
609216830 [superkingdom] Bacteria
1641076285 [phylum] Proteobacteria
329474883 [class] Gammaproteobacteria
86398254 [order] Pseudomonadales
1478401337 [famil] Pseudomonadaceae
1616653803 [genus] Pseudomonas
691281667 [species] Pseudomonas amygdali
If you need to map the Silver TaxIDs to the created ones.
$ echo -e "129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali" \
| sed 's/;/\t/g' \
| taxonkit create-taxdump -R "superkingdom,phylum,class,order,famil,genus,species" -O taxdump2 -A 1
$ tree taxdump2/
taxdump2/
├── delnodes.dmp
├── merged.dmp
├── names.dmp
├── nodes.dmp
└── taxid.map
$ cat taxdump2/taxid.map
129138 691281667
Great !!! I have generated the database, can i use now with taxonkit -lca ??
Sure, please follow the usage and examples.
I have the following problem. I have added a new database using the workflow above and generated the Silva to TaxIDs.
For a particular id (1824050977) i don´t have a corresponding mapping (see shot). However it is in names.dmp.
The taxid map was generated the above post with option "-A 1"
Why it is missing ?
It just maps custom IDs to TaxIDs of the taxa of the lowest rank, e.g. species.
If you need to map to taxa of other ranks, like the genus Staphylococcus. Here's the way, csvtk is needed.
cd taxdmp;
cat taxid.map \
| taxonkit lineage --data-dir . -i 2 -t -d , \
| cut -f 1,4 \
| csvtk unfold -Ht -f 2 -s ,
129138 609216830
129138 1641076285
129138 329474883
129138 86398254
129138 1478401337
129138 1616653803
129138 691281667
Great, One more comment. When using all ranks i will have a problem to get a proper taxonkit lca assignment because by default the upper levels will be picked up. That mean for a particular list of ids all with get a Bacteria assigment. Don´t know if that can be fixed ?
In this case for example the 609216830 that is a Streptoccus is matched to Bacteria. I would expect an assignation to the genus family/genus level at least.
So, mapping only IDs to taxa of species rank is reasonable.
Back to the previous concern, why did you want map IDs to the genus Staphylococcus?
For a particular id (1824050977) i don´t have a corresponding mapping (see shot)
Mapping to species is enough, cause the complete lineage of the species can be retrieved with the taxid of the species.
Could you please show how you query the LCA? What were the TaxIDs used? What's the direct purpose?
Yes sure, What i´m trying to achieve is to annotate with Silva a set of nanopore reads. For that purpose i´m mapping one metagenomic sample reads to the Silva 16S database with minimap allowing for error tolerance as nanopore reads tend to give an error rate. For each read mapped to silva i get the first 10 hits.
So for each read i use LCA to assign the upper common hit from the 10 hits, so for one read i get one hit.
Once i get that hit i would like to resolve the taxonomy (phylum,class...order.....species)
For some reads the resolution (because of the error rate) cannot be achieved at the species level, that´s why in some cases(like the Staphyloccocus case above) the best hit (output from LCA) is a genus. But could be family or order or class or phyla.
Does this makes sense?
So for each read i use LCA to assign the upper common hit from the 10 hits, so for one read i get one hit.
Good, here you've assigned LCA to each read.
Once i get that hit i would like to resolve the taxonomy (phylum,class...order.....species)
No, you can just use taxonkit lineage or taxonkit reformat -I to retrieve lineage via the LCAs, no matter what the rank they are. There's no need to query with the taxid.map
file.
Thanks for the update, there is one strange behaviour for lca. I was expecting LCA to go up to the staphylococcus genus level but it went up to bacilli. I assume something is wrong with my database.
Here is the pipeline to build the database:
wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz
gunzip SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz
infile="SILVA_138.1_SSURef_NR99_tax_silva.fasta"
#Remove unwanted taxa
zcat ${infile} | grep ">" |grep -v 'Archaea\Eukaryota\|unidentified\|metagenomes\|environmental samples' >tab_silva_taxonomy
sed -i 's/>//g' tab_silva_taxonomy #replace ">"
sed -i 's/ /@@/' tab_silva_taxonomy # replace first space with @@
sed -i 's/ /_/g' tab_silva_taxonomy # replace all spaces with "_" mainly on species names
sed -i 's/@@/'$'\t''/' tab_silva_taxonomy # restore tab after accession
sed -i 's/;/'$'\t''/g' tab_silva_taxonomy # replace commas with semicolons
awk 'NF>7' tab_silva_taxonomy | cut -f 1-8 > tmp
mv tmp tab_silva_taxonomy
taxonkit create-taxdump tab_silva_taxonomy -O taxdump --force -RR "superkingdom,phylum,class,order,family,genus,species" -A 1
Don´t what is wrong...
$ echo 883645126 1202746109 883645126 1956460793 883645126 1956460793 883645126 485431882 \
| sed -E 's/\s+/\n/g' \
| taxonkit lineage --data-dir taxdump/ -t \
| cut -f 3
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;813944714;95949142;368069282;1202746109
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;422354816;1997712377;1824050977;1956460793
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;422354816;1997712377;1824050977;1956460793
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;422354816;1997712377;1824050977;485431882
$ echo 883645126 1202746109 883645126 1956460793 883645126 1956460793 883645126 485431882 \
| sed -E 's/\s+/\n/g' \
| taxonkit lineage --data-dir taxdump/ -t \
| cut -f 2
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Sporosarcina;Staphylococcus_saprophyticus
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus_sciuri
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus_sciuri
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus_sp._mixed_culture_J3-
$ echo 883645126 1202746109 883645126 1956460793 883645126 1956460793 883645126 485431882 \
| sed -E 's/\s+/\n/g' \
| sort | uniq \
| axonkit reformat --data-dir taxdump/ -I 1 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}' \
| csvtk add-header -t -n "taxid,kingdom,phylum,class,order,family,genus,species" \
| csvtk pretty -t
taxid kingdom phylum class order family genus species
---------- -------- ---------- ------- ---------------- ----------------- -------------- -------------------------------------
1202746109 Bacteria Firmicutes Bacilli Bacillales Planococcaceae Sporosarcina Staphylococcus_saprophyticus
1956460793 Bacteria Firmicutes Bacilli Staphylococcales Staphylococcaceae Staphylococcus Staphylococcus_sciuri
485431882 Bacteria Firmicutes Bacilli Staphylococcales Staphylococcaceae Staphylococcus Staphylococcus_sp._mixed_culture_J3-3
883645126 Bacteria Firmicutes Bacilli Bacillales Bacillaceae Bacillus Streptococcus_pneumoniae
The LCA seems to be 1845768359 (Bacilli), without a doubt.
BTW, the process could be simplified.
seqkit seq -n SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz \
| grep -v 'Archaea\|Eukaryota\|unidentified\|metagenomes\|environmental samples' \
| sed 's/ /\t/' \
| sed 's/;/\t/g' \
| awk -F '\t' 'NF > 7' \
> silva_taxonomy.tsv
taxonkit create-taxdump silva_taxonomy.tsv -O taxdump \
--force -R "superkingdom,phylum,class,order,family,genus,species" -A 1
Hi, Thanks for the gtdb-taxdump. I´m working on Silva , is there any available tax dump ?