shenwei356 / gtdb-taxdump

GTDB taxonomy taxdump files with trackable TaxIds
MIT License

Getting GTDB taxID for newly generated MAGs #7

Open PeterCx opened 1 year ago

PeterCx commented 1 year ago

Hi there,

I have MAGs generated from my own samples and annotated using GTDB-Tk. They have been de-replicated, leaving me with a unique set that is not genetically close to any reference genome in GTDB R207. Can I use this tool to get taxIDs for my MAGs? I want to build a custom database from my MAGs and GTDB, and for this I require taxIDs.

It's not clear to me whether this is possible using this tool. Your help is greatly appreciated.

Kind regards,

P

shenwei356 commented 1 year ago

Sure. This tutorial could be a reference.

Steps:

  1. Exporting taxonomic lineages of taxa with rank equal to species from GTDB-taxdump, into tabular format.

    taxonkit list --data-dir gtdb-taxdump/R207/ --ids 1 --indent "" \
        | taxonkit filter --data-dir gtdb-taxdump/R207/ --equal-to species \
        | taxonkit reformat --data-dir gtdb-taxdump/R207/ --taxid-field 1 \
            --format "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" \
            -o gtdb.tsv
  2. For the new MAGs, you need to prepare the full lineages in 7-column tabular format. You may create new species names.

    # custom.tsv
    $ cat custom.tsv 
    A       B       C       D       E       F       G
  3. Creating taxdump from lineages above.

    (cut -f 2- gtdb.tsv; cat custom.tsv) \
        | taxonkit create-taxdump \
            -R "superkingdom,phylum,class,order,family,genus,species" \
            -O taxdump
  4. Some tests.

    $ echo G | taxonkit name2taxid --data-dir taxdump/
    G       1630414510
    
    $ echo 1630414510 | taxonkit lineage --data-dir taxdump/ -r
    1630414510      A;B;C;D;E;F;G   species
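
Step 2 leaves open how the 7-column file is produced. A minimal sketch, assuming the MAG lineages come as GTDB-Tk classification strings of the form `d__...;p__...;...;s__` and that the names in gtdb.tsv carry no rank prefixes (worth checking against your own gtdb.tsv). The input file `mags.tsv`, the MAG id `MAG001`, and the `<genus> sp. <MAG id>` placeholder naming are hypothetical choices for illustration, not something TaxonKit prescribes:

```shell
# Hypothetical input: MAG id and GTDB-Tk classification string, tab-separated.
printf 'MAG001\td__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__\n' > mags.tsv

awk -F'\t' -v OFS='\t' '{
    n = split($2, rank, ";")
    # Strip the rank prefixes (d__, p__, c__, ...).
    for (i = 1; i <= n; i++) sub(/^[a-z]__/, "", rank[i])
    # No species assigned: invent a placeholder species name (assumed scheme).
    if (rank[7] == "") rank[7] = rank[6] " sp. " $1
    print rank[1], rank[2], rank[3], rank[4], rank[5], rank[6], rank[7]
}' mags.tsv > custom.tsv
```

The resulting custom.tsv has the same layout as the example in step 2 and can be passed to step 3 unchanged. MAGs that are also unresolved at genus or higher ranks would need similar placeholders at those ranks.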

PeterCx commented 1 year ago

Thank you very much. This worked, in that I was able to generate TaxIDs for my MAGs.

However, I think something is incorrect. I was using a custom taxdump that had been generated previously, and I assumed this taxdump would be updated to include my MAGs. Instead, a different taxdump is generated, which is much smaller, although it does contain my MAGs. See the files below. I am not sure what the problem is.

Old_TaxDump_Names.dmp.txt New_TaxDump_Names.dmp.txt
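
One way to narrow down a size difference like this is to list the names that appear in one names.dmp but not the other. A minimal sketch, assuming the standard NCBI-style names.dmp layout with fields separated by tab, pipe, tab; the two small files and their entries are made up for illustration, so substitute the real old and new names.dmp:

```shell
# Made-up stand-ins for the old and new names.dmp files.
printf '1\t|\troot\t|\t\t|\tscientific name\t|\n2\t|\tBacteria\t|\t\t|\tscientific name\t|\n3\t|\tExample_taxon\t|\t\t|\tscientific name\t|\n' > old_names.dmp
printf '1\t|\troot\t|\t\t|\tscientific name\t|\n9\t|\tBacteria\t|\t\t|\tscientific name\t|\n' > new_names.dmp

# Pull out the name column (2nd field) and sort it so comm can compare.
awk -F'\t[|]\t' '{print $2}' old_names.dmp | sort -u > old.names
awk -F'\t[|]\t' '{print $2}' new_names.dmp | sort -u > new.names

# Names that exist only in the old taxdump.
comm -23 old.names new.names > only_in_old.txt
cat only_in_old.txt
```

Comparing names rather than TaxIds keeps the check meaningful even when the two taxdumps assign different TaxIds to the same name.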

Your help is appreciated.

Kind regards,

P

shenwei356 commented 1 year ago

So how was the old one generated?

PeterCx commented 1 year ago

It was downloaded from here:

http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release207/taxdump/

I am using another tool called Struo2 to generate a Kraken2 GTDB database and then update it with my custom MAGs. The creator of this tool says he produced this specific taxdump using your tool.

shenwei356 commented 1 year ago

I see. The taxdump files in Struo2 were probably generated by @nick-youngblut with an older version of TaxonKit, which might produce different TaxId values for the same lineage.

I'm not sure whether @nick-youngblut performed other transformations, because TaxonKit has saved TaxIds as int32 instead of uint32 since v0.14.0 (Nov 28, 2022), as BLAST and DIAMOND do.

Alternatively, you can try TaxonKit v0.12.0, which is probably the version he used, and regenerate the taxdump files.
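
Since the largest value a signed 32-bit integer can hold is 2147483647, any TaxId above that value cannot have been stored as int32. A minimal sketch that counts such TaxIds in a names.dmp; the file content and the output file name `over_int32.count` are made up for illustration, so point awk at the real names.dmp instead:

```shell
# Made-up names.dmp with one TaxId above the int32 maximum.
printf '1\t|\troot\t|\t\t|\tscientific name\t|\n3750000000\t|\tExample_taxon\t|\t\t|\tscientific name\t|\n1630414510\t|\tG\t|\t\t|\tscientific name\t|\n' > names.dmp

# Count TaxIds (first tab-separated field) larger than 2^31 - 1.
awk -F'\t' '$1 > 2147483647 {n++} END {print n + 0}' names.dmp > over_int32.count
cat over_int32.count
```

A non-zero count suggests the taxdump was written with unsigned TaxIds; a count of zero is inconclusive on its own.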

nick-youngblut commented 1 year ago

I'm not sure whether @nick-youngblut performed other transformations, because TaxonKit has saved TaxIds as int32 instead of uint32 since v0.14.0 (Nov 28, 2022), as BLAST and DIAMOND do.

I did not conduct any transformations. I used the taxdump as-is for GTDB-r207. The taxdump files were downloaded in June 2022.