rrwick / Metagenomics-Index-Correction

GNU General Public License v3.0

tax_from_gtdb - prepare sequence files error #8

Open Alxdu opened 4 years ago

Alxdu commented 4 years ago

tax_from_gtdb.py did not prepare all the genome sequence .fna files available in the genome/ folder.

I downloaded and extracted the latest GTDB reference database (24706 genome .fna files) into a genome/ folder. I also concatenated the bacterial and archaeal taxonomy .tsv files into taxonomy.tsv.

I ran the recommended input and output command:

$ ./tax_from_gtdb.py --gtdb taxonomy.tsv --assemblies genomes --nodes centrifuge_gtdb/gtdb.tree --names centrifuge_gtdb/gtdb.name --conversion centrifuge_gtdb/gtdb.conv --cat_fasta centrifuge_gtdb/gtdb.fa

The issue is that the .conv and .fa output files only include information from genomic .fna files whose names start with UBA (e.g., UBA9354_genomic.fna), while all files whose names begin with RS_GCF... or GB_GCA... are ignored (e.g., RS_GCF_001692445.1_genomic.fna and GB_GCA_002718915.1_genomic.fna). The .tree and .name files appear to be complete.
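To rule out a problem with the input folder itself, it may help to tally the genome files by name prefix before looking at the outputs. The following is only a minimal sketch: the function names are hypothetical and not part of tax_from_gtdb.py, and it assumes the three prefixes mentioned above (UBA, RS_, GB_) are the only ones of interest.

```python
from collections import Counter
from pathlib import Path


def accession_prefix(filename):
    """Classify a genome file name by the prefixes seen in this issue."""
    for prefix in ("RS_", "GB_", "UBA"):
        if filename.startswith(prefix):
            return prefix
    return "other"


def tally_prefixes(filenames):
    """Count how many file names fall under each prefix."""
    return Counter(accession_prefix(name) for name in filenames)


def tally_directory(directory):
    """Tally .fna / .fna.gz files found directly inside a directory."""
    names = [p.name for p in Path(directory).iterdir()
             if p.name.endswith((".fna", ".fna.gz"))]
    return tally_prefixes(names)


if __name__ == "__main__":
    import sys
    # Hypothetical usage: python tally_prefixes.py genomes
    if len(sys.argv) > 1:
        print(tally_directory(sys.argv[1]))
```

If the tally shows RS_ and GB_ files in the folder, the input is intact and the gap lies somewhere between the input and the .conv/.fa outputs.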

PS: The same issue occurs when preparing files for building the Kraken2 database: files starting with RS or GB are not processed, while files starting with UBA are.

PPS: A sample .conv file is attached: gtdb.conv.zip

ilnamkang commented 4 years ago

Hi,

I'm not involved in this project, so I'm not sure whether this approach is right.

Anyway, I ran 'tax_from_gtdb.py' successfully using the GTDB-Tk data file, which is available at https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/gtdbtk_r89_data.tar.gz

After decompressing the downloaded file, I ran 'tax_from_gtdb.py' as below.

$ ./tax_from_gtdb.py --gtdb /gtdbtk/release89/taxonomy/gtdb_taxonomy.tsv --assemblies /gtdbtk/release89/fastani --nodes r89.tree --names r89.name --conversion r89.conv --cat_fasta r89.fa

The input "gtdb_taxonomy.tsv" file has 24,706 lines, and the "fastani" directory has 24,706 *.fna.gz files.

The resulting .conv and .fa files seem to contain information for all 24,706 genomes.
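For anyone who wants to run the same kind of check on the inputs, here is a rough sketch that compares the accessions in the taxonomy file against the genome file names. It rests on assumptions that may not hold for every release: that the first tab-separated column of the taxonomy file is the accession, and that each genome file is named as the accession followed by `_genomic.fna` or `_genomic.fna.gz`. The function names are hypothetical. If the two naming conventions differ (for example, in whether an RS_ or GB_ prefix is present), the difference will show up as entries in both returned lists.

```python
SUFFIXES = ("_genomic.fna.gz", "_genomic.fna")


def accession_from_filename(filename):
    """Strip the assumed '_genomic.fna[.gz]' suffix; None if it is absent."""
    for suffix in SUFFIXES:
        if filename.endswith(suffix):
            return filename[: -len(suffix)]
    return None


def taxonomy_accessions(lines):
    """Collect the first tab-separated field of each non-empty line."""
    return {line.split("\t", 1)[0].strip() for line in lines if line.strip()}


def compare(taxonomy_lines, filenames):
    """Return (accessions with no file, files with no taxonomy entry)."""
    in_taxonomy = taxonomy_accessions(taxonomy_lines)
    on_disk = {acc for acc in map(accession_from_filename, filenames) if acc}
    return sorted(in_taxonomy - on_disk), sorted(on_disk - in_taxonomy)
```

Two empty lists would mean every taxonomy entry has a matching file and vice versa, under the naming assumption stated above.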