nick-youngblut / gtdb_to_taxdump

Convert GTDB taxonomy to NCBI taxdump format
MIT License

ncbi-gtdb_map.py (GTDB => NCBI) results in all NAs #20

Closed Somebodyatthdoor closed 1 year ago

Somebodyatthdoor commented 1 year ago

Hi,

I am having a problem converting GTDB taxonomies to NCBI taxonomies. When I run the command below, I get only NA results in my output file.

I first run wget to get the necessary GTDB files (I have to do this because of security-related restrictions on the server I am on):

    wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/ar122_metadata_r95.tar.gz
    wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/bac120_metadata_r95.tar.gz
    tar -xvzf ar122_metadata_r95.tar.gz
    tar -xvzf bac120_metadata_r95.tar.gz
    ncbi-gtdb_map.py -q gtdb_taxonomy -o taxonomies Species_gtdb_lineages.txt ar122_metadata_r95.tsv bac120_metadata_r95.tsv

Input file: Species_gtdb_lineages.txt

Output file: taxonomy_map_summary.txt

The GTDB IDs that I have in my input file are present in the metadata files, so I am unsure what it is I am doing wrong.

Thank you very much for making this tool, Laura

nick-youngblut commented 1 year ago

Can you please share how your Species_gtdb_lineages.txt file is formatted?

By default ncbi-gtdb_map.py assumes that the column containing the queries (taxonomies) is the first column in the table.

Also, are you using --no-prefix?

Somebodyatthdoor commented 1 year ago

Hi, the file is a single column (no header) containing the full GTDB classifications: Species_gtdb_lineages.txt

I have tried running the command both with and without the --no-prefix flag. But I always get no hits:

    2023-02-09 08:58:23,147 - Loading: ar122_metadata_r95.tsv
    2023-02-09 08:58:23,388 - Entries lacking an NCBI taxonomy: 153
    2023-02-09 08:58:23,388 - Completeness-filtered entries: 1
    2023-02-09 08:58:23,388 - Contamination-filtered entries: 69
    2023-02-09 08:58:23,389 - Entries used: 2850
    2023-02-09 08:58:23,389 - Loading: bac120_metadata_r95.tsv
    2023-02-09 08:58:38,645 - Entries lacking an NCBI taxonomy: 0
    2023-02-09 08:58:38,645 - Completeness-filtered entries: 17
    2023-02-09 08:58:38,646 - Contamination-filtered entries: 2253
    2023-02-09 08:58:38,646 - Entries used: 189257
    2023-02-09 08:58:38,646 - Reading in queries: Species_gtdb_lineages.txt
    2023-02-09 08:58:38,649 - No. of queries: 1790
    2023-02-09 08:58:38,649 - No. of de-rep queries: 628
    2023-02-09 08:58:38,649 - Batching queries...
    2023-02-09 08:58:38,649 - No. of batches: 1
    2023-02-09 08:58:38,650 - Queries per batch: 628
    2023-02-09 08:58:38,650 - Querying taxonomies...
    2023-02-09 08:58:38,652 - PID27975: Finished! Queries=628, Hits=0, No-Hits=628
    2023-02-09 08:58:38,655 - File written: taxonomies/taxonomy_map_summary.tsv

Cheers, Laura

nick-youngblut commented 1 year ago

Each query needs to be a single taxonomic level (for example, just the species), not a full lineage. See https://github.com/nick-youngblut/gtdb_to_taxdump/blob/master/tests/data/ncbi-gtdb/ncbi_tax_queries.txt for an example. I'll update the docs to clarify.
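
For anyone hitting the same problem, here is a minimal Python sketch of one way to reduce full lineages to a single level before querying. It assumes each input line is a semicolon-delimited GTDB lineage with rank prefixes (d__, p__, c__, o__, f__, g__, s__). The helper names extract_rank and single_level_queries are hypothetical and are not part of gtdb_to_taxdump; compare the result against the linked example file to confirm the format the script expects.

```python
from typing import Iterable, List, Optional


def extract_rank(lineage: str, rank: str = "s", keep_prefix: bool = True) -> Optional[str]:
    """Return the taxon at one rank from a full GTDB lineage string.

    Assumes a semicolon-delimited lineage such as
    'd__Bacteria;p__Proteobacteria;...;s__Escherichia coli'.
    Returns None if the rank is missing or empty (e.g. 's__').
    """
    prefix = rank + "__"
    for taxon in lineage.split(";"):
        taxon = taxon.strip()
        if taxon.startswith(prefix):
            name = taxon[len(prefix):]
            if not name:
                return None
            return taxon if keep_prefix else name
    return None


def single_level_queries(
    lineages: Iterable[str], rank: str = "s", keep_prefix: bool = True
) -> List[str]:
    """Collect unique single-level taxa, preserving first-seen order."""
    seen = set()
    queries = []
    for lineage in lineages:
        taxon = extract_rank(lineage, rank, keep_prefix)
        if taxon is not None and taxon not in seen:
            seen.add(taxon)
            queries.append(taxon)
    return queries
```

The resulting names could then be written one per line to a new query file and passed to ncbi-gtdb_map.py in place of the full-lineage file. Whether to keep the rank prefix presumably depends on whether --no-prefix is used, so check the script's help text before choosing.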

Somebodyatthdoor commented 1 year ago

Brilliant, thanks, that's solved it.