nick-youngblut / gtdb_to_taxdump

Convert GTDB taxonomy to NCBI taxdump format
MIT License

KeyError: 'orig_name' #5

Closed Askarbek-orakov closed 3 years ago

Askarbek-orakov commented 3 years ago

Hi @nick-youngblut,

I ran into the error below; here is a reproducible example.

python ncbi-gtdb_map.py -q gtdb_taxonomy <(echo "s__Aciduliprofundum boonei") https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/ar122_metadata_r95.tar.gz https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/bac120_metadata_r95.tar.gz
2021-01-25 00:22:07,239 - Loading: https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/ar122_metadata_r95.tar.gz
2021-01-25 00:22:11,626 -   Completeness-filtered entries: 1
2021-01-25 00:22:11,626 -   Contamination-filtered entries: 70
2021-01-25 00:22:11,626 -   Entries used: 3002
2021-01-25 00:22:11,626 - Loading: https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/bac120_metadata_r95.tar.gz
2021-01-25 00:22:55,409 -   Completeness-filtered entries: 17
2021-01-25 00:22:55,409 -   Contamination-filtered entries: 2253
2021-01-25 00:22:55,409 -   Entries used: 189257
2021-01-25 00:22:55,409 - Reading in queries: /dev/fd/63
2021-01-25 00:22:55,410 - No. of queries: 1
2021-01-25 00:22:55,410 - No. of de-rep queries: 1
2021-01-25 00:22:55,410 -   No. of batches: 1
2021-01-25 00:22:55,410 -   Queries per batch: 1
2021-01-25 00:22:55,410 - Querying taxonomies...
Traceback (most recent call last):
  File "ncbi-gtdb_map.py", line 568, in <module>
    main(args)
  File "ncbi-gtdb_map.py", line 564, in main
    write_table(idx, args.outdir, qtax=args.query_taxonomy)
  File "ncbi-gtdb_map.py", line 487, in write_table
    for x in idx:
  File "ncbi-gtdb_map.py", line 373, in _query_tax
    LCA = lca_many_nodes(G[ttax], tips, lca_frac=lca_frac)
  File "ncbi-gtdb_map.py", line 347, in lca_many_nodes
    lca[0] = G.nodes[lca[0]]['orig_name']
KeyError: 'orig_name'

I guess it is due to missing levels in the NCBI taxonomy in the GTDB metadata file, e.g.: d__Archaea;p__Euryarchaeota;c__;o__;f__;g__Aciduliprofundum;s__Aciduliprofundum boonei
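
For illustration, here is a minimal sketch of how such empty ranks could be spotted in a lineage string. The helper name `empty_ranks` is hypothetical and not part of ncbi-gtdb_map.py; it assumes the usual semicolon-separated `d__...;p__...` prefix format.

```python
def empty_ranks(lineage):
    """Return the rank prefixes (e.g. 'c', 'o') that carry no name in a lineage string."""
    missing = []
    for level in lineage.split(';'):
        # Split 'c__Name' into prefix 'c', separator '__', and name 'Name'
        prefix, sep, name = level.strip().partition('__')
        # Only count well-formed levels (separator present) with an empty name
        if sep and not name.strip():
            missing.append(prefix)
    return missing


lineage = ('d__Archaea;p__Euryarchaeota;c__;o__;f__;'
           'g__Aciduliprofundum;s__Aciduliprofundum boonei')
print(empty_ranks(lineage))  # ['c', 'o', 'f']
```

A value such as "none" has no `__` separator, so this helper reports no empty ranks for it; that case would need its own check.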

Also, this example doesn't seem to be the only cause of this error. Do you know what else could result in such an error?

Thank you!

nick-youngblut commented 3 years ago

Thanks for pointing out this edge case! One of the s__Aciduliprofundum boonei genomes has only "none" for the NCBI taxonomy, so the LCA was the root node, which has no 'orig_name' attribute, and that caused the KeyError. I'll have the script filter out all records with "none" for the NCBI taxonomy.
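
A rough sketch of the kind of filter described above, not the actual change made to the script. The record layout (dicts with an `ncbi_taxonomy` key), the accession values, and the function name `drop_missing_ncbi` are all assumptions made for illustration.

```python
def drop_missing_ncbi(records, key='ncbi_taxonomy'):
    """Keep only records whose NCBI taxonomy is present and not the literal 'none'."""
    kept = []
    for rec in records:
        # Treat a missing key, None, or empty string the same as 'none'
        value = (rec.get(key) or '').strip()
        if value and value.lower() != 'none':
            kept.append(rec)
    return kept


# Hypothetical records; real metadata rows have many more columns
records = [
    {'accession': 'genome_A',
     'ncbi_taxonomy': 'd__Archaea;p__Euryarchaeota;c__;o__;f__;'
                      'g__Aciduliprofundum;s__Aciduliprofundum boonei'},
    {'accession': 'genome_B', 'ncbi_taxonomy': 'none'},
]
filtered = drop_missing_ncbi(records)
print([r['accession'] for r in filtered])  # ['genome_A']
```

Filtering before the graph is built means a record without any NCBI lineage never contributes a tip, so it cannot pull the LCA up to the root.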

nick-youngblut commented 3 years ago

It should now be fixed.