nickjcroucher / gubbins

Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins
http://nickjcroucher.github.io/gubbins/
GNU General Public License v2.0

KeyError: Not all labels matched to taxa #349

Closed pgcudahy closed 2 years ago

pgcudahy commented 2 years ago

I've run Gubbins successfully on my batch of >200 samples, but when I add an outgroup to the alignment it runs for about 20 minutes before failing with KeyError: Not all labels matched to taxa.

My command is

run_gubbins.py \
    -p Taiwan_MKansasii_outgroup \
    --threads 8 \
    --extensive-search \
    --outgroup M_abscessus_ATCC_19977 \
    Taiwan_MKansasii_outgroup.aln

The labels in my alignment look correct:

$ head Taiwan_MKansasii_outgroup.aln | grep ">"
>M_abscessus_ATCC_19977
>N200002_S396
>N200003_S397
>N200004_S398
>N200005_S399

But the full error I get is

Traceback (most recent call last):
  File "/gpfs/ysm/project/cudahy/pgc29/conda_envs/gubbins/bin/run_gubbins.py", line 33, in <module>
    sys.exit(load_entry_point('gubbins==3.2.1', 'console_scripts', 'run_gubbins.py')())
  File "/home/pgc29/project/conda_envs/gubbins/lib/python3.10/site-packages/gubbins/run_gubbins.py", line 155, in main
    gubbins.common.parse_and_run(parser.parse_args(), parser.description)
  File "/home/pgc29/project/conda_envs/gubbins/lib/python3.10/site-packages/gubbins/common.py", line 251, in parse_and_run
    reroot_tree(str(current_tree_name), input_args.outgroup)
  File "/home/pgc29/project/conda_envs/gubbins/lib/python3.10/site-packages/gubbins/common.py", line 752, in reroot_tree
    reroot_tree_with_outgroup(tree_name, outgroups.split(','))
  File "/home/pgc29/project/conda_envs/gubbins/lib/python3.10/site-packages/gubbins/common.py", line 760, in reroot_tree_with_outgroup
    outgroup_mrca = tree.mrca(taxon_labels=clade_outgroups)
  File "/home/pgc29/project/conda_envs/gubbins/lib/python3.10/site-packages/dendropy/datamodel/treemodel.py", line 4013, in mrca
    raise KeyError("Not all labels matched to taxa")
KeyError: 'Not all labels matched to taxa'

Any ideas what could be wrong?

nickjcroucher commented 2 years ago

Sorry, I can't reproduce this error with an analogous alignment locally. My guess is that the divergent outgroup isn't mapping well to the reference, so the sequence is excluded from the analysed alignment because of its high number of gaps.
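
For anyone who wants to check this on their own alignment before a long run, here is a minimal sketch that reports the percentage of missing data per sequence in a FASTA alignment. It is not Gubbins' own filter: the function name `missing_percentages` is hypothetical, and counting `-`, `N`, `n` and `?` as missing is an assumption, so Gubbins may count slightly differently. The 25.0 cut-off is taken from the filtering message quoted in this thread, and the demo sequences are made up for illustration.

```python
import io

def missing_percentages(handle, missing_chars="-Nn?"):
    """Return {label: percent of alignment positions that are gaps/unknown}.

    Assumption: characters in `missing_chars` count as missing data.
    """
    totals = {}
    missing = {}
    label = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the sequence label.
            label = line[1:].split()[0]
            totals[label] = 0
            missing[label] = 0
        elif label is not None:
            totals[label] += len(line)
            missing[label] += sum(line.count(c) for c in missing_chars)
    return {
        name: 100.0 * missing[name] / totals[name]
        for name in totals
        if totals[name] > 0
    }

# Tiny made-up alignment; replace with open("your_alignment.aln") in practice.
demo = io.StringIO(">outgroup\nNNNN----NNAC\n>sample1\nACGTACGTACGT\n")
MAX_MISSING = 25.0  # maximum reported in the filtering message in this thread

for name, pct in missing_percentages(demo).items():
    status = "above threshold" if pct > MAX_MISSING else "ok"
    print(f"{name}\t{pct:.1f}\t{status}")
```

If the outgroup comes out far above the threshold, it is probably too divergent to map usefully to the reference, and a more closely related outgroup is likely a better choice than loosening the filter.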

pgcudahy commented 2 years ago

Ah, you're correct. I now see that the earlier output included:

Filtering input alignment...
Excluded sequence M_abscessus_ATCC_19977 because it had 99.7203945041546 percentage missing data while a maximum of 25.0 is allowed
...done. Run time: 46.08 s

Sorry for the bother. I'm running this on a cluster, so it's harder to see the output logs.
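
In case it helps others running on a cluster, here is a rough sketch for scanning a saved job log for excluded sequences and checking whether any of them is the requested outgroup. The regular expression is written against the single message quoted above, so it is an assumption that other Gubbins versions use the same wording. The function name `excluded_sequences` is hypothetical, and the log lines are the ones from this thread rather than a real log file.

```python
import re

# Pattern based on the one "Excluded sequence ..." line quoted in this thread.
EXCLUDED = re.compile(
    r"Excluded sequence (?P<label>\S+) because it had (?P<pct>[\d.]+) "
    r"percentage missing data while a maximum of (?P<limit>[\d.]+) is allowed"
)

def excluded_sequences(log_lines):
    """Return {label: (percent_missing, allowed_maximum)} for excluded sequences."""
    found = {}
    for line in log_lines:
        match = EXCLUDED.search(line)
        if match:
            found[match.group("label")] = (
                float(match.group("pct")),
                float(match.group("limit")),
            )
    return found

# Lines copied from the output above; in practice, iterate over the job's log file.
log = [
    "Filtering input alignment...",
    "Excluded sequence M_abscessus_ATCC_19977 because it had 99.7203945041546 "
    "percentage missing data while a maximum of 25.0 is allowed",
    "...done. Run time: 46.08 s",
]

# Same comma-separated form as the value passed to --outgroup.
outgroups = "M_abscessus_ATCC_19977".split(",")
excluded = excluded_sequences(log)

for name in outgroups:
    if name in excluded:
        pct, limit = excluded[name]
        print(f"Outgroup {name} was filtered out ({pct:.1f}% missing, maximum {limit}%)")
```

An outgroup that shows up here is absent from the tree, which would explain the KeyError raised when rerooting.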