KeyError line 532 in mob_cluster.py

awh082834 commented 1 year ago

Hi, I am trying to test mob_cluster to build a large database of plasmids. While testing I have run into a KeyError on line 532 in mob_cluster.py.

mob_cluster --mode build -f plasmid_multifasta.fasta -p test.txt -t acc_species.txt --outdir test_build

2023-05-24 17:17:28,388 root INFO: Running Mob-Suite Clustering toolkit v. 3.1.0 [in /home/.../miniconda3/lib/python3.7/site-packages/mob_suite/mob_cluster.py:452]
2023-05-24 17:17:28,389 root INFO: Processing fasta file plasmid_multifasta.fasta [in /home/.../miniconda3/lib/python3.7/site-packages/mob_suite/mob_cluster.py:453]
2023-05-24 17:17:28,389 root INFO: Analysis directory test_build [in /home/.../miniconda3/lib/python3.7/site-packages/mob_suite/mob_cluster.py:454]
2023-05-24 17:17:28,389 root INFO: SUCCESS: Found program blastn at /home/.../miniconda3/bin/blastn [in /home/.../miniconda3/lib/python3.7/site-packages/mob_suite/utils.py:617]
2023-05-24 17:17:28,389 root INFO: SUCCESS: Found program makeblastdb at /home/.../miniconda3/bin/makeblastdb [in /home/.../miniconda3/lib/python3.7/site-packages/mob_suite/utils.py:617]
2023-05-24 17:17:28,389 root INFO: SUCCESS: Found program tblastn at /home/.../miniconda3/bin/tblastn [in /home/.../miniconda3/lib/python3.7/site-packages/mob_suite/utils.py:617]
Traceback (most recent call last):
  File "/home/.../miniconda3/bin/mob_cluster", line 10, in <module>
    sys.exit(main())
  File "/home/.../miniconda3/lib/python3.7/site-packages/mob_suite/mob_cluster.py", line 532, in main
    organism = new_seq_info[seq_id]['organism']
KeyError: 'organism'

I'm not sure whether this error comes from the tool or from the files I used as inputs. I followed the documented format for the -t file and generated the -p file from the multifasta of plasmids, as described in a previous issue. Any help is appreciated!
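A note for readers hitting the same traceback: the message names the inner key ('organism'), which suggests that the lookup of seq_id succeeded and that the record for that sequence has no organism field. The sketch below reproduces that failure mode in isolation. The dictionary contents and the lookup_organism helper are hypothetical illustrations, not code from mob_cluster.py.

```python
# Hypothetical stand-in for the per-sequence metadata; NOT the real mob_cluster structure.
new_seq_info = {
    "NZ_AP026076.1": {"size": 393608},  # record exists but has no 'organism' field
}


def lookup_organism(info, seq_id):
    """Return the organism for seq_id, or a message naming the key that was missing."""
    try:
        return info[seq_id]["organism"]
    except KeyError as err:
        # err.args[0] is the key that could not be found (outer or inner).
        return "missing key: {}".format(err.args[0])


result_missing_field = lookup_organism(new_seq_info, "NZ_AP026076.1")
result_missing_id = lookup_organism(new_seq_info, "NC_009926.1")
print(result_missing_field)  # missing key: organism
print(result_missing_id)     # missing key: NC_009926.1
```

If the sequence ID itself were unknown, the KeyError would name the ID instead of 'organism', so the traceback above points at a record that never received an organism value.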

Thanks!

jrober84 commented 1 year ago

Can you share the first few lines of your test.txt and acc_species.txt files?

awh082834 commented 1 year ago

Sure!

test.txt :

sample_id       num_contigs     size    gc      md5     rep_type(s)     rep_type_accession(s)   relaxase_type(s)        relaxase_type_accession(s)      mpf_type        mpf_type_accession($
plasmid_multifasta      6       -       45.80016295798813       d3b67463a20e833d37a15d555d7e0de0        rep_cluster_1522        000876__NC_009926_00214 MOBF,MOBF,MOBF,MOBF,MOBF        NC_$

acc_species.txt

id  organism
NZ_AP026076     Acaryochloris_marina_MBIC10699
NZ_AP026077     Acaryochloris_marina_MBIC10699
NZ_AP026078     Acaryochloris_marina_MBIC10699
NZ_AP026079     Acaryochloris_marina_MBIC10699
NC_009926       Acaryochloris_marina_MBIC11017
NC_009927       Acaryochloris_marina_MBIC11017

Thank you for the help!

jrober84 commented 1 year ago

Ah, OK. You ran MOB-typer without specifying -x or --multi. Without the multi flag, MOB-typer treats the entire FASTA file as a single plasmid, so your MOB-typer output describes all of the sequences merged into one entity. The sample_id values need to match between the MOB-typer and species files.

Run MOB-typer on your sequences again with -x, and then use that output file together with your species identifications.

MOB-suite looks up the organism name in the NCBI taxonomy database, and the lookup will fail if your name doesn't match. I believe you have replaced all of the spaces in your organism names with "_", whereas the name in NCBI is "Acaryochloris marina MBIC10699".
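Both points above can be checked before running mob_cluster. The following is a minimal sketch, assuming both files are tab-delimited with the header names shown in this thread (sample_id, id, organism). The helper names are hypothetical, and the inline sample text stands in for the real files, which could be loaded with open(path).read().

```python
import csv
import io


def read_rows(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def compare_ids(mobtyper_text, taxonomy_text):
    """Return (ids only in the MOB-typer report, ids only in the taxonomy file)."""
    typer_ids = {row["sample_id"] for row in read_rows(mobtyper_text)}
    tax_ids = {row["id"] for row in read_rows(taxonomy_text)}
    return sorted(typer_ids - tax_ids), sorted(tax_ids - typer_ids)


def underscored_names(taxonomy_text):
    """Return organism names containing '_', which may not match NCBI names."""
    rows = read_rows(taxonomy_text)
    return sorted({row["organism"] for row in rows if "_" in row["organism"]})


# Sample data modelled on the first set of files posted in this thread.
mobtyper_text = "sample_id\tnum_contigs\nplasmid_multifasta\t6\n"
taxonomy_text = (
    "id\torganism\n"
    "NZ_AP026076\tAcaryochloris_marina_MBIC10699\n"
    "NC_009926\tAcaryochloris_marina_MBIC11017\n"
)

only_typer, only_tax = compare_ids(mobtyper_text, taxonomy_text)
flagged = underscored_names(taxonomy_text)
print(only_typer)  # ['plasmid_multifasta']
print(only_tax)    # ['NC_009926', 'NZ_AP026076']
print(flagged)
```

Two empty ID lists and an empty flagged list would indicate that the files agree on sample IDs and that no organism name contains an underscore. It does not confirm that the names exist in the NCBI taxonomy database.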

Hope that helps!

awh082834 commented 1 year ago

It seems that I am getting the same error. Here are examples of each of the input files.

test.txt

sample_id       num_contigs     size    gc      md5     rep_type(s)     rep_type_accession(s)   relaxase_type(s)        relaxase_type_accession(s)      mpf_type        mpf_type_accession(s)   ori$
NZ_AP026076.1 1 393608  45.12408284384464       861d889a756bd5ac51968e6882f2f9ee        -       -       MOBF    NC_009927_00233 MPF_T   NC_009927_00237 -       -       conjugative     CP000839   $
NZ_AP026077.1 1 329949  46.69085222261622       840c6b922f7e8b9b9b028f79162fd924        -       -       MOBF    NC_009931_00136 MPF_T   NC_009931_00124 -       -       conjugative     CP000843   $
NZ_AP026078.1 1 303490  46.09575274308873       48588e50149bb0f56c9520d34418ff94        -       -       MOBF    NC_009929_00047 MPF_T   NC_009930_00158 -       -       conjugative     CP000838   $
NZ_AP026079.1 1 205174  43.208691159698596      3facb92109e65f2e622448046f573ee4        -       -       MOBF    NC_009932_00077 MPF_T   NC_009930_00058,NC_009932_00071 -       -       conjugative$
NC_009926.1 1   374161  47.34833400594931       f2978c55c5f74900466debf49a768e3a        rep_cluster_1522        000876__NC_009926_00214 MOBF    NC_009926_00327 MPF_T   NC_009926_00331 -       -  $
NC_009927.1 1   356087  45.33667334106553       4148acb719c6a0b45229eb58a301259a        -       -       MOBF    NC_009927_00233 MPF_T   NC_009927_00237 -       -       conjugative     CP000839   $

acc_species.txt

id      organism
NZ_AP026076.1   Acaryochloris marina MBIC10699
NZ_AP026077.1   Acaryochloris marina MBIC10699
NZ_AP026078.1   Acaryochloris marina MBIC10699
NZ_AP026079.1   Acaryochloris marina MBIC10699
NC_009926.1     Acaryochloris marina MBIC11017
NC_009927.1     Acaryochloris marina MBIC11017

Header example of plasmid_multifasta.fasta

>NZ_AP026076.1
ACCTTGTTCTTAAGCGTTTGATTAAAAACTGTAGGCCACCAAAAAATAAGACTTCAAATTCTCGCGAGAA
TCCAACACCATTAACATCTGGCTACCCCACATCTTGAAACAGGATTGATAGCCGAGTGATTAATGCTCCC

I double-checked that the IDs match across all of the files, but I still get the KeyError on 'organism' described in the original comment.
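For anyone repeating this check, a manual comparison can miss header descriptions or version suffixes such as ".1". The sketch below compares FASTA header IDs against the IDs in a table. It assumes the sequence ID is the first whitespace-delimited token of each header line, which is a common convention but not confirmed here as what mob_cluster uses. The function names are hypothetical.

```python
def fasta_ids(fasta_text):
    """Return the first whitespace-delimited token of every '>' header line."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def missing_from(reference_ids, candidate_ids):
    """Return the reference IDs that do not appear among the candidate IDs."""
    candidates = set(candidate_ids)
    return [seq_id for seq_id in reference_ids if seq_id not in candidates]


# Illustrative input: the second table ID lacks the '.1' version suffix.
fasta_text = (
    ">NZ_AP026076.1\n"
    "ACCTTGTTCTTAAGCG\n"
    ">NC_009926.1 example description\n"
    "ACGTACGT\n"
)
table_ids = ["NZ_AP026076.1", "NC_009926"]

headers = fasta_ids(fasta_text)
unmatched = missing_from(headers, table_ids)
print(headers)    # ['NZ_AP026076.1', 'NC_009926.1']
print(unmatched)  # ['NC_009926.1']
```

An empty unmatched list means every FASTA header ID has an exact counterpart in the table. Running the same comparison against both the MOB-typer report and the species file covers all three inputs.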

Thank you for all your help so far!

jrober84 commented 1 year ago

Thanks for the info. I am unable to replicate your error with the latest code pull. I recommend installing from GitHub via pip:

pip install git+https://github.com/phac-nml/mob-suite

Could you try that out and see if it resolves your issue?

phac-nml / mob-suite

KeyError line 532 in mob_cluster.py #137