phac-nml / sistr_cmd

SISTR (Salmonella In Silico Typing Resource) command-line tool
Apache License 2.0

OSError when running sistr #16

Closed tom-van-wijk closed 7 years ago

tom-van-wijk commented 7 years ago

Dear Peter,

First of all, thank you very much for sharing this tool with the community! It seems very useful and I would be glad to use it.

I have all the required Python libraries and the newest version of NCBI BLAST installed. I installed sistr-cmd using pip install. However, when I use the following command:

sistr -f csv -o test.csv 1091502269_S25_scaffolds_de-novo.fasta

I get the following error:

Traceback (most recent call last):
  File "/usr/local/bin/sistr", line 9, in <module>
    load_entry_point('sistr-cmd==0.3.4', 'console_scripts', 'sistr')()
  File "/usr/local/lib/python2.7/dist-packages/sistr/sistr_cmd.py", line 320, in main
    outputs = [sistr_predict(input_fasta, genome_name, tmp_dir, keep_tmp, args) for input_fasta, genome_name in zip(input_fastas, genome_names)]
  File "/usr/local/lib/python2.7/dist-packages/sistr/sistr_cmd.py", line 187, in sistr_predict
    cgmlst_prediction, cgmlst_results = run_cgmlst(blast_runner, full=args.use_full_cgmlst_db)
  File "/usr/local/lib/python2.7/dist-packages/sistr/src/cgmlst/__init__.py", line 342, in run_cgmlst
    full=full)
  File "/usr/local/lib/python2.7/dist-packages/sistr/src/cgmlst/__init__.py", line 134, in get_allele_sequences
    msa_ref, msa_novel = msa_ref_vs_novel(ref_seq, allele_seq)
  File "/usr/local/lib/python2.7/dist-packages/sistr/src/cgmlst/msa.py", line 58, in msa_ref_vs_novel
    msa_out_dict = msa_mafft(input_fasta)
  File "/usr/local/lib/python2.7/dist-packages/sistr/src/cgmlst/msa.py", line 37, in msa_mafft
    p = Popen(['mafft', '-'], stdin=PIPE, stdout=PIPE, stderr=PIPE)
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

Hopefully you can find some time to help me out. Thank you.

Kind regards, Tom van Wijk

peterk87 commented 7 years ago

Hi Tom,

Great to hear that this tool would be useful for you!

Looking through the error report you posted, it looks like you don't have the MAFFT multiple sequence alignment program (http://mafft.cbrc.jp/alignment/software/) installed and available on your $PATH. It also looks like the README is out of date, so I'll have to update it to show that MAFFT is now required for sistr_cmd to run properly.
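For anyone who hits the same error, a quick way to confirm that the external programs are visible on $PATH is a small check like the one below. This is only a sketch and not part of sistr_cmd: the helper name `missing_tools` is made up for illustration, the executable names `mafft` and `blastn` are assumptions about what gets called, and `shutil.which` requires Python 3.3 or newer (the traceback above comes from Python 2.7).

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on the current PATH."""
    # shutil.which returns None when no matching executable is found.
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust them to match your installation.
    missing = missing_tools(["mafft", "blastn"])
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All external tools were found on PATH.")
```

If a tool is reported as missing even though it is installed, the directory containing it is probably not on $PATH for the shell you are using.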

If you have Anaconda or Miniconda installed with the BioConda channel (conda config --add channels bioconda), you might have an easier time installing sistr_cmd (or other bioinformatics software) along with all its dependencies (conda install sistr_cmd). See here for instructions on how to get Conda and Bioconda working:

https://bioconda.github.io/

Hope that solves your issue and gets you up and running!

BTW, there's also a web version of SISTR available at https://lfz.corefacility.ca/sistr-app/

tom-van-wijk commented 7 years ago

Hi Peter,

Thanks for your help! Indeed, I don't have the MAFFT aligner installed because, as you mentioned, it is not listed as a requirement in the README. I wonder, since sistr-cmd is now using MAFFT, does it no longer require BLAST?

I will install MAFFT, test it, and post my findings here today. Thanks again!

Kind regards, Tom

tom-van-wijk commented 7 years ago

Hi Peter,

sistr-cmd seems to be working fine now. It achieved 100% accuracy on my test set consisting of 7 serovars. I am also positively surprised by the short runtime and low system requirements! Great!

Kind regards, Tom

peterk87 commented 7 years ago

Great to hear that you got it working!

BLAST is used for sequence searching (antigen genes, cgMLST genes) and MAFFT multiple sequence alignment is used to validate cgMLST alleles extracted from input genomes.
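The traceback earlier in this thread shows MAFFT being started with `Popen(['mafft', '-'], stdin=PIPE, stdout=PIPE, stderr=PIPE)`, i.e. the sequences are piped in on stdin. The snippet below is a rough sketch of that kind of call with a friendlier error when the program is missing. It is not the actual sistr_cmd code: the function name `align_with_mafft` and its `executable` parameter are hypothetical, and it assumes Python 3.

```python
from subprocess import PIPE, Popen


def align_with_mafft(fasta_text, executable="mafft"):
    """Pipe FASTA text to MAFFT on stdin and return the aligned FASTA text."""
    try:
        # Same argument list as in the traceback: '-' tells MAFFT to read stdin.
        proc = Popen([executable, "-"], stdin=PIPE, stdout=PIPE, stderr=PIPE)
    except OSError as exc:
        # A missing executable surfaces as OSError (Errno 2), as in this issue.
        raise RuntimeError(
            "Could not start '%s' (%s). Is MAFFT installed and on your PATH?"
            % (executable, exc)
        ) from exc
    stdout, stderr = proc.communicate(fasta_text.encode())
    if proc.returncode != 0:
        raise RuntimeError("MAFFT failed: " + stderr.decode(errors="replace"))
    return stdout.decode()
```

Wrapping the `OSError` this way turns "No such file or directory" into a message that names the missing program, which is the piece of information the original error did not make obvious.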

I've updated the README to reflect the changes in dependencies, output options, etc.

> I am also positively surprised by the short runtime and low system requirements! Great!

You can run sistr_cmd with multiple threads (e.g. --threads <ncpu>) if you have multiple input genomes. This could significantly speed up runtime on large datasets. :smiley:
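As an illustration, a multi-genome run could be scripted roughly as follows. This is a sketch only: it uses just the options that appear in this thread (`-f`, `-o`, `--threads`), the helper name `build_sistr_command` and the file names are hypothetical, and the option order is an assumption.

```python
import subprocess


def build_sistr_command(fastas, output, threads=1, output_format="csv"):
    """Build the argument list for one sistr run over several input genomes."""
    # Only options mentioned in this thread are used: --threads, -f and -o.
    command = ["sistr", "--threads", str(threads), "-f", output_format, "-o", output]
    return command + list(fastas)


if __name__ == "__main__":
    # Hypothetical input files; replace them with your own assemblies.
    cmd = build_sistr_command(["genome_a.fasta", "genome_b.fasta"], "results.csv", threads=4)
    print(" ".join(cmd))
    # Uncomment to actually run sistr (requires sistr_cmd, BLAST and MAFFT):
    # subprocess.check_call(cmd)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.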

Let me know if you have any other issues!