qunfengdong / BLCA

I am interested in parallel BLCA. I tried splitting the FASTA input.. #14

Closed wolfgangrumpf closed 5 years ago

wolfgangrumpf commented 5 years ago

I also am interested in parallel BLCA. I tried splitting the FASTA input into 10 files and then running BLCA separately on each one, but I saw these errors in the output log:

ERROR No sequences in input file
blastdbcmd is located in your PATH!
muscle is located in your PATH!
Fasta file read in!!
Reading in taxonomy information!
....
blastn is located in your PATH!
Running blast!!
Blastn Finished!!
Read in blast output!
Traceback (most recent call last):
  File "/opt/blca/2.1/2.blca_main.py", line 295, in <module>
    alndic=get_dic_from_aln(k1+".muscle")
  File "/opt/blca/2.1/2.blca_main.py", line 70, in get_dic_from_aln
    alignment=AlignIO.read(aln,"clustal")
  File "/opt/blca/2.1/lib/python2.7/site-packages/Bio/AlignIO/__init__.py", line 435, in read
    first = next(iterator)
  File "/opt/blca/2.1/lib/python2.7/site-packages/Bio/AlignIO/__init__.py", line 357, in parse
    with as_handle(handle, 'rU') as fp:
  File "/gpfs0/export/opt/anaconda-2.3.0/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/opt/blca/2.1/lib/python2.7/site-packages/Bio/File.py", line 113, in as_handle
    with open(handleish, mode, **kwargs) as fp:
IOError: [Errno 2] No such file or directory: '21758886.muscle'
Command line argument error: Argument "entry_batch". File is not accessible: `21758886.dblist'
rm: cannot remove `21758886.dblist': No such file or directory
blastdbcmd is located in your PATH!
muscle is located in your PATH!
Fasta file read in!!
Reading in taxonomy information!
....
blastn is located in your PATH!
Running blast!!
Blastn Finished!!
Read in blast output!
Traceback (most recent call last):
  File "/opt/blca/2.1/2.blca_main.py", line 295, in <module>
    alndic=get_dic_from_aln(k1+".muscle")
  File "/opt/blca/2.1/2.blca_main.py", line 70, in get_dic_from_aln
    alignment=AlignIO.read(aln,"clustal")
  File "/opt/blca/2.1/lib/python2.7/site-packages/Bio/AlignIO/__init__.py", line 435, in read
    first = next(iterator)
  File "/opt/blca/2.1/lib/python2.7/site-packages/Bio/AlignIO/__init__.py", line 382, in parse
    for a in i:
  File "/opt/blca/2.1/lib/python2.7/site-packages/Bio/AlignIO/ClustalIO.py", line 115, in __next__
    ", ".join(known_headers)))
ValueError: >21758886 is not a known CLUSTAL header: CLUSTAL, PROBCONS, MUSCLE, MSAPROBS, Kalign
srun: error: node03: tasks 3,9: Exited with exit code 1
rm: cannot remove `21758886.hitdb.fsa': No such file or directory
rm: cannot remove `21758886.hitdb.fsa': No such file or directory

Originally posted by @wolfgangrumpf in https://github.com/qunfengdong/BLCA/pull/12#issuecomment-465307469

yingeddi2008 commented 5 years ago

Hi Wolfgangrumpf,

Thanks for taking an interest in our software. I'd be happy to assist you with any issue regarding BLCA.

First, could you please check your blastn version? It should be above 2.5.0. Also, please make sure you did NOT clone the GitHub repo, but downloaded the Python 2.7 version of the package from the release tab. The current GitHub repo is a mix of Python 2.7 and Python 3 code, so it won't work properly yet.

Best,

Eddi
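For anyone who wants to script the version check, here is a minimal sketch. It assumes `blastn` is on the PATH and relies on the BLAST+ `-version` option, which prints a line such as `blastn: 2.5.0+`. The function names and the `MINIMUM_VERSION` constant are hypothetical, and whether 2.5.0 itself counts as new enough is not settled in this thread; the sketch treats it as acceptable.

```python
import re
import subprocess

# Assumption: 2.5.0 is treated as the lowest acceptable version.
MINIMUM_VERSION = (2, 5, 0)


def parse_blast_version(text):
    """Return (major, minor, patch) from text such as 'blastn: 2.5.0+'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in: %r" % text)
    return tuple(int(part) for part in match.groups())


def meets_minimum(version, minimum=MINIMUM_VERSION):
    """Compare version tuples element by element."""
    return tuple(version) >= tuple(minimum)


def installed_blastn_version(executable="blastn"):
    """Run 'blastn -version' and parse the first version number it prints."""
    result = subprocess.run(
        [executable, "-version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_blast_version(result.stdout)
```

Calling `meets_minimum(installed_blastn_version())` on the machine that runs BLCA then gives a quick yes/no answer.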


wolfgangrumpf commented 5 years ago

I am using BLAST 2.5.0. I used pyfasta to split the input file into 9 separate files. They are all in the same directory. I am BLASTing in series, so the first one finishes BLASTing and then starts the next step in the workflow while the second one is BLASTing. Should I instead create new directories for each file and segregate the jobs in those compartments?

I'm asking our HPC admin how BLCA was installed....

wolfgangrumpf commented 5 years ago

And they say that we installed it from the release tab, not by cloning the distro.

yingeddi2008 commented 5 years ago

Hi Wolfgangrumpf,

Judging from the error message, this looks like an input/output issue. Please try putting each batch in its own folder and see how it goes. Did you first try running the test file without parallelization? Did it work? If it did, you should definitely separate the input files. It also seems that you have sequences in different files that share the same IDs.

Let me know how it goes,

Eddi
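The advice above (one folder per batch, no sequence ID shared between files) can be sketched as follows. This is only an illustration and not part of BLCA or pyfasta: the function names, the `chunk_NN` directory naming, and the `input.fasta` file name are made up for the example, and the sequence ID is assumed to be the first whitespace-delimited token of each FASTA header.

```python
import os
from collections import Counter


def read_fasta(path):
    """Yield (header, sequence_lines) for each record in a FASTA file."""
    header, lines = None, []
    with open(path) as handle:
        for raw in handle:
            line = raw.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, lines
                header, lines = line[1:], []
            elif header is not None and line:
                lines.append(line)
    if header is not None:
        yield header, lines


def split_fasta(path, n_chunks, out_root):
    """Write records round-robin into out_root/chunk_NN/input.fasta.

    Raises ValueError if any sequence ID occurs more than once, since
    per-sequence temporary files named after the ID could then collide.
    """
    records = list(read_fasta(path))
    ids = [(header.split() or [""])[0] for header, _ in records]
    duplicates = sorted(i for i, n in Counter(ids).items() if n > 1)
    if duplicates:
        raise ValueError("duplicate sequence IDs: %s" % ", ".join(duplicates))

    chunk_paths = []
    for k in range(n_chunks):
        chunk = records[k::n_chunks]
        if not chunk:
            continue  # never write an empty input file
        chunk_dir = os.path.join(out_root, "chunk_%02d" % k)
        os.makedirs(chunk_dir, exist_ok=True)
        chunk_path = os.path.join(chunk_dir, "input.fasta")
        with open(chunk_path, "w") as out:
            for header, lines in chunk:
                out.write(">%s\n" % header)
                out.write("\n".join(lines) + "\n")
        chunk_paths.append(chunk_path)
    return chunk_paths
```

Skipping empty chunks matters when there are fewer sequences than requested chunks, because an empty file gives the downstream tools nothing to process.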


wolfgangrumpf commented 5 years ago

Yes, it ran, albeit very slowly after the initial BLAST job. I split the input into 9 files, each in its own directory, and executed 9 jobs, each with 2 CPUs, on a 20-CPU node. It appears to be working. The hardest part was figuring out the correct SLURM commands to make the jobs run simultaneously, but I finally got it. Thanks for your help!
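The SLURM commands themselves were not posted, so the following is not a reconstruction of them. It is a rough single-node stand-in in Python that shows the same idea: start each job with its working directory set to the folder holding its input, so temporary files from different jobs cannot overwrite each other. The function names and the `run.log` file name are hypothetical, and the command to run is passed in by the caller rather than hard-coded.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_in_own_directory(command, input_path):
    """Run command + [input file name] with cwd set to the input's folder."""
    work_dir = os.path.dirname(os.path.abspath(input_path))
    log_path = os.path.join(work_dir, "run.log")
    with open(log_path, "w") as log:
        completed = subprocess.run(
            list(command) + [os.path.basename(input_path)],
            cwd=work_dir,              # temporary files land next to the input
            stdout=log,
            stderr=subprocess.STDOUT,  # keep each job's messages in its own log
        )
    return input_path, completed.returncode


def run_all(command, input_paths, max_jobs):
    """Run one job per input file, at most max_jobs at a time."""
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        futures = [
            pool.submit(run_in_own_directory, command, path)
            for path in input_paths
        ]
        return [future.result() for future in futures]
```

A call might look like `run_all(["python", "/opt/blca/2.1/2.blca_main.py", "-i"], chunk_paths, max_jobs=9)`, where the script path is the one shown in the traceback above and the `-i` input flag is an assumption to check against the script's help text. Any script or database paths in the command should be absolute, because each job runs from a different directory.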

yingeddi2008 commented 5 years ago

I am closing this issue.

FYI, I just uploaded a utility script for merging multiple BLCA outputs. It could be useful if you want to generate count tables from BLCA taxonomy assignment.
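The utility script mentioned above is the thing to use for count tables. For simply stitching the per-chunk results back into one file, a minimal sketch could look like the one below. It is not that utility script, and it rests on an assumption about the output format: one line per sequence, with the sequence ID in the first tab-separated field. The function name `merge_outputs` is hypothetical.

```python
def merge_outputs(paths, merged_path):
    """Concatenate per-chunk result files into merged_path.

    Assumes one record per line with the sequence ID in the first
    tab-separated field. Returns the number of records written and
    raises ValueError if the same ID appears twice across the inputs.
    """
    seen = set()
    with open(merged_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    if not line.strip():
                        continue  # ignore blank lines
                    seq_id = line.split("\t", 1)[0].strip()
                    if seq_id in seen:
                        raise ValueError(
                            "sequence ID %r appears more than once (%s)"
                            % (seq_id, path)
                        )
                    seen.add(seq_id)
                    out.write(line if line.endswith("\n") else line + "\n")
    return len(seen)
```

The duplicate check is the counterpart of the one done when splitting: if an ID shows up in two chunk outputs, the inputs were probably not split cleanly.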