qunfengdong / BLCA


Update 2.blca_main.py #12

Closed koopkaup closed 5 years ago

koopkaup commented 5 years ago

Python 3-compatible code, with some speed improvements and code reorganization.

yingeddi2008 commented 5 years ago

Thanks for contributing to our program! I appreciate your effort a lot, and we welcome any future collaboration. However, we do want to make sure the most recent code on GitHub is functional. Could you please double-check that the code runs correctly? I see several possible indentation inconsistencies.

koopkaup commented 5 years ago

I have used it on an HPC cluster and it does work. However, the main bottleneck is now the read alignment step. I have 1.3M hits in my FASTA file, and after two days only 250,000 had been classified. I recommend running this part in parallel. Maybe split the fsadic and process it in chunks?
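The chunking idea can be sketched as follows. This is a minimal sketch, not the PR's code: it assumes `fsadic` is a plain dict mapping read IDs to sequences, and `classify_chunk` is a hypothetical placeholder for the per-read work that `2.blca_main.py` performs. Both `split_dict` and `classify_parallel` are illustrative names.

```python
from itertools import islice
from multiprocessing import Pool


def split_dict(records, n_chunks):
    """Split a dict into at most n_chunks smaller dicts of near-equal size."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    items = iter(records.items())
    base, extra = divmod(len(records), n_chunks)
    chunks = []
    for i in range(n_chunks):
        # The first `extra` chunks take one additional record.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer records than chunks: stop instead of emitting empty dicts
        chunks.append(dict(islice(items, size)))
    return chunks


def classify_chunk(chunk):
    """Hypothetical placeholder for the per-read classification step.

    It only records each sequence length so that the sketch runs;
    the real per-read work would go here.
    """
    return {read_id: len(seq) for read_id, seq in chunk.items()}


def classify_parallel(records, n_workers):
    """Classify the chunks in separate worker processes and merge the results."""
    chunks = split_dict(records, n_workers)
    merged = {}
    with Pool(processes=n_workers) as pool:
        for partial in pool.imap_unordered(classify_chunk, chunks):
            merged.update(partial)
    return merged
```

Calling `classify_parallel(fsadic, 8)` from under an `if __name__ == "__main__":` guard would spread the reads over eight processes. Whether this helps in practice depends on how the real per-read step handles its intermediate files, which this sketch does not model.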

qunfengdong commented 5 years ago

Thanks so much for your very helpful contribution. Just one comment regarding the speed of BLCA: it would certainly be great if a parallel version could be implemented. Our original design for BLCA was for it to be applied to a relatively small number of query sequences rather than a large number of inputs (e.g., all the raw 16S sequences). That relatively small number of query sequences typically corresponds to, e.g., some OTUs of interest (for example, OTUs with statistically significantly different abundance or prevalence between different niches). Those OTUs of interest typically require in-depth taxonomic classification. So, for an input file of thousands or tens of thousands of query sequences, BLCA is probably OK; but for an input file with millions of query sequences, BLCA is probably going to take days.


wolfgangrumpf commented 5 years ago

I am also interested in a parallel BLCA. I tried splitting the FASTA input into 10 files and then running BLCA separately on each one, but I saw these errors in the output log:

ERROR No sequences in input file
blastdbcmd is located in your PATH!
muscle is located in your PATH!

Fasta file read in!!
Reading in taxonomy information! ....
blastn is located in your PATH!
Running blast!!
Blastn Finished!!
Read in blast output!
Traceback (most recent call last):
  File "/opt/blca/2.1/2.blca_main.py", line 295, in <module>
    alndic=get_dic_from_aln(k1+".muscle")
  File "/opt/blca/2.1/2.blca_main.py", line 70, in get_dic_from_aln
    alignment=AlignIO.read(aln,"clustal")
  File "/opt/blca/2.1/lib/python2.7/site-packages/Bio/AlignIO/__init__.py", line 435, in read
    first = next(iterator)
  File "/opt/blca/2.1/lib/python2.7/site-packages/Bio/AlignIO/__init__.py", line 357, in parse
    with as_handle(handle, 'rU') as fp:
  File "/gpfs0/export/opt/anaconda-2.3.0/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/opt/blca/2.1/lib/python2.7/site-packages/Bio/File.py", line 113, in as_handle
    with open(handleish, mode, **kwargs) as fp:
IOError: [Errno 2] No such file or directory: '21758886.muscle'
Command line argument error: Argument "entry_batch". File is not accessible: `21758886.dblist'
rm: cannot remove `21758886.dblist': No such file or directory
blastdbcmd is located in your PATH!
muscle is located in your PATH!
Fasta file read in!!
Reading in taxonomy information! ....
blastn is located in your PATH!
Running blast!!
Blastn Finished!!
Read in blast output!
Traceback (most recent call last):
  File "/opt/blca/2.1/2.blca_main.py", line 295, in <module>
    alndic=get_dic_from_aln(k1+".muscle")
  File "/opt/blca/2.1/2.blca_main.py", line 70, in get_dic_from_aln
    alignment=AlignIO.read(aln,"clustal")
  File "/opt/blca/2.1/lib/python2.7/site-packages/Bio/AlignIO/__init__.py", line 435, in read
    first = next(iterator)
  File "/opt/blca/2.1/lib/python2.7/site-packages/Bio/AlignIO/__init__.py", line 382, in parse
    for a in i:
  File "/opt/blca/2.1/lib/python2.7/site-packages/Bio/AlignIO/ClustalIO.py", line 115, in __next__
    ", ".join(known_headers)))
ValueError: >21758886 is not a known CLUSTAL header: CLUSTAL, PROBCONS, MUSCLE, MSAPROBS, Kalign
srun: error: node03: tasks 3,9: Exited with exit code 1
rm: cannot remove `21758886.hitdb.fsa': No such file or directory
rm: cannot remove `21758886.hitdb.fsa': No such file or directory
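The intermediate files named in this log (21758886.muscle, 21758886.dblist, 21758886.hitdb.fsa) are named after the query ID and appear to be written to the current working directory. If two tasks handle the same read in the same directory, they could overwrite or delete each other's files, which would be consistent with tasks 3 and 9 both failing on 21758886. That is an inference from the log, not something confirmed in this thread. A minimal sketch of one way to rule it out is below: each record is written to exactly one chunk, and each chunk is run from its own directory. The helper names (`read_fasta`, `split_fasta`, `run_chunk`) are illustrative, and the command passed to `run_chunk` is supplied by the caller.

```python
import os
import subprocess


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            elif header is not None:
                seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def split_fasta(path, n_chunks, out_root):
    """Write records round-robin to out_root/chunk_XX/chunk.fasta and return the paths.

    Every record goes to exactly one chunk, so no read ID is processed twice.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    paths, handles = [], []
    for i in range(n_chunks):
        chunk_dir = os.path.join(out_root, "chunk_%02d" % i)
        os.makedirs(chunk_dir, exist_ok=True)
        paths.append(os.path.join(chunk_dir, "chunk.fasta"))
        handles.append(open(paths[-1], "w"))
    try:
        for index, (header, seq) in enumerate(read_fasta(path)):
            handles[index % n_chunks].write(">%s\n%s\n" % (header, seq))
    finally:
        for handle in handles:
            handle.close()
    return paths


def run_chunk(chunk_path, command):
    """Run `command` plus the chunk file name, with the chunk's directory as cwd.

    Intermediate files that the command writes to its working directory
    therefore stay separate from those of other chunks.
    """
    chunk_dir = os.path.dirname(os.path.abspath(chunk_path))
    return subprocess.call(command + [os.path.basename(chunk_path)], cwd=chunk_dir)
```

For example, `run_chunk(path, ["python", "/opt/blca/2.1/2.blca_main.py", "-i"])` would invoke the script from the traceback on one chunk, assuming `-i` is the input flag. Any database or taxonomy paths would probably need to be absolute, since the working directory changes. Under SLURM, it is also worth checking that each task receives a different chunk rather than all tasks running the same command.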