immcantation / presto

pRESTO is part of the Immcantation analysis framework for Adaptive Immune Receptor Repertoire sequencing (AIRR-seq). pRESTO is a bioinformatics toolkit for processing high-throughput lymphocyte receptor sequencing data.
https://presto.readthedocs.io
GNU Affero General Public License v3.0

AssemblePairs get stuck #71

Closed ssnn-airr closed 4 years ago

ssnn-airr commented 4 years ago

Original report by Julian Zhou (Bitbucket: jqz, GitHub: julianqz).


I have no reliable way to reproduce this (it doesn’t happen every time, but does happen ~9 out of 10 times in my experience [on Farnam]), but AssemblePairs.py sequential --aligner blastn tends to get stuck before finishing, anywhere between 5% and 95% of the way through. @{5a0336c6c24b5074212438b7} suggested that it might be a file system issue with blastn; a potential alternative would be to replace blastn with something like a Smith-Waterman algorithm implemented in native Python.

ssnn-airr commented 4 years ago

Original comment by Julian Zhou (Bitbucket: jqz, GitHub: julianqz).


Duplicate of #65.

ssnn-airr commented 4 years ago

Original comment by Hailong Meng (Bitbucket: hmeng, GitHub: hlmeng).


I had this problem with the older version of the presto image. With the new version, it seems fine for me.

ssnn-airr commented 4 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


No, MaskPrimers doesn’t call blastn. I do recall having what appeared to be disk-related issues on Farnam though. So, it could be something outside our control.

ssnn-airr commented 4 years ago

Original comment by Julian Zhou (Bitbucket: jqz, GitHub: julianqz).


Oh, there’s actually already an issue (#65) for this! I created this one after bringing it up at the subgroup meeting, where Steve said that we should make a note of it even if it’s hard to reproduce.

I haven’t tried using it with usearch.

As a workaround, I’ve been breaking my input files into chunks and passing them individually to AssemblePairs. It still hangs sometimes, or there is a core dump, and when I compare the number of input reads with the number of output reads (passed + failed), a discrepancy appears. But this way at least I only have to re-run one small chunk, as opposed to waiting for the entire lot to run through again.
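For anyone wanting to script the chunk-and-compare workaround described above, here is a minimal sketch in plain Python. It is not part of pRESTO. The function names (`count_fastq_reads`, `split_fastq`, `reads_missing`) and the chunk file naming are hypothetical, and it assumes uncompressed FASTQ with exactly four lines per record. For paired-end input, both read files would need to be split with the same chunk size so that mates stay in matching chunks.

```python
import itertools
from pathlib import Path


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming exactly 4 lines per record."""
    with open(path) as handle:
        return sum(1 for _ in handle) // 4


def split_fastq(path, chunk_size, out_dir):
    """Write consecutive chunks of `chunk_size` records and return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(path).stem
    chunks = []
    with open(path) as handle:
        for index in itertools.count(1):
            # Read the next chunk_size records (4 lines each) from the input.
            lines = list(itertools.islice(handle, 4 * chunk_size))
            if not lines:
                break
            chunk_path = out_dir / f"{stem}_chunk{index}.fastq"
            chunk_path.write_text("".join(lines))
            chunks.append(chunk_path)
    return chunks


def reads_missing(n_input, n_pass, n_fail):
    """Number of input reads unaccounted for in the pass and fail outputs."""
    return n_input - (n_pass + n_fail)
```

A non-zero result from `reads_missing` for a chunk would flag it for a re-run, without having to repeat the whole data set.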

ssnn-airr commented 4 years ago

Original comment by Julian Zhou (Bitbucket: jqz, GitHub: julianqz).


@{557058:3ff8019e-a4ea-4ddd-bc5f-9c57815dc74b} mentioned that he experienced a similar issue with MaskPrimers (which also calls blastn?)

ssnn-airr commented 4 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


Does it get stuck if you use usearch?

Also, native Smith-Waterman in Python is really slow. There are some C-coded Striped Smith-Waterman libraries for Python out there, so that would be a better approach. Last I checked, I couldn’t get them to install, but that was years ago.
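To illustrate why a pure-Python Smith-Waterman is slow, here is a minimal scoring-only sketch. It is not pRESTO code. The function name and the scoring parameters (match 2, mismatch -1, linear gap -2) are assumptions chosen for illustration. The nested loop visits every pair of positions, so each read pair costs time proportional to the product of the two sequence lengths, all in interpreted Python.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score using a linear gap penalty."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for char_a in a:
        current = [0]  # first column is always 0 for local alignment
        for j, char_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            # Clamp at 0 so an alignment can restart anywhere (local alignment).
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best
```

This keeps only two rows in memory and returns the score without a traceback. A real replacement for the blastn step would also need the alignment coordinates.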

The blastn/usearch wrapper setup is pretty inefficient right now. Passing bigger chunks of sequences (maybe even just 1 chunk) into blastn/usearch for reference alignment might help.
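As a rough sketch of the batching idea above (not how pRESTO's wrapper is actually written), many sequences could be written to one FASTA file and aligned in a single blastn call. The helper names `write_fasta`, `build_blastn_command` and `run_batch` are hypothetical, `database` is assumed to be a nucleotide BLAST database built beforehand, and the timeout is an added assumption so that a hung call raises `subprocess.TimeoutExpired` instead of blocking forever.

```python
import subprocess
import tempfile
from pathlib import Path


def write_fasta(records, path):
    """Write (identifier, sequence) pairs to a FASTA file."""
    with open(path, "w") as handle:
        for identifier, sequence in records:
            handle.write(f">{identifier}\n{sequence}\n")


def build_blastn_command(query_fasta, database, evalue=1e-5):
    """Build a blastn command that emits tabular output (outfmt 6)."""
    return [
        "blastn",
        "-query", str(query_fasta),
        "-db", str(database),
        "-outfmt", "6",
        "-evalue", str(evalue),
    ]


def run_batch(records, database, timeout=600):
    """Align a whole batch in one blastn call; return rows of tabular hits."""
    with tempfile.TemporaryDirectory() as tmp:
        query = Path(tmp) / "batch.fasta"
        write_fasta(records, query)
        # Raises subprocess.TimeoutExpired if blastn does not finish in time.
        result = subprocess.run(
            build_blastn_command(query, database),
            capture_output=True, text=True, timeout=timeout, check=True,
        )
    return [line.split("\t") for line in result.stdout.splitlines()]
```

One process launch per batch, rather than one per sequence, reduces process start-up and temporary-file traffic. Whether that would also avoid the hang on Farnam is not established in this thread.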