immcantation / presto

pRESTO is a bioinformatics toolkit for processing high-throughput lymphocyte receptor sequencing data. It is part of the Immcantation analysis framework for Adaptive Immune Receptor Repertoire sequencing (AIRR-seq).
https://presto.readthedocs.io
GNU Affero General Public License v3.0

AssemblePairs hanging (most likely due to blastn?) #65

Open ssnn-airr opened 6 years ago

ssnn-airr commented 6 years ago

Original report by Julian Zhou (Bitbucket: jqz, GitHub: julianqz).


Is there a way to catch such behaviour? There's hardly anything worse than waking up in the morning, fully expecting a job to have finished running, only to realize that a core dump had occurred and that the process was stuck on the AP step forever. Somehow this never triggers a job failure and never gets caught by slurm (this has happened to me multiple times so far; a possible watchdog sketch follows the log excerpt below).

SCAN_REVERSE> True
MIN_IDENT> 0.5
EVALUE> 1e-05
MAX_HITS> 100
FILL> False
ALIGNER> blastn
NPROC> 20

PROGRESS> 04:12:26 |                    |   0% (      0) 0.0 min
PROGRESS> 04:14:05 |#                   |   5% ( 37,145) 1.7 min
PROGRESS> 04:15:44 |##                  |  10% ( 74,290) 3.3 min
PROGRESS> 04:17:24 |###                 |  15% (111,435) 5.0 min
PROGRESS> 04:19:04 |####                |  20% (148,580) 6.6 min
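
One way to catch this kind of hang (a hedged sketch, not a feature of pRESTO itself) is to cap the step's wall-clock time with GNU timeout so a stuck run turns into an ordinary nonzero exit that slurm can report. The file names, reference FASTA, and the 6-hour limit below are placeholders:

```bash
#!/usr/bin/env bash
# Hedged sketch: cap the step's wall-clock time with GNU timeout so a hung
# AssemblePairs run becomes a normal job failure instead of sitting forever.
# Input/reference file names and the 6-hour limit are placeholders.
set -euo pipefail

timeout 6h AssemblePairs.py sequential \
    -1 sample_R1_primers-pass.fastq \
    -2 sample_R2_primers-pass.fastq \
    -r germline_reference.fasta \
    --aligner blastn --nproc 20 \
  || { echo "AssemblePairs hung or crashed; failing the job." >&2; exit 1; }
```
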
ssnn-airr commented 4 years ago

Original comment by Julian Zhou (Bitbucket: jqz, GitHub: julianqz).


Issue #71 was marked as a duplicate of this issue.

ssnn-airr commented 6 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


The pipeline script should dump errors to a separate file (logs/pipeline.err).

This looks like a compilation issue with the scientific python libraries. Try using the singularity image? Make sure to specify --cleanenv to singularity exec, and maybe --containall if you still have issues with the image accessing the host environment.
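
For reference, a hedged example of those Singularity invocations (the image file name and the wrapped command are placeholders):

```bash
# Placeholder image name; --cleanenv keeps host environment variables out of
# the container, and --containall additionally contains PID, IPC, and mounts.
singularity exec --cleanenv immcantation_suite.sif AssemblePairs.py --version

# Stricter variant if the container still picks up host state:
singularity exec --cleanenv --containall immcantation_suite.sif AssemblePairs.py --version
```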

ssnn-airr commented 6 years ago

Original comment by Julian Zhou (Bitbucket: jqz, GitHub: julianqz).


So I'm not sure how much there is to do about this after all, unless you want to add a warning that gets issued if the step runs for more than a certain amount of time. Otherwise, feel free to close the issue!

ssnn-airr commented 6 years ago

Original comment by Julian Zhou (Bitbucket: jqz, GitHub: julianqz).


Yep, not on m915. Everything was run on nx360, on both Farnam and Ruddle.

It seems like when this happens, you just have to keep re-running it until it works through the step... sometimes re-running multiple times (see the retry sketch after the backtrace below).

Just now, a sample finished re-running the AP step, but then this happened:

25/10/2018 10:23:26
IDENTIFIER: 9
DIRECTORY: /home/qz93/project/ellebedy_bulk/presto/sample_9/
PRESTO VERSION: 0.5.10-2018.10.19

START
   1: AssemblePairs sequential 10:23 10/25/18
ERROR:
    *** Error in `/ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/bin/python': free(): invalid pointer: 0x00002b2fb75c8120 ***
    ======= Backtrace: =========
    /lib64/libc.so.6(+0x7c619)[0x2b2f8130c619]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/python3.5/site-packages/numpy/core/multiarray.cpython-35m-x86_64-linux-gnu.so(+0x7c236)[0x2b2f8a507236]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/python3.5/site-packages/numpy/core/multiarray.cpython-35m-x86_64-linux-gnu.so(+0x21dfe)[0x2b2f8a4acdfe]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/python3.5/site-packages/pandas-0.17.1-py3.5-linux-x86_64.egg/pandas/lib.cpython-35m-x86_64-linux-gnu.so(+0x75f4a)[0x2b2fb444df4a]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/python3.5/site-packages/pandas-0.17.1-py3.5-linux-x86_64.egg/pandas/lib.cpython-35m-x86_64-linux-gnu.so(+0x7a2cd)[0x2b2fb44522cd]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(PyCFunction_Call+0xe9)[0x2b2f806cd5c9]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/python3.5/site-packages/pandas-0.17.1-py3.5-linux-x86_64.egg/pandas/lib.cpython-35m-x86_64-linux-gnu.so(+0x218fd)[0x2b2fb43f98fd]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/python3.5/site-packages/pandas-0.17.1-py3.5-linux-x86_64.egg/pandas/lib.cpython-35m-x86_64-linux-gnu.so(+0x4b1dd)[0x2b2fb44231dd]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x8696)[0x2b2f807620c6]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(+0x1727a1)[0x2b2f807637a1]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x6661)[0x2b2f80760091]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(+0x1727a1)[0x2b2f807637a1]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x6661)[0x2b2f80760091]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(+0x1727a1)[0x2b2f807637a1]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(PyEval_EvalCodeEx+0x23)[0x2b2f80763893]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(+0xb8cb5)[0x2b2f806a9cb5]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(PyObject_Call+0x6a)[0x2b2f80678e8a]
    /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(+0xa0954)[0x2b2f80691954]
    /ycga-gpfs/apps/hpc/software/P

(and then it goes on and on like this for pages)

Like o_O. But then I ran it again and it pushed through to the next step (MaskPrimers with Internal C).
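
Since the workaround so far is just to keep re-running the step, a small retry wrapper along these lines could automate that (purely a sketch; the function name and attempt count are arbitrary):

```bash
# Hypothetical retry wrapper: run the wrapped command until it exits 0,
# giving up after three attempts. Combine it with the timeout sketch above
# so a hung attempt is killed rather than waited on forever.
retry() {
    local attempt
    for attempt in 1 2 3; do
        "$@" && return 0
        echo "Attempt ${attempt} failed or hung; retrying." >&2
    done
    return 1
}

# Usage (placeholder arguments):
# retry timeout 6h AssemblePairs.py sequential -1 R1.fastq -2 R2.fastq -r ref.fasta
```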

ssnn-airr commented 6 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


If it happens on both Farnam and Ruddle, then it's probably not an issue with those older CPU systems we have in the kleinstein queue. The m915s, I think? Whichever systems those were that we had to compile R packages for separately.

You could try restricting the types of nodes used, in case it is a CPU architecture compatibility issue. There are some instructions regarding that in the "Software" and "Compute Hardware" sections here: https://research.computing.yale.edu/support/hpc/clusters/farnam
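
For example, pinning the job to one node type in the slurm submission script might look like this (the feature tag and node names are placeholders; the real values are cluster-specific and listed on the pages above):

```bash
#!/usr/bin/env bash
#SBATCH --job-name=presto_assemble
#SBATCH --cpus-per-task=20
#SBATCH --time=12:00:00
# Placeholder feature tag: pin the job to a single CPU generation so every
# attempt runs on the same architecture.
#SBATCH --constraint=broadwell
# Or exclude specific problem nodes instead (placeholder node names):
##SBATCH --exclude=c13n01,c13n02

# ... pipeline commands go here ...
```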

ssnn-airr commented 6 years ago

Original comment by Julian Zhou (Bitbucket: jqz, GitHub: julianqz).


(Thank goodness. That would be a lifesaver :D)

ssnn-airr commented 6 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


Probably? I've never actually seen AssemblePairs hang, but that looks like something that's stuck.

(BTW - You can use the triple-backtick code fence syntax for large blocks. You don't have to backtick every line.)

ssnn-airr commented 6 years ago

Original comment by Julian Zhou (Bitbucket: jqz, GitHub: julianqz).


Also, if I'm seeing this, chances are it probably hung too, right? (Not 100% sure because there's no core dump file yet, but it's been like this for over an hour.)

PROGRESS> 08:48:43 |                    |   0% (      0) 0.0 min
PROGRESS> 08:48:58 |#                   |   5% (  5,602) 0.3 min
PROGRESS> 08:49:15 |##                  |  10% ( 11,204) 0.5 min
PROGRESS> 08:49:30 |###                 |  15% ( 16,806) 0.8 min
PROGRESS> 08:49:45 |####                |  20% ( 22,408) 1.0 min
PROGRESS> 08:50:01 |#####               |  25% ( 28,010) 1.3 min
PROGRESS> 08:50:16 |######              |  30% ( 33,612) 1.6 min
PROGRESS> 08:50:31 |#######             |  35% ( 39,214) 1.8 min
PROGRESS> 08:50:46 |########            |  40% ( 44,816) 2.1 min
PROGRESS> 08:51:03 |#########           |  45% ( 50,418) 2.3 min
PROGRESS> 08:51:19 |##########          |  50% ( 56,020) 2.6 min
PROGRESS> 08:51:34 |###########         |  55% ( 61,622) 2.9 min
PROGRESS> 08:51:50 |############        |  60% ( 67,224) 3.1 min
PROGRESS> 08:52:05 |#############       |  65% ( 72,826) 3.4 min
PROGRESS> 08:52:21 |##############      |  70% ( 78,428) 3.6 min
PROGRESS> 08:52:37 |###############     |  75% ( 84,030) 3.9 min
PROGRESS> 08:52:53 |################    |  80% ( 89,632) 4.2 min
PROGRESS> 08:53:08 |#################   |  85% ( 95,234) 4.4 min
PROGRESS> 08:53:24 |##################  |  90% (100,836) 4.7 min
PROGRESS> 08:53:39 |################### |  95% (106,438) 4.9 min

ssnn-airr commented 6 years ago

Original comment by Julian Zhou (Bitbucket: jqz, GitHub: julianqz).


The last time this happened, I was using usearch on Farnam. This time, this was with blastn on Ruddle.

ssnn-airr commented 6 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


Does it happen anywhere other than Farnam?

You could try using usearch for the aligner instead. It isn't in the image, but you should be able to just put the binary in one of the mount point folders and run it from there.

Not sure that'll help though, because this has a Farnam weirdness smell to it...
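
If you do want to try usearch from inside the image, a hedged example of mounting the binary might look like this (paths, image name, and the --aligner/--exec options are assumptions based on the AssemblePairs documentation):

```bash
# Placeholder paths and image name: bind the host folder holding the usearch
# binary into the container and point the pRESTO step at it. Assumes the
# sequential subcommand accepts --aligner/--exec as documented for reference.
singularity exec --cleanenv -B "$HOME/bin:/mnt/bin" immcantation_suite.sif \
    AssemblePairs.py sequential \
        -1 sample_R1_primers-pass.fastq \
        -2 sample_R2_primers-pass.fastq \
        -r germline_reference.fasta \
        --aligner usearch --exec /mnt/bin/usearch --nproc 20
```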