tavinathanson closed this issue 7 years ago
So I don't think it's related to getting BAMs as input, since it's not following that code path.
Rather, it appears to do this:
https://github.com/FRED-2/OptiType/blob/master/OptiTypePipeline.py#L286
Then:
https://github.com/FRED-2/OptiType/blob/master/OptiTypePipeline.py#L294
Then, I think it fails at:
https://github.com/FRED-2/OptiType/blob/master/OptiTypePipeline.py#L298
For whatever reason, it looks like https://github.com/FRED-2/OptiType/blob/master/OptiTypePipeline.py#L288 didn't result in a BAM being created?
I also see that the BAMs get removed when done, which explains why the other successes don't have BAMs there.
This is a dup. of what @armish hit in RCC: https://github.com/hammerlab/rcc-analyses/issues/104
Leaving it open since this is the more general repo.
Tried running this manually in the VM. Some more information:
```
Aborted (core dumped)
0:09:19.04 Mapping 0e8070747629c84c18f763603fea9545_checkpoint-trials_lung_tumor_bams_mafs_SN0109695_bams_new_samples_AG538184-7_bamAG538184-7-b2fq-PE_R2.fastq to GEN reference...
/nfs-pool/biokepi/toolkit/biopam-kit/opam_dir/opam-root-root-optitype.1.0.0/0.0.0/build/seqan.2.1.0/include/seqan/basic/basic_exception.h:363 FAILED! (Uncaught exception of type std::bad_alloc: std::bad_alloc)
stack trace:
  0 [0x72c93d]
  1 [0x75a146]
  2 [0x75a191]
  3 [0x75b149]
  4 [0x74f99c]
  5 [0x4091c4]
  6 [0x47846a]
  7 [0x4d2bb2]
  8 [0x72bc38]
  9 [0x401b23]
  10 [0x810dc6]
  11 [0x810fba]
  12 [0x404cd9]
Aborted (core dumped)
0:18:12.60 Generating binary hit matrix.
Traceback (most recent call last):
  File "<string>", line 267, in <module>
  File "hlatyper.py", line 177, in pysam_to_hdf
  File "pysam/calignmentfile.pyx", line 333, in pysam.calignmentfile.AlignmentFile.__cinit__ (pysam/calignmentfile.c:4808)
  File "pysam/calignmentfile.pyx", line 533, in pysam.calignmentfile.AlignmentFile._open (pysam/calignmentfile.c:7027)
IOError: file `tumor_dna_processing_AG538184-7/2017_02_10_16_29_59/2017_02_10_16_29_59_1.bam` not found
OptiTypePipeline returned -1
(/nfs-pool/biokepi//toolkit/biopam-kit/envs/optitype.1.0.0) opam@115e92c3b8c5:/nfs-pool-16/biokepi/work/results-b37decoy-tumor_dna_processing_AG538184-7/a8365d6f969a40ef3f6fa69c0a56ed62tumor_dna_processing_AG538184-7DNA0e8070747629c84c18f763603fea9545_checkpoint-trials_lung_tumor_bams_mafs_SN0109695_bams_new_samples_AG538184-7_bamAG538184-7-b2fq-PE_R1_fastqoptitype.d$
```
Seems like an OOM situation. At first glance, the 10 that failed were relatively large FASTQs; this might be relevant: https://github.com/seqan/seqan/issues/1276
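Since RazerS3 loads every read into memory, one rough way to spot inputs likely to hit this is to compare the FASTQ size against available RAM. A minimal sketch (hypothetical helper, not part of OptiType: the FASTQ path is a placeholder, and Linux's /proc/meminfo is assumed):

```shell
# Rough pre-check (hypothetical sketch): RazerS3 holds all reads in RAM,
# so an uncompressed FASTQ approaching available memory is a likely
# std::bad_alloc candidate. FASTQ is a placeholder path; a tiny demo
# file is created here if it does not exist.
FASTQ="${FASTQ:-demo.fastq}"
[ -f "$FASTQ" ] || printf '@r1\nACGT\n+\nIIII\n' > "$FASTQ"

fastq_bytes=$(( $(wc -c < "$FASTQ") ))
avail_bytes=$(( $(awk '/MemAvailable/ {print $2}' /proc/meminfo) * 1024 ))
echo "fastq=${fastq_bytes} bytes, available=${avail_bytes} bytes"
if [ "$fastq_bytes" -gt "$avail_bytes" ]; then
  echo "warning: input larger than available memory; expect an OOM"
fi
```

This is only a lower bound (RazerS3's in-memory structures are larger than the raw reads), but it would have flagged the failing 81GB input on a 30GB node.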
You can try remaking the cluster with bigger nodes? Or is there an argument you can pass to OptiType that tells it to use all 52GB of the default nodes?
@ihodes first trying manually on a beefed up node; but if that works, how do I remake the cluster with bigger nodes?
I'm not sure to be honest; you might be able to change it from the GCloud GKE interface, or you could take down the cluster you have and start a new one with different node type… @smondet do you know?
@ihodes I've never tried to change the machine type "live". The machine type is an option of `coclobas configure ...`, and then each job requests some amount of CPUs/memory; so the `Biokepi.Machine.t` also has to ask for more in its `run_program` (right now we use the defaults everywhere).
Confirmed that this is a memory issue: when running the same commands manually on 30GB memory vs. 120GB memory, it fails on the former and succeeds on the latter.
Do we know if we can filter reads to the MHC locus and save a lot of space? If so, we should add this filtering step to the pipeline in Biokepi
@ihodes see https://github.com/hammerlab/biokepi/issues/423; I don't think that would address these memory issues, because that filtering would be via razerS3, which is also where the OOM is within OptiType.
Fair enough; I wonder if we could use BWA-mem to do this filtering instead?
@ihodes probably, though it's not OptiType's recommendation:
> You can use any read mapper to do this step, although we suggest you use RazerS3. Its only drawback is that due to the way RazerS3 was designed, it loads all reads into memory, which could be a problem on older, low-memory computing nodes.
Per @smondet's instructions, I ran on larger cluster nodes as follows:
```
# Ctrl-C in the Coclobas-server screen tab
coclobas cluster delete --root /coclo/_cocloroot/
coclobas configure --root _cocloroot/ --cluster-name $CLUSTER_NAME --cluster-zone $GCLOUD_ZONE --max-nodes $CLUSTER_MAX_NODES --machine-type n1-standard-32
# Don't use start-all; it would overwrite the `coclobas configure` settings
screen -t Coclobas-server coclobas start-server --root _cocloroot/ --port 8082
```
Replaced my biokepi_machine.ml with his new one, which adds support for customizing CPU/memory limits: https://github.com/hammerlab/coclobas/blob/f690ab74f1ce88ccb75d047c87e7f4eb314f7ba7/tools/docker/biokepi_machine.ml
And then:
```
export KUBE_JOB_CPUS=32
export KUBE_JOB_MEMORY=118
```
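A quick sanity echo before kicking off jobs (a hypothetical helper, not part of coclobas or Biokepi; the variable names are the ones the modified biokepi_machine.ml reads, and the defaults just mirror the exports above):

```shell
# Hypothetical sanity check: print the per-job limits the modified
# biokepi_machine.ml will request. Defaults mirror the exports above.
KUBE_JOB_CPUS="${KUBE_JOB_CPUS:-32}"
KUBE_JOB_MEMORY="${KUBE_JOB_MEMORY:-118}"
echo "each job will request ${KUBE_JOB_CPUS} CPUs and ${KUBE_JOB_MEMORY}GB of memory"
```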
Confirmed that my GCP instance group had the right node type. Then re-ran my jobs.
We'll see if that works!
Success!
Spoke too soon. 1 out of the 9 remaining jobs still failed with the same error :(.
Is that a larger FASTQ than the others, by any chance?
@ihodes it's 81GB, which I didn't think was particularly large compared to the others, but I could be misremembering.
@ihodes I was wrong; it is the largest one. Sigh. At least the problem is clear, but I'm becoming more convinced by your suggestion to filter using non-razerS3.
It may be the only way forward… or you switch to 250+GB machines for extremely expensive runs. https://github.com/hammerlab/coclobas/issues/19 will also help with degenerate cases like these in the future.
@ihodes yeah already kicked off a 208GB machine run. Let's see if that works.
It worked!
I've been experiencing the same error that @tavinathanson described here with a set of files I'm working with, but it doesn't appear to be a memory issue: requesting a machine with increased memory doesn't eliminate the problem, and I've been able to run OptiType without error on larger FASTQ files from a different dataset. Further, when I try to run the razerS command on the command line, it doesn't return an error, but it still doesn't produce a BAM file.
I'm at a bit of a loss for what to do. Any ideas as to what the problem may be?
@maryawood: unfortunately that still sounds like a memory issue or something related to it. Depending on the depth/coverage of your sequencing data, the memory requirements for razerS3 can go through the roof, and since this is tied to the way razerS3 keeps the data in memory, there is very little you can do.
I have been experimenting with different approaches, and I found that using bwa mem to filter down the reads makes the pipeline run much faster with a very small memory footprint; testing this approach over a largish cohort of patients (~100), I found that the bwa pre-filtering doesn't bias or affect the results in any way.
Here is the modified pipeline. First, index the HLA reference that ships with OptiType:

```
bwa index $OPTITYPE_HOME/data/hla_reference_dna
```

then map your reads against it, keeping only the ones that align (samtools' -F4 flag drops unmapped reads). You can do this for each pair individually, e.g.:

```
bwa mem $OPTITYPE_HOME/data/hla_reference_dna your.pair1.fastq | samtools fastq -F4 - > filtered.hla.pair1.fastq
```
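For paired-end data the same filter runs once per mate. A minimal dry-run sketch (it only prints the commands; replace `echo` with `eval`, with bwa/samtools on PATH and the index built first, to actually execute them — all file names and the default reference path are placeholders):

```shell
# Dry-run sketch of the bwa-based pre-filter for both mates of a pair.
# REF default and FASTQ names are placeholders; adjust to your layout.
REF="${OPTITYPE_HOME:-/opt/optitype}/data/hla_reference_dna"
cmds=""
for pair in pair1 pair2; do
  cmd="bwa mem $REF your.${pair}.fastq | samtools fastq -F4 - > filtered.hla.${pair}.fastq"
  echo "$cmd"          # print only; swap for `eval "$cmd"` to run
  cmds="${cmds}${cmd}; "
done
```

Filtering each mate independently can leave the pair files out of sync if one mate maps and the other doesn't, so depending on how strict your downstream tools are you may need to re-pair the filtered FASTQs afterwards.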
@armish thanks so much for the suggestion! I will give this a try
Trying with @armish's setup (since mine didn't work; see https://github.com/hammerlab/biokepi/issues/418), I get some of these, which I believe are issues with OptiType itself:
Digging a little deeper, I noticed: