OptiType fails sometimes with BAM not found #419

Closed tavinathanson closed 7 years ago

tavinathanson commented 7 years ago

Trying with @armish's setup (since mine didn't work; see https://github.com/hammerlab/biokepi/issues/418), I get some of these, which I believe are issues with OptiType itself:

### Kube-Job 5ae7ce4b-d459-5b3c-a30d-6cd9528d6291
### Freshness: Fresh
### Output:

Linux 5ae7ce4b-d459-5b3c-a30d-6cd9528d6291 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 x86_64 x86_64 GNU/Linux
No export var
/tmp/_MEIq1H09H/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.

0:00:00.47 Mapping f003a1843a6c739551bcfd981af8afd7_checkpoint-trials_lung_tumor_bams_mafs_SN0110394_bams_mafs_old_samples_IN_MCC_00234_T1_bamIN_MCC_00234_T1-b2fq-PE_R1.fastq to GEN reference...

0:14:11.61 Mapping f003a1843a6c739551bcfd981af8afd7_checkpoint-trials_lung_tumor_bams_mafs_SN0110394_bams_mafs_old_samples_IN_MCC_00234_T1_bamIN_MCC_00234_T1-b2fq-PE_R2.fastq to GEN reference...

0:27:58.74 Generating binary hit matrix.
Traceback (most recent call last):
  File "<string>", line 267, in <module>
  File "hlatyper.py", line 177, in pysam_to_hdf
  File "pysam/calignmentfile.pyx", line 333, in pysam.calignmentfile.AlignmentFile.__cinit__ (pysam/calignmentfile.c:4808)
  File "pysam/calignmentfile.pyx", line 533, in pysam.calignmentfile.AlignmentFile._open (pysam/calignmentfile.c:7027)
IOError: file `tumor_dna_processing_IN_MCC_00234/2017_02_09_10_38_37/2017_02_09_10_38_37_1.bam` not found
OptiTypePipeline returned -1

Digging a little deeper, I noticed:

tavinathanson commented 7 years ago

So I don't think it's related to getting BAMs as input, since it's not following that code path.

Rather, it appears to do this:




Then, I think it fails at:


tavinathanson commented 7 years ago

For whatever reason, it looks like https://github.com/FRED-2/OptiType/blob/master/OptiTypePipeline.py#L288 didn't result in a BAM being created?

I also see that the BAMs get removed when done, which explains why the other successes don't have BAMs there.
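
One way to make this kind of failure surface at the mapping step, rather than later as a pysam IOError, would be to check both the mapper's exit status and the expected BAM before moving on. The following is a minimal Python sketch of that idea, not OptiType's actual code; `run_mapper` and `MappingFailed` are hypothetical names introduced here for illustration.

```python
import os
import subprocess


class MappingFailed(RuntimeError):
    """Raised when the mapper exits abnormally or leaves no usable BAM behind."""


def run_mapper(cmd, bam_path):
    """Run a mapping command and verify that it actually produced bam_path.

    cmd is an argument list and bam_path is the file the command is expected
    to write. Both come from the caller; nothing here is OptiType-specific.
    """
    result = subprocess.run(cmd)
    if result.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a
        # signal (for example -6 for SIGABRT).
        raise MappingFailed("mapper killed by signal %d" % -result.returncode)
    if result.returncode != 0:
        raise MappingFailed("mapper exited with status %d" % result.returncode)
    if not os.path.isfile(bam_path) or os.path.getsize(bam_path) == 0:
        raise MappingFailed(
            "mapper reported success but %s is missing or empty" % bam_path
        )
    return bam_path
```

With a guard like this, a crashed mapper would stop the run with a message naming the mapping step instead of a "file not found" from the later hit-matrix step.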

tavinathanson commented 7 years ago

This is a duplicate of what @armish hit in RCC: https://github.com/hammerlab/rcc-analyses/issues/104

Leaving it open since this is the more general repo.

tavinathanson commented 7 years ago

Tried running this manually in the VM. Some more information:

Aborted (core dumped)

0:09:19.04 Mapping 0e8070747629c84c18f763603fea9545_checkpoint-trials_lung_tumor_bams_mafs_SN0109695_bams_new_samples_AG538184-7_bamAG538184-7-b2fq-PE_R2.fastq to GEN reference...
/nfs-pool/biokepi/toolkit/biopam-kit/opam_dir/opam-root-root-optitype.1.0.0/0.0.0/build/seqan.2.1.0/include/seqan/basic/basic_exception.h:363 FAILED!  (Uncaught exception of type std::bad_alloc: std::bad_alloc)

stack trace:
  0                      [0x72c93d]
  1                      [0x75a146]
  2                      [0x75a191]
  3                      [0x75b149]
  4                      [0x74f99c]
  5                      [0x4091c4]
  6                      [0x47846a]
  7                      [0x4d2bb2]
  8                      [0x72bc38]
  9                      [0x401b23]
 10                      [0x810dc6]
 11                      [0x810fba]
 12                      [0x404cd9]

Aborted (core dumped)

0:18:12.60 Generating binary hit matrix.
Traceback (most recent call last):
  File "<string>", line 267, in <module>
  File "hlatyper.py", line 177, in pysam_to_hdf
  File "pysam/calignmentfile.pyx", line 333, in pysam.calignmentfile.AlignmentFile.__cinit__ (pysam/calignmentfile.c:4808)
  File "pysam/calignmentfile.pyx", line 533, in pysam.calignmentfile.AlignmentFile._open (pysam/calignmentfile.c:7027)
IOError: file `tumor_dna_processing_AG538184-7/2017_02_10_16_29_59/2017_02_10_16_29_59_1.bam` not found
OptiTypePipeline returned -1
(/nfs-pool/biokepi//toolkit/biopam-kit/envs/optitype.1.0.0) opam@115e92c3b8c5:/nfs-pool-16/biokepi/work/results-b37decoy-tumor_dna_processing_AG538184-7/a8365d6f969a40ef3f6fa69c0a56ed62tumor_dna_processing_AG538184-7DNA0e8070747629c84c18f763603fea9545_checkpoint-trials_lung_tumor_bams_mafs_SN0109695_bams_new_samples_AG538184-7_bamAG538184-7-b2fq-PE_R1_fastqoptitype.d$

tavinathanson commented 7 years ago

Seems like an OOM situation. Looks like the 10 that failed, at first glance, were relatively large FASTQs; this might be relevant. https://github.com/seqan/seqan/issues/1276
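
To check the size hunch more systematically, something like the following could rank the input FASTQs by size. It is only a sketch: `largest_fastqs` is a made-up helper name, and on-disk size is just a rough proxy for how much the mapper has to hold in memory.

```python
import os


def largest_fastqs(paths, top=10):
    """Return (size_in_GiB, path) pairs for the largest files, biggest first."""
    sizes = [(os.path.getsize(p) / 2**30, p) for p in paths]
    return sorted(sizes, reverse=True)[:top]
```

Comparing this list against the set of failed jobs would show whether the failures really cluster at the top.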

ihodes commented 7 years ago

You can try remaking the cluster with bigger nodes? Or is there an argument you can pass to OptiType that tells it to use all 52GB of the default nodes?

tavinathanson commented 7 years ago

@ihodes first trying manually on a beefed up node; but if that works, how do I remake the cluster with bigger nodes?

ihodes commented 7 years ago

I'm not sure to be honest; you might be able to change it from the GCloud GKE interface, or you could take down the cluster you have and start a new one with different node type… @smondet do you know?

smondet commented 7 years ago

@ihodes I've never tried to change the machine type "live"

The machine type is an option of coclobas configure ...; each job then requests some amount of CPUs/memory, so the Biokepi.Machine.t also has to ask for more in its run_program (right now we use the defaults everywhere).

tavinathanson commented 7 years ago

Confirmed that this is a memory issue: when running the same commands manually on 30GB memory vs. 120GB memory, it fails on the former and succeeds on the latter.

ihodes commented 7 years ago

Do we know if we can filter reads to the MHC locus and save a lot of space? If so, we should add this filtering step to the pipeline in Biokepi

tavinathanson commented 7 years ago

@ihodes see https://github.com/hammerlab/biokepi/issues/423; I don't think that would address these memory issues, because that filtering would be via razerS3, which is also where the OOM is within OptiType.

ihodes commented 7 years ago

Fair enough; I wonder if we could use BWA-mem to do this filtering instead?

tavinathanson commented 7 years ago

@ihodes probably, though it's not OptiType's recommendation:

You can use any read mapper to do this step, although we suggest you use RazerS3. Its only drawback is that due to the way RazerS3 was designed, it loads all reads into memory, which could be a problem on older, low-memory computing nodes.
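
For reference, a bwa-based version of that pre-filtering step might look roughly like the sketch below. It is an assumption-laden illustration rather than a tested replacement: it assumes `bwa` and `samtools` are on the PATH, that the HLA reference FASTA has already been indexed with `bwa index`, and that each mate file is filtered on its own. `bwa_prefilter_command` and all file names are hypothetical.

```python
import shlex


def bwa_prefilter_command(hla_reference, fastq, out_fastq, threads=4):
    """Build a shell pipeline that keeps only reads aligning to the HLA reference."""
    stages = [
        # Align reads against the pre-indexed HLA reference; SAM goes to stdout.
        ["bwa", "mem", "-t", str(threads), hla_reference, fastq],
        # -F 4 drops unmapped reads; -b writes BAM.
        ["samtools", "view", "-b", "-F", "4", "-"],
        # Convert the surviving reads back to FASTQ on stdout.
        ["samtools", "fastq", "-"],
    ]
    pipeline = " | ".join(" ".join(shlex.quote(arg) for arg in stage) for stage in stages)
    return pipeline + " > " + shlex.quote(out_fastq)


if __name__ == "__main__":
    # Placeholder file names; print the command instead of running it.
    print(bwa_prefilter_command("hla_reference_dna.fasta", "sample_1.fastq",
                                "sample_1.fished.fastq", threads=8))
```

The function only builds the command string, so it can be inspected before anything is run. Unlike razerS3, bwa mem streams reads in batches instead of loading them all, which is why this route is attractive for very large FASTQs.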

tavinathanson commented 7 years ago

Per @smondet's instructions, I ran on larger cluster nodes as follows:

# Ctrl-C in the Coclobas-server screen tab
coclobas cluster delete --root /coclo/_cocloroot/
coclobas configure --root _cocloroot/ --cluster-name $CLUSTER_NAME --cluster-zone $GCLOUD_ZONE --max-nodes $CLUSTER_MAX_NODES --machine-type n1-standard-32
screen -t Coclobas-server coclobas start-server --root _cocloroot/ --port 8082 # Don't use start-all; this will overwrite the coclobas configure command

Replaced my biokepi_machine.ml with his new one, which adds support for customizing CPU/memory limits: https://github.com/hammerlab/coclobas/blob/f690ab74f1ce88ccb75d047c87e7f4eb314f7ba7/tools/docker/biokepi_machine.ml

And then:

export KUBE_JOB_CPUS=32
export KUBE_JOB_MEMORY=118

Confirmed that my GCP instance group had the right node type. Then re-ran my jobs.
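
As a sanity check before re-running, one could compare the requested job size against the node shape. This is a hypothetical helper, not part of Coclobas or Biokepi; it assumes both variables hold plain numbers and that KUBE_JOB_MEMORY is expressed in GB.

```python
import os

# n1-standard-32 is documented by Google Cloud as 32 vCPUs and 120 GB of memory.
NODE_CPUS = 32
NODE_MEMORY_GB = 120.0


def check_job_request(env=None, node_cpus=NODE_CPUS, node_memory_gb=NODE_MEMORY_GB):
    """Return a list of problems with the requested job size (empty if it fits)."""
    env = os.environ if env is None else env
    problems = []
    for name, limit in (("KUBE_JOB_CPUS", node_cpus), ("KUBE_JOB_MEMORY", node_memory_gb)):
        raw = env.get(name)
        if raw is None:
            problems.append("%s is not set, so the default request would be used" % name)
            continue
        # Assumption: the value parses as a number (memory in GB).
        if float(raw) > limit:
            problems.append("%s=%s exceeds what one node offers (%s)" % (name, raw, limit))
    return problems
```

With the values above (32 and 118) this returns an empty list. Kubernetes also reserves part of each node for system use, so a request that passes this check can still fail to schedule.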

We'll see if that works!

tavinathanson commented 7 years ago

Spoke too soon. 1 out of the 9 remaining jobs still failed with the same error :(.

ihodes commented 7 years ago

Is that a larger FASTQ than the others, by any chance?

tavinathanson commented 7 years ago

@ihodes it's 81GB, which I didn't think was particularly large compared to the others, but I could be misremembering.

tavinathanson commented 7 years ago

@ihodes I was wrong; it is the largest one. Sigh. At least the problem is clear, but I'm becoming more convinced by your suggestion to filter using something other than razerS3.

ihodes commented 7 years ago

It may be the only way forward… or you switch to 250+GB machines for extremely expensive runs. https://github.com/hammerlab/coclobas/issues/19 will also help with degenerate cases like these in the future.

tavinathanson commented 7 years ago

@ihodes yeah already kicked off a 208GB machine run. Let's see if that works.

tavinathanson commented 7 years ago

It worked!

maryawood commented 6 years ago

I've been experiencing the same error that @tavinathanson described here with a set of files I'm working with, but it doesn't appear to be a memory issue: requesting a machine with more memory does not eliminate the problem, and I've been able to run OptiType without error on larger FASTQ files from a different dataset. Further, when I run the razerS3 command directly on the command line, it doesn't return an error, but it still doesn't produce a BAM file.

I'm at a bit of a loss for what to do. Any ideas as to what the problem may be?

armish commented 6 years ago

@maryawood: unfortunately that still sounds like a memory issue or something related to it. Depending on the depth/coverage of your sequencing data, the memory requirements for razerS3 can go through the roof, and since this comes from the way razerS3 keeps the data in memory, there is very little you can do.

I have been experimenting with different approaches, and I found that using bwa mem to filter down the reads makes the pipeline run much faster and with a very small memory footprint. Testing this approach on a largish cohort of patients (~100), I found that bwa pre-filtering doesn't really bias or affect the result in any way.

Here is the modified pipeline:

maryawood commented 6 years ago

@armish thanks so much for the suggestion! I will give this a try