hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0

GRIPSS excessive RAM usage #210

Closed keiranmraine closed 2 years ago

keiranmraine commented 3 years ago

We are seeing GRIPSS jobs failing when given 512GB of RAM. Are there any plans to reduce the memory footprint to a more manageable level? Generally jobs complete in 30-60GB, so it's not a widespread issue, but I thought I'd mention it.

Thanks for supporting this tool

(edit, changed GRIDSS to GRIPSS)

d-cameron commented 3 years ago

Looks like a duplicate of https://github.com/PapenfussLab/gridss/issues/525

TLDR: depending on which step it's failing at, you'll need to set --otherjvmheap (which has a default of 4GB), not --jvmheap (which has a default of 31GB).

d-cameron commented 3 years ago

Note that these limits are different from any job scheduler limits. GRIDSS memory usage is controlled by --jvmheap/--otherjvmheap and doesn't directly depend on how much RAM is available on the host or has been allocated by the scheduler.

GRIDSS is designed to fit into a 32GB memory footprint. With a human-sized genome, --otherjvmheap can safely be increased to around 22GB without hitting a 32GB memory usage limit. This limit is lower than --jvmheap because the GRIDSS processes it governs run at the same time as external tools such as bwa mem, so peak memory usage is approximately otherjvmheap + non-heap Java overhead + bwa memory footprint.
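
A minimal sketch of that budget, assuming the standard gridss driver script and a 32GB scheduler allocation (the values and the placeholder for the remaining arguments are illustrative only, not a recommendation):

# Keep the --jvmheap default (31GB) and raise --otherjvmheap from its 4GB default;
# the steps it governs run alongside bwa mem, so the total still fits within ~32GB.
gridss --otherjvmheap 22g [reference/assembly/output/BAM arguments as usual]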

d-cameron commented 3 years ago

Are there any plans to try to reduce the memory footprint to a more manageable level

I'm working on a faster, lower-memory assembler, as well as optimisations for FFPE samples. FFPE samples are considerably more expensive in both CPU and memory than fresh frozen samples.

keiranmraine commented 3 years ago

I realised I titled this incorrectly; it should be titled GRIPSS. Updated.

We already set -Xms/-Xmx for GRIDSS.

keiranmraine commented 3 years ago

Example command, with -Xmx aligned to the resource requested on our scheduler (256G in this example):

java -Xms3500m -Xmx256000m com.hartwig.hmftools.gripss.GripssApplicationKt \
-tumor SOME_SAMPLE \
-ref_genome GRCh38_full_analysis_set_plus_decoy_hla/genome.fa \
-breakend_pon GRCh38_full_analysis_set_plus_decoy_hla/gridss/gridss_pon_single_breakend.38.bed \
-breakpoint_pon GRCh38_full_analysis_set_plus_decoy_hla/gridss/gridss_pon_breakpoint.38.bedpe \
-breakpoint_hotspot GRCh38_full_analysis_set_plus_decoy_hla/gridss/known_fusions.38_v3.bedpe \
-input_vcf 2567965.gridss.vcf.gz \
-output_vcf SOME_SAMPLE.gripss.vcf.gz

The command exits with the scheduler indicating all memory was used:

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 143.

Resource usage summary:

    CPU time :                                   2700.00 sec.
    Max Memory :                                 256000 MB
    Average Memory :                             152979.94 MB
    Total Requested Memory :                     256000.00 MB

The job itself simply reports:

Terminated

keiranmraine commented 3 years ago

GRIDSS input vcf has 29,171,635 records.

p-priestley commented 3 years ago

Hi Keiran. Which version of GRIPSS are you using?

keiranmraine commented 3 years ago

GRIPSS v1.11, we build it into a container image for large data CI validation (internally) and then use this same image under singularity on our HPC:

https://github.com/cancerit/gripss

We've processed 5840/5934 GRIDSS outputs with relative ease, but are seeing a subset with this very large memory footprint. I can generate a report of the distribution of memory usage vs input size if useful.

p-priestley commented 3 years ago

30M variants is a lot. Is this an FFPE sample? In GRIPSS v1.11 we made an improvement which should have helped: "Do not attempt transitive linking if there is 500,000+ variants", but apparently you still have a problem.

GRIPSS is actually a very lightweight component for us, so we never thought much about its performance. We only have high-quality fresh frozen samples, so we have simply never tried a VCF with 30M records.

We intend to do a bit of a rewrite at some point to deal with these issues. The memory usage vs input size data may be helpful.

In the meantime you could consider applying a basic qual filter to your VCF immediately before GRIPSS. I am not sure of your exact use case, but you could, for example, filter for qual > 100 or 150 to reduce the number of records.

keiranmraine commented 3 years ago

It seems we have a couple of tissues in this project where the event count balloons. We are currently determining whether we can use a standardised quality filter. I'll try to pull the memory data when I get the chance.

p-priestley commented 3 years ago

It sounds like it would be worth looking into as those variant numbers seem extreme to me.

d-cameron commented 3 years ago

java -Xms3500m -Xmx256000m
TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Terminated

A technicality here: the job got killed immediately after asking for a 256Gb heap. Java processes have memory (e.g. call stacks, static constants) that is not stored on the heap. If you've given your scheduler a 256Gb limit which is strictly enforced, then your process will get killed as soon as it asks for a 256Gb heap. This is consistent with what happened here, as the output was not a JVM OutOfMemory error but "Terminated".

There's probably still a memory leak and/or an algorithm with unbounded memory usage, but the actual crash you're experiencing is due to the JVM heap size being set too large with respect to the scheduler job limits. If you ask for a 254Gb heap, GRIPSS will take longer to die and you'll get a more informative OutOfMemory exception.
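
A sketch of that adjustment against the earlier command; the 254000m figure simply follows the suggestion above, and the trailing placeholder stands for the unchanged GRIPSS arguments:

# Cap the heap below the 256000 MB LSF limit so the JVM raises OutOfMemoryError
# itself instead of being killed by the scheduler.
java -Xms3500m -Xmx254000m com.hartwig.hmftools.gripss.GripssApplicationKt \
  [same -tumor/-ref_genome/-*_pon/-input_vcf/-output_vcf arguments as before]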

@p-priestley an htsjdk VariantContext object from a GRIDSS VCF uses around ~8kb of memory per record (dominated by INFO/FORMAT field hash tables). Loading an entire GRIDSS VCF into memory currently isn't viable if it comes from an FFPE sample, as there is a lot of noise in FFPE samples. We'd need to refactor GRIPSS to a two-pass approach: a first pass retaining just the fields actually required by LINX, then a second pass to actually write the output files.
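
A rough back-of-the-envelope check of that figure against the 29,171,635-record VCF reported earlier (the ~8kb/record estimate is taken from the comment above; treat the result as an order of magnitude only):

# records x ~8 KB each, converted to GiB
echo $(( 29171635 * 8 / 1024 / 1024 ))   # prints 222, i.e. ~222 GiB for the VariantContext objects alone

which is roughly consistent with the unfiltered run exhausting a 256GB allocation.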

keiranmraine commented 3 years ago

@d-cameron, I'm a little confused; I don't think I've reversed the uses of the options, but please correct me if I'm missing some other options (my Java has become rusty):

The example above starts with 3500m and is allowed to grow to 256000m.

Although we get a Terminated exit, this is probably due to the overhead of the singularity container that this runs within.

keiranmraine commented 3 years ago

@p-priestley is it possible to hard filter on the qual field during ingest of the VCF in the first step or will I need to pre-process the file?

p-priestley commented 3 years ago

There is currently no hard quality filter in GRIPSS, but you could easily use bcftools or something similar to filter. Perhaps you could first look at the distribution of QUALs in the VCF:

zgrep -v '#' XXXXX.vcf.gz | awk '{print $6}' | sort -n | uniq -c | sort -n

We would definitely like to improve this in GRIPSS, but it would help to understand a little why you get 30M GRIDSS variants before we decide how to fix it, since we have not seen a file like this. Knowing the qual distribution would be a good first step.

keiranmraine commented 3 years ago

Please see this GSheet (and chart). Slight modification to your suggested command to reduce the number of data points by binning QUAL into bins of 5:

zgrep -v '#' XXXXX.vcf.gz | awk '{print int($6/5)*5}' | sort -n | uniq -c

This is liver which is generally noisy.

p-priestley commented 3 years ago

Comparing to our files, you seem to have similar counts for QUAL > 300 but many more at the lower quals. GRIPSS should definitely not need this much memory, even for 30M variants. We do plan to address this, but not immediately (likely in about 1 month from now).

In the meantime, pre-filtering is your best option. You can safely filter out all variants with QUAL < 100 prior to loading into GRIPSS, but you could also try QUAL < 200, which would eliminate 90% of your variants. Our normal cutoff for calling is 400, although we do rescue hotspots and linked variants down to qual 100.
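
One way to apply that pre-filter with bcftools, as suggested above (the threshold and file names are illustrative; adjust to your data):

# Drop records with QUAL < 100 before handing the VCF to GRIPSS
bcftools view -e 'QUAL<100' 2567965.gridss.vcf.gz -Oz -o filtered.gridss.vcf.gz
bcftools index -t filtered.gridss.vcf.gz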

jaesvi commented 3 years ago

Hi, I was lurking in the issues and read this. It's probably not the problem here, but some time ago in a cohort analysis I was unknowingly using BAM files without marked duplicates, and this caused an incredible number of GRIDSS calls in only a subset of the samples. Simply marking duplicates solved the issue. Maybe it's good to leave this in writing in case it helps someone in the future.

keiranmraine commented 3 years ago

@jaesvi - definitely not due to a lack of duplicate marking (automated mapping pipeline), but thanks for the suggestion.

I've filtered for events with QUAL >= 100, and ensured sufficient overhead on the scheduler so that Java gives the OOM rather than being terminated (as indicated by @d-cameron).

The outcome was that the filtering above allowed the job to complete, using ~403GB of RAM.

22:37:59 - Config GripssFilterConfig(hardMinTumorQual=100, hardMaxNormalAbsoluteSupport=3, hardMaxNormalRelativeSupport=0.08, softMaxNormalRelativeSupport=0.03, minNormalCoverage=8, minTumorAF=0.005, maxShortStrandBias=0.95, minQualBreakEnd=1000, minQualBreakPoint=400, minQualRescueMobileElementInsertion=1000, maxHomLengthShortInversion=6, maxInexactHomLengthShortDel=5, minLength=32, polyGCRegion=2:32916190-32916630 +   .)
22:37:59 - Running in tumor-only mode
22:37:59 - Using XXXXXX as tumor sample
22:37:59 - Reading hotspot file: /.../GRCh38_full_analysis_set_plus_decoy_hla/gridss/known_fusions.38_v3.bedpe
22:37:59 - Reading VCF file: filtered.gridss.vcf
23:39:24 - Reading PON files: /.../GRCh38_full_analysis_set_plus_decoy_hla/gridss/gridss_pon_single_breakend.38.bed /.../GRCh38_full_analysis_set_plus_decoy_hla/gridss/gridss_pon_breakpoint.38.bedpe
23:50:19 - Applying PON file
23:50:38 - Applying initial soft filters
23:52:06 - Finding assembly links
00:03:24 - Finding transitive links
00:03:32 - Paired break end de-duplication
00:03:48 - Single break end de-duplication
00:56:03 - Finding double stranded break links
03:32:29 - Rescuing linked variants
03:32:53 - Writing file: /.../gripss/XXXXXX.gripss.vcf.gz
03:49:43 - Finished in 18705 seconds

keiranmraine commented 3 years ago

@d-cameron is the regression in 2.12.1 likely to have exacerbated this issue?

d-cameron commented 3 years ago

2.12.1 will give you many more single breakends, but most real breakpoints will be filtered due to lack of SR support.

charlesshale commented 2 years ago

I've just released a refactored version of Gripss which significantly reduces its memory usage.

https://github.com/hartwigmedical/hmftools/blob/master/gripss/README.md

and

https://github.com/hartwigmedical/hmftools/releases/tag/gripss-v2.0_beta