rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

Possible issue with DNANexus + UKBiobank #365

Open ejgardner-insmed opened 1 year ago

ejgardner-insmed commented 1 year ago

Hello,

I wanted to report a possible issue that I can't quite nail down. The issue is both complex and limited to a very specific use case within the UKBiobank platform supplied by DNANexus, and requires a specific setup to trigger:

In brief, dxfuse creates a FUSE mount of a user's DNANexus project so that data can be streamed rather than downloaded, allowing for the use of AWS VMs with smaller HDD footprints. When streaming data, the following issue is likely to occur:

enqueue_thread invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
ALERT Out of memory: Killed process 20606 (regenie) total-vm:21935360kB, anon-rss:20758552kB, file-rss:0kB, shmem-rss:0kB, UID:100000 pgtables:40732kB oom_score_adj:0
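For context, the two access modes being compared can be sketched as follows (the file path is a hypothetical field-22828 imputed BGEN; regenie arguments abbreviated, commands shown via echo):

```shell
# Hypothetical path to an imputed BGEN file in the project
BGEN="/Bulk/Imputation/ukb22828_c1_b0_v3.bgen"

# 1) Download mode: copy the file onto the worker's local disk, then read locally.
echo dx download "$BGEN" -o local.bgen

# 2) Streaming mode: read through the dxfuse mount exposed at /mnt/project.
STREAM_PATH="/mnt/project${BGEN}"
echo regenie --step 2 --bgen "$STREAM_PATH"
```

The OOM above only occurs in the second mode, where reads go through dxfuse's prefetch cache.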

I have seen this issue documented by another user (#308) where they indicated they were using REGENIE with the swiss-army-knife tool which, by default, streams rather than downloads genetic data.

To be very clear, this issue DOES NOT HAPPEN when downloading rather than streaming data! I have tested REGENIE under several different run-time conditions (download or streaming) and can only get the issue to happen when using dxfuse.

Interestingly, the developer of another widely-used association tool (BOLT-LMM) also ran into an issue under similar, but not identical, conditions and issued an update to rectify the problem:

https://community.dnanexus.com/s/question/0D5t000003yKtuDCAS/boltlmm-v24-update-for-rap

So I am wondering if a similar issue could be happening here? While I recognise that this may be a bit of a niche issue, considering that most users will likely be using the 'swiss-army-knife' tool to run REGENIE, it may affect a large number of users.

I hope I have been clear, and I am happy to explain further if some/all of the above is opaque.

dvh13 commented 1 year ago

Is the streaming input/output speed too slow for regenie? We found that pointing regenie (on a 64-thread machine) at a standard Google storage bucket made the I/O too slow and it crashed. We needed to copy the file to the local VM disk first.
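The copy-first workaround described above might look like this (bucket and file names are hypothetical; platform commands shown via echo):

```shell
# Helper: map a bucket object to a local staging path (hypothetical layout)
local_copy() {   # usage: local_copy gs://bucket/path/file
  echo "/tmp/regenie_inputs/$(basename "$1")"
}

src="gs://my-gwas-bucket/ukb_imputed_chr1.bgen"   # hypothetical bucket object
dst=$(local_copy "$src")

echo mkdir -p /tmp/regenie_inputs
echo gsutil cp "$src" "$dst"                      # gsutil: Google Cloud Storage CLI
echo regenie --step 2 --bgen "$dst" --threads 64  # read from local disk, not the bucket
```

This trades local disk space for I/O throughput, which is exactly the trade-off the dxfuse approach was trying to avoid.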

ejgardner-insmed commented 1 year ago

Thanks David for the additional info – seems similar to what I am experiencing on DNANexus.

ejgardner-insmed commented 1 year ago

P.S. This issue is a problem because it significantly increases the size/cost of the AWS instance that one has to request for primary analysis, since some of the data (particularly the imputed SNPs) is > 2 TB.

dvh13 commented 1 year ago

You need a --allow-crap-io-speed option, or just run it on 1 thread.

joellembatchou commented 1 year ago

Hi,

Which file format (BED/PGEN/BGEN) does this occur with and have you checked it occurs with all of them? Also can you confirm it does not occur if you use the streaming option with a single thread?

Cheers, Joelle

ejgardner-insmed commented 1 year ago

Hi Joelle,

I have only tested with BGEN thus far (the raw imputed data provided by UK Biobank – field 22828) so cannot comment on BED/PGEN. BGEN is not so much a problem for me as the data provided by UKBiobank is small enough that I just download it directly to the AWS node every time I use it.

I also cannot directly confirm issues with a single thread. How I currently have REGENIE running is by dividing each chromosome into smaller chunks, and then running these chunks multi-threaded on a single node. I have reported this issue to DNA Nexus and received the following response:

Thank you for letting us know about this issue. The question is very interesting and uncommon so it took us some time to review. There is indeed a memory overhead when using dxfuse: https://github.com/dnanexus/dxfuse/blob/master/doc/Internals.md#sequential-prefetch

dxfuse creates a POSIX-like interface to a project, so it most benefits users working on Cloud Workstation, notebooks, or debugging. If the app has access to the input file, reading files from a dxfuse mount point is no better than streaming the file via the dx toolkit (downloading). So I would recommend you stick with the file-download strategy for now.

Their latter point regarding speed isn't really the point of my use of dxfuse (the point is to save HDD space), so a solution would still be nice.

Let me know if I can offer any additional help!
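The chunked, multi-threaded setup described above could be sketched like this, using regenie's documented --range region filter (chunk size of 20 Mb and output naming are hypothetical choices; the regenie invocation is abbreviated and shown via echo):

```shell
# Build a regenie --range string "chr:start-end" from Mb coordinates
make_range() {   # usage: make_range CHR START_MB END_MB
  echo "$1:$(( $2 * 1000000 + 1 ))-$(( $3 * 1000000 ))"
}

chr=1
# Chromosome 1 is ~249 Mb; walk it in hypothetical 20 Mb chunks
for start in $(seq 0 20 240); do
  range=$(make_range "$chr" "$start" $(( start + 20 )))
  echo regenie --step 2 --range "$range" --out "chunk_${chr}_${start}"
done
```

Each chunk can then be dispatched as a separate job, keeping per-job memory and runtime bounded.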

joellembatchou commented 1 year ago

I see. In the current implementation, BGEN-format files cannot be streamed, as we perform multiple accesses to the file. I have made a note of this, but it will probably not be addressed anytime soon as we are working on other changes.

Cheers, Joelle

ejgardner-insmed commented 1 year ago

Thanks Joelle!

iamyingzhou commented 1 year ago

As the error message indicated a lack of memory, I resolved this issue by using an instance type with a larger memory capacity.
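Concretely, that workaround means rerunning the same command on a higher-memory instance class (the instance-type name here is a hypothetical choice; job inputs elided, command shown via echo):

```shell
# Hypothetical higher-memory instance class; pick one sized for your job
INSTANCE_TYPE="mem3_ssd1_v2_x16"

echo dx run swiss-army-knife \
    -icmd="regenie --step 2 ..." \
    --instance-type "$INSTANCE_TYPE" \
    --yes
```

This sidesteps rather than fixes the dxfuse memory overhead, so costs scale with the size of the streamed data.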

soumickmj commented 8 months ago

On DNANexus, if you are using Swiss Army Knife (SAK) to launch regenie and want to use BGEN files, then instead of supplying the paths to regenie via /mnt/project, supply them as inputs to the SAK. Alternatively, you can use the PLINK files, which work without issue. Instead of an OOM, you might also get an error like:

terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'

This is also caused by the same issue and can be resolved using either of the techniques mentioned above.

So, for example:

BGEN_FOLDER="/Bulk/Exome\ sequences/Population\ level\ exome\ OQFE\ variants,\ BGEN\ format\ -\ final\ release"
path_to_500kwes_helper_files="/mnt/project/Bulk/Exome\ sequences/Population\ level\ exome\ OQFE\ variants,\ PLINK\ format\ -\ final\ release/helper_files"

run_regenie_step2_burden="regenie --step 2 \
    --bgen ukb23159_c${chr}_b0_v1.bgen \
    --ref-first \
    --sample ukb23159_c${chr}_b0_v1.sample \
    --phenoFile /mnt/project${phenoFile} \
    --covarFile /mnt/project${covarFile} \
    --pred /mnt/project${step1out}/regenie_step1_out_pred.list \
    --phenoColList ${phenoColList} \
    --covarColList ${covarColList} \
    --catCovarList ${catCovarList} \
    --minMAC ${minMAC} \
    --set-list /mnt/project${setsfile} \
    --anno-file "${path_to_500kwes_helper_files}/ukb23158_500k_OQFE.annotations.txt.gz" \
    --mask-def  /mnt/project${maskdefFile} \
    --aaf-bins ${aafbins} \
    --build-mask ${maskmode} \
    --out burden_regenie_chr${chr} \
    --maxCatLevels 10 --bsize 200 --qt --threads 16 --gz"

full_cmd="dx run swiss-army-knife \
            -iin=\"${BGEN_FOLDER}/ukb23159_c${chr}_b0_v1.bgen\" \
            -iin=\"${BGEN_FOLDER}/ukb23159_c${chr}_b0_v1.sample\" \
            -iin=\"${step1out}/regenie_step1_out_pred.list\" \
            -icmd=\"${run_regenie_step2_burden}\" \
            --name=\"regenie_step2_burden_chr${chr}\" \
            --tag=\"regenie_step2_burden_chr${chr}\" \
            --instance-type=\"mem1_ssd1_v2_x16\" \
            --destination=\"${outpath}/${outtag}_regenie_step2_burden_results/\" --brief --yes"

for file in $(dx ls "${step1out}" | grep '\.loco\.gz$'); do
    full_cmd+=" -iin=\"${step1out}/${file}\" "
done

eval $full_cmd

Here, instead of supplying the BGEN file directly to regenie via /mnt/project, it is supplied as an input to the SAK worker and then accessed locally.

soumickmj commented 8 months ago

This issue also occurs when you work with PLINK files (BED/BIM/FAM), but the manifestation of the problem is different, and potentially more dangerous: supplying PLINK files via /mnt/project does not cause an OOM; instead, regenie stops processing genes at a certain point, without any error or warning.
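Because this failure mode is silent, it is worth sanity-checking that every gene in the set list actually made it into the output. A sketch of such a check (file names, column positions, and the ID-to-gene mapping are hypothetical and depend on your mask/bin setup):

```shell
# Compare an expected gene count against the count recovered from output
check_gene_counts() {   # usage: check_gene_counts EXPECTED TESTED
  if [ "$2" -lt "$1" ]; then
    echo "WARNING: only $2 of $1 genes tested"
  else
    echo "OK: $2 genes tested"
  fi
}

# On a real run it might be driven like this (hypothetical files/columns):
#   expected=$(wc -l < sets.txt)
#   tested=$(zcat burden_regenie_chr1.regenie.gz | tail -n +2 \
#            | cut -d' ' -f3 | cut -d'.' -f1 | sort -u | wc -l)
#   check_gene_counts "$expected" "$tested"
check_gene_counts 100 80
```

A shortfall with a zero exit status from regenie is the signature of the silent stall described above.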