stjude / XenoCP

A cloud-based tool for mouse read cleansing in xenograft samples
Apache License 2.0

Limit the memory usage of XenoCP #44

Closed zhaowsong closed 11 months ago

zhaowsong commented 12 months ago

Dear authors, I am using Singularity to run XenoCP, requesting 30 cores and 240 GB of memory. Unfortunately, for some large WGS BAM files, the execution fails due to memory issues. Below you'll find the relevant code snippet and the error message. Is there a way to limit the memory usage of XenoCP?

[screenshots: job submission command and out-of-memory error messages]

Best regards, Zhao

adthrasher commented 12 months ago

In the mode in which you're invoking XenoCP, it runs on a single host. It appears that you have cores allocated across multiple nodes (at least node073, node132, and node144, based on the error message) in a Slurm environment. Singularity and XenoCP do not have access to the distributed nodes, and it is likely that your memory request is also spread across them.
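For the single-host mode, one workaround is to pin the Slurm allocation to one node so that all requested cores and memory land on the same machine. A minimal sketch of such a submission script; the job name, image name, CWL path, and input file are illustrative placeholders, not XenoCP defaults:

```shell
#!/bin/bash
#SBATCH --job-name=xenocp
#SBATCH --nodes=1              # force a single node so Singularity sees all resources
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=30     # all 30 cores on the same host
#SBATCH --mem=240G             # all memory on the same host

# Placeholder invocation; adjust image and workflow paths for your setup.
singularity exec xenocp.sif cwl-runner /opt/xenocp/cwl/xenocp.cwl inputs.yml
```

With `--nodes=1`, Slurm either grants the full 30 cores and 240 GB on one machine or leaves the job pending, rather than scattering the allocation across hosts the container cannot reach.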

adthrasher commented 11 months ago

In #49, I added docker container specifications to the CWL. It may now be possible to run the individual steps distributed across a computing environment, like Slurm, but this would be experimental as I don't have a Slurm cluster to test. There is also a WDL implementation (https://github.com/stjude/XenoCP#wdl-workflow) which should be runnable on a Slurm cluster using miniwdl and the miniwdl-slurm extension. The WDL implementation would also distribute tasks across multiple nodes reducing the memory footprint on any given node.
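For the WDL route, miniwdl selects its execution backend through its configuration system, which the miniwdl-slurm extension plugs into. A hedged sketch of how that might look; the workflow path and input file are placeholders, and the environment-variable override form is one of several ways miniwdl accepts configuration:

```shell
# Install miniwdl plus the Slurm backend plugin.
pip install miniwdl miniwdl-slurm

# Select the slurm_singularity backend provided by miniwdl-slurm
# (this can equally be set in a miniwdl.cfg file).
export MINIWDL__SCHEDULER__CONTAINER_BACKEND=slurm_singularity

# Placeholder paths; point these at the XenoCP WDL workflow and your inputs.
miniwdl run xenocp.wdl -i inputs.json
```

Because each WDL task then becomes its own Slurm job, memory pressure is spread across the cluster instead of concentrated on the submitting host.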

bounlu commented 9 months ago

I second this. XenoCP is very RAM-hungry, especially when STAR is chosen as the aligner: it launches many jobs simultaneously with default parameters until they are eventually killed, and cwl-runner then retries them in an infinite loop until the user kills the Docker container.

Is there a parameter to limit memory, CPUs, or the number of jobs when running the Docker option, without modifying the hardcoded values?

STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Feb 06 09:26:47 ..... started STAR run
Feb 06 09:26:47 ..... loading genome
Feb 06 09:29:13 ..... started 1st pass mapping
Feb 06 09:29:13 ..... started 1st pass mapping
Feb 06 09:29:13 ..... started 1st pass mapping
Feb 06 09:29:14 ..... started 1st pass mapping
/usr/local/bin/star_onlymapped.sh: line 70: 18868 Killed                  STAR --genomeDir /tmp/ekwaxs30/stg61f5372b-73d7-41c2-bb30-8fddddef93cd/star_index_mouse --readFilesCommand zcat --readFilesIn /tmp/ld7dlepp/21.fastq.gz --runMode alignReads --runThreadN 128 --outSAMunmapped Within --outSAMstrandField intronMotif --outSAMtype BAM Unsorted --outSAMattributes NH HI AS nM NM MD XS --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 20 --outFilterMismatchNmax 10 --alignIntronMax 500000 --alignMatesGapMax 1000000 --sjdbScore 2 --alignSJDBoverhangMin 1 --outFilterMatchNminOverLread 0.66 --outFilterScoreMinOverLread 0.66 --outFileNamePrefix 21.fastq.contam. --twopassMode Basic --limitBAMsortRAM 3000000000
INFO [job mapping-star_14] Max memory used: 33393MiB
WARNING [job mapping-star_14] exited with status: 137
ERROR [job mapping-star_14] Job error:
("Error collecting output for parameter 'bam': opt/xenocp/cwl/star_onlymapped.cwl:45:11: Did not find output file with glob pattern: ['21.fastq.contam.bam'].", {})
WARNING [job mapping-star_14] completed permanentFail
mcrusch commented 9 months ago

@bounlu would you consider opening a new ticket and adding details about your environment and how you're running it? It seems like it might be a bit different from the case here, especially as this ticket was about WGS and you're running STAR. STAR uses a lot of memory just by itself, so it presents a different challenge.

bounlu commented 9 months ago

This was due to the --parallel parameter for cwl-runner: it tries to run as many STAR jobs as possible at once, which exhausts the memory. When I removed that option, the jobs ran sequentially (and significantly slower) but at least succeeded. I am not sure if there is a parameter to further control the number of jobs or CPUs/cores/threads under --parallel in cwl-runner. The documentation does say --parallel is experimental, though.
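For reference, the fix described above is just dropping the flag; the workflow and input file names below are placeholders:

```shell
# Memory-hungry: cwl-runner dispatches every ready step at once.
# cwl-runner --parallel xenocp.cwl inputs.yml

# Sequential fallback: one step at a time, slower but with bounded peak memory.
cwl-runner xenocp.cwl inputs.yml
```

As far as I know, cwl-runner exposes no knob between these two extremes, so limiting concurrency currently means giving up --parallel entirely.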