pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License

kb count does not honor -m (i.e. memory option) #181

Closed molecules closed 1 year ago

molecules commented 1 year ago

Describe the issue

kb count doesn't seem to honor the -m flag (at least not below 48G).

What is the exact command that was run?

kb count \
        --loom \
        -x 10XV2 \
        --tmp temp_dir \
        -i transcriptome.idx \
        -g transcripts_to_genes.txt \
        -c1 cdna_t2c.txt \
        -c2 intron_t2c.txt \
        --workflow lamanno \
        -m 10G \
        -t 8 \
        -o out \
        $( echo $(ls fastq/sample_S*_R?_001.fastq.gz) | sort | xargs )

Command output (with --verbose flag)

[2022-12-14 18:03:28,716]   DEBUG Printing verbose output
[2022-12-14 18:03:28,726]   DEBUG kallisto binary located at /share/conda/envs/kallisto-bustools/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/kallisto
[2022-12-14 18:03:28,727]   DEBUG bustools binary located at /share/conda/envs/kallisto-bustools/lib/python3.10/site-packages/kb_python/bins/linux/bustools/bustools
[2022-12-14 18:03:28,727]   DEBUG Creating 10G_mem_sample_temp_dir directory
[2022-12-14 18:03:28,770]   DEBUG Namespace(list=False, command='count', tmp='10G_mem_sample_temp_dir', keep_tmp=False, verbose=True, i='refs/transcriptome.idx', g='refs/transcripts_to_genes.txt', x='10XV2', o='10G_mem_sample_out', w=None, t=8, m='10G', workflow='lamanno', mm=False, tcc=False, filter=None, c1='refs/cdna_t2c.txt', c2='refs/intron_t2c.txt', overwrite=False, dry_run=False, lamanno=False, nucleus=False, loom=True, h5ad=False, cellranger=False, report=False, no_inspect=False, no_validate=False, fastqs=['fastq/sample_S3_L001_R1_001.fastq.gz', 'fastq/sample_S3_L001_R2_001.fastq.gz', 'fastq/sample_S3_L002_R1_001.fastq.gz', 'fastq/sample_S3_L002_R2_001.fastq.gz', 'fastq/sample_S3_L003_R1_001.fastq.gz', 'fastq/sample_S3_L003_R2_001.fastq.gz', 'fastq/sample_S3_L004_R1_001.fastq.gz', 'fastq/sample_S3_L004_R2_001.fastq.gz'])
[2022-12-14 18:03:28,800]    INFO Using index refs/transcriptome.idx to generate BUS file to 10G_mem_sample_out from
[2022-12-14 18:03:28,800]    INFO         fastq/sample_S3_L001_R1_001.fastq.gz
[2022-12-14 18:03:28,800]    INFO         fastq/sample_S3_L001_R2_001.fastq.gz
[2022-12-14 18:03:28,800]    INFO         fastq/sample_S3_L002_R1_001.fastq.gz
[2022-12-14 18:03:28,801]    INFO         fastq/sample_S3_L002_R2_001.fastq.gz
[2022-12-14 18:03:28,801]    INFO         fastq/sample_S3_L003_R1_001.fastq.gz
[2022-12-14 18:03:28,801]    INFO         fastq/sample_S3_L003_R2_001.fastq.gz
[2022-12-14 18:03:28,801]    INFO         fastq/sample_S3_L004_R1_001.fastq.gz
[2022-12-14 18:03:28,801]    INFO         fastq/sample_S3_L004_R2_001.fastq.gz
[2022-12-14 18:03:28,802]   DEBUG kallisto bus -i refs/transcriptome.idx -o 10G_mem_sample_out -x 10XV2 -t 8 fastq/sample_S3_L001_R1_001.fastq.gz fastq/sample_S3_L001_R2_001.fastq.gz fastq/sample_S3_L002_R1_001.fastq.gz fastq/sample_S3_L002_R2_001.fastq.gz fastq/sample_S3_L003_R1_001.fastq.gz fastq/sample_S3_L003_R2_001.fastq.gz fastq/sample_S3_L004_R1_001.fastq.gz fastq/sample_S3_L004_R2_001.fastq.gz

From top, we can see that despite specifying -m 10G, it's using 48G of RAM:

top - 18:03:54 up 24 days,  1:20,  0 users,  load average: 4.57, 4.36, 5.24
Tasks: 784 total,   3 running, 781 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.9 us,  0.4 sy,  0.0 ni, 96.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 257866.9 total,   6958.0 free,  66465.5 used, 184443.5 buff/cache
MiB Swap:   8192.0 total,   7979.2 free,    212.8 used. 188551.2 avail Mem 

    PID USER  PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                            
1096528 user  20   0   48.0g  48.0g   4072 R  99.7  19.1   0:25.38 /share/conda/envs/kallisto-bustools/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/+
1096543 user  20   0   15648   4844   3276 R   0.3   0.0   0:00.11 /usr/bin/top -u user -c                                                                                        
1096486 user  20   0    2612    532    460 S   0.0   0.0   0:00.00 /bin/sh /var/spool/slurmd/job1726118/slurm_script                                                                  
1096518 user  20   0 1148276 107484  37656 S   0.0   0.0   0:01.14 /share/conda/envs/kallisto-bustools/bin/python /share/conda/envs/kallisto-bustools/bin/kb count+

This initial 48G of RAM appears to be constant regardless of whether -m is set higher or lower than 48G. However, peak RAM usage can climb well above 48G when -m is set higher. In fact, kb count seems to soak up just about all the RAM that -m allows.
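For anyone who wants to reproduce the measurement without watching top, here is a minimal Python sketch that runs a command and reports the peak resident memory of its largest child process. The function name `peak_child_rss_mib` is hypothetical and not part of kb_python. It relies on the standard `resource` module, so it assumes a Unix-like system.

```python
import resource
import subprocess
import sys


def peak_child_rss_mib(command):
    """Run `command` to completion and return peak resident memory in MiB.

    RUSAGE_CHILDREN reports the resident set size of the largest child
    (or further descendant) that has terminated and been waited for,
    not the sum over the whole process tree.
    """
    subprocess.run(command, check=True)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return usage.ru_maxrss / divisor


# Example call (arguments shortened here for illustration only):
# peak = peak_child_rss_mib(["kb", "count", "-m", "10G", "-t", "8", "-o", "out"])
# print(f"peak RSS of largest child: {peak:.0f} MiB")
```

Because the value is a maximum over every child waited for so far in the calling process, it is best to run one measurement per Python process.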

Yenaled commented 1 year ago

It is currently impossible to load a cDNA+intron index all at once in less than 40 GB of memory. The memory option only applies to the bustools step; the kallisto mapping step uses as much memory as kallisto needs (which is usually the size of the index).
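As a rough illustration of what this means when sizing a cluster job, the sketch below takes the larger of the index file size and the -m value as the memory to request. It rests on two assumptions that are not confirmed by kb_python itself: that the kallisto and bustools steps run one after the other, and that a suffix such as `G` means GiB. The helpers `parse_memory` and `estimate_job_memory` are hypothetical names.

```python
import os
import re

# Assumption: binary units (1G = 1024**3 bytes).
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_memory(text):
    """Convert a string such as '10G' or '512M' to a number of bytes."""
    match = re.fullmatch(r"(\d+)([KMGT]?)B?", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized memory string: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def estimate_job_memory(index_size_bytes, bustools_limit):
    """Estimate the memory to request for a whole `kb count` run.

    The kallisto step is assumed to need roughly the index size, and the
    bustools step is assumed to stay within the -m limit. If the steps
    run sequentially, the peak is the larger of the two.
    """
    return max(index_size_bytes, parse_memory(bustools_limit))


# Example with a real index file (path taken from the command above):
# needed = estimate_job_memory(os.path.getsize("transcriptome.idx"), "10G")
# print(f"request at least {needed / 1024**3:.1f} GiB")
```

With a 48 GiB index and -m 10G, this estimate comes out at 48 GiB, which is consistent with the top output reported above.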

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days