pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
141 stars 23 forks source link

kb count long run time (run seems stuck) #255

Closed dsb66 closed 1 month ago

dsb66 commented 1 month ago

Describe the issue kb count of 1.6B 10X reads runs for days (run not completed yet) on a 16-thread computer. Based on my previous experience and the run time statement in your preprint, the run should have been completed in less than 2 hours. The verbose output pasted below was started yesterday, but the same command (not in verbose mode) has been running for one week.

The kb count part of the pipeline completes in ~1 hour, then the program keeps running bustools counts with no output accumulating in the counts_unfiltered directory (the files in the directory remains of size 0 bytes).

The run_info.json output is below:

{
        "n_targets": 133085,
        "n_bootstraps": 0,
        "n_processed": 1599636936,
        "n_pseudoaligned": 940980413,
        "n_unique": 190286711,
        "p_pseudoaligned": 58.8,
        "p_unique": 11.9,
        "kallisto_version": "0.50.1",
        "index_version": 13,
        "start_time": "Wed May 29 11:45:57 2024",
        "call": "/opt/bin/miniforge3/envs/kb/lib/python3.12/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i /genomes/GRCm38.100/kb/GRCm38.100.nac.k_idx -o kbVelo_debug_EC_S15 -x 10XV3 -t 20 /fastq/2020_mouse1/FASTQs/EC_S15_L003_R1_001.fastq.gz /fastq/2020_mouse1/FASTQs/EC_S15_L003_R2_001.fastq.gz"
}

What is the exact command that was run?

kb count \
        -i /genomes/GRCm38.100/kb/GRCm38.100.nac.k_idx \
        -g /genomes/GRCm38.100/kb/GRCm38.100.t2g \
        -x "10XV3" \
        -c1 /genomes/GRCm38.100/kb/cdna_t2c.txt \
        -c2 /genomes/GRCm38.100/kb/intron_t2c.txt \
        --verbose \
        -o kbVelo_debug_EC_S15 \
        -t 20 \
        -m 10G \
        --workflow nac \
        --overwrite \
        --sum nucleus \
        --h5ad \
        --gene-names \
        /fastq/2020_mouse1/FASTQs/EC_S15_L003_R1_001.fastq.gz \
        /fastq/2020_mouse1/FASTQs/EC_S15_L003_R2_001.fastq.gz

Command output (with --verbose flag)

[2024-05-29 11:45:42,041]   DEBUG [main] Printing verbose output
[2024-05-29 11:45:44,246]   DEBUG [main] kallisto binary located at /opt/bin/miniforge3/envs/kb/lib/python3.12/site-packages/kb_python/bins/linux/kallisto/kallisto
[2024-05-29 11:45:44,247]   DEBUG [main] bustools binary located at /opt/bin/miniforge3/envs/kb/lib/python3.12/site-packages/kb_python/bins/linux/bustools/bustools
[2024-05-29 11:45:44,247]   DEBUG [main] Creating `kbVelo_debug_EC_S15/tmp` directory
[2024-05-29 11:45:44,252]   DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='/genomes/GRCm38.100/kb/GRCm38.100.nac.k_idx', g='/genomes/GRCm38.100/kb/GRCm38.100.t2g', x='10XV3', o='kbVelo_debug_EC_S15', num=False, w=None, r=None, t=20, m='10G', strand=None, inleaved=False, genomebam=False, aa=False, gtf=None, chromosomes=None, workflow='nac', em=False, mm=False, tcc=False, filter=None, filter_threshold=None, c1='/genomes/GRCm38.100/kb/cdna_t2c.txt', c2='/genomes/GRCm38.100/kb/intron_t2c.txt', overwrite=True, dry_run=False, batch_barcodes=False, loom=False, h5ad=True, loom_names='barcode,target_name', sum='nucleus', cellranger=False, gene_names=True, N=None, report=False, no_inspect=False, kallisto='/opt/bin/miniforge3/envs/kb/lib/python3.12/site-packages/kb_python/bins/linux/kallisto/kallisto', bustools='/opt/bin/miniforge3/envs/kb/lib/python3.12/site-packages/kb_python/bins/linux/bustools/bustools', no_validate=False, no_fragment=False, parity=None, fragment_l=None, fragment_s=None, bootstraps=None, matrix_to_files=False, matrix_to_directories=False, fastqs=['/fastq/2020_mouse1/FASTQs/EC_S15_L003_R1_001.fastq.gz', '/fastq/2020_mouse1/FASTQs/EC_S15_L003_R2_001.fastq.gz'])
[2024-05-29 11:45:57,288]    INFO [count_nac] Using index /genomes/GRCm38.100/kb/GRCm38.100.nac.k_idx to generate BUS file to kbVelo_debug_EC_S15 from
[2024-05-29 11:45:57,288]    INFO [count_nac]         /fastq/2020_mouse1/FASTQs/EC_S15_L003_R1_001.fastq.gz
[2024-05-29 11:45:57,289]    INFO [count_nac]         /fastq/2020_mouse1/FASTQs/EC_S15_L003_R2_001.fastq.gz
[2024-05-29 11:45:57,289]   DEBUG [count_nac] kallisto bus -i /genomes/GRCm38.100/kb/GRCm38.100.nac.k_idx -o kbVelo_debug_EC_S15 -x 10XV3 -t 20 /fastq/2020_mouse1/FASTQs/EC_S15_L003_R1_001.fastq.gz /fastq/2020_mouse1/FASTQs/EC_S15_L003_R2_001.fastq.gz
[2024-05-29 11:45:57,392]   DEBUG [count_nac] 
[2024-05-29 11:45:57,392]   DEBUG [count_nac] [bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[2024-05-29 11:46:19,612]   DEBUG [count_nac] [index] k-mer length: 31
[2024-05-29 11:46:37,829]   DEBUG [count_nac] [index] number of targets: 133,085
[2024-05-29 11:46:37,930]   DEBUG [count_nac] [index] number of k-mers: 985,152,882
[2024-05-29 11:46:37,930]   DEBUG [count_nac] [index] number of D-list k-mers: 8,724,506
[2024-05-29 11:46:37,930]   DEBUG [count_nac] [quant] will process sample 1: /fastq/2020_mouse1/FASTQs/EC_S15_L003_R1_001.fastq.gz
[2024-05-29 11:46:37,930]   DEBUG [count_nac] /fastq/2020_mouse1/FASTQs/EC_S15_L003_R2_001.fastq.gz
[2024-05-29 11:46:44,337]   DEBUG [count_nac] [quant] finding pseudoalignments for the reads ...
[2024-05-29 11:46:46,339]   DEBUG [count_nac] [progress] 1M reads processed (59.1% mapped)
[...]
[2024-05-29 12:39:17,220]   DEBUG [count_nac] [progress] 1599M reads processed (58.8% mapped)              done
[2024-05-29 12:39:17,220]   DEBUG [count_nac] [quant] processed 1,599,636,936 reads, 940,980,413 reads pseudoaligned
[2024-05-29 12:39:18,221]   DEBUG [count_nac] 
[2024-05-29 12:39:22,625]    INFO [count_nac] Sorting BUS file kbVelo_debug_EC_S15/output.bus to kbVelo_debug_EC_S15/tmp/output.s.bus
[2024-05-29 12:39:22,625]   DEBUG [count_nac] bustools sort -o kbVelo_debug_EC_S15/tmp/output.s.bus -T kbVelo_debug_EC_S15/tmp -t 20 -m 10G kbVelo_debug_EC_S15/output.bus
[2024-05-29 12:40:12,788]   DEBUG [count_nac] partition time: 8.67857s
[2024-05-29 12:41:12,254]   DEBUG [count_nac] partition time: 8.6916s
[2024-05-29 12:42:03,607]   DEBUG [count_nac] partition time: 6.98062s
[2024-05-29 12:42:18,122]   DEBUG [count_nac] Read in 940980413 BUS records
[2024-05-29 12:42:18,122]   DEBUG [count_nac] reading time 31.3087s
[2024-05-29 12:42:18,122]   DEBUG [count_nac] sorting time 177.286s
[2024-05-29 12:42:18,122]   DEBUG [count_nac] writing time 0s
[2024-05-29 12:43:30,496]    INFO [count_nac] On-list not provided
[2024-05-29 12:43:30,497]    INFO [count_nac] Copying pre-packaged 10XV3 on-list to kbVelo_debug_EC_S15
[2024-05-29 12:43:32,311]    INFO [count_nac] Inspecting BUS file kbVelo_debug_EC_S15/tmp/output.s.bus
[2024-05-29 12:43:32,311]   DEBUG [count_nac] bustools inspect -o kbVelo_debug_EC_S15/inspect.json -w kbVelo_debug_EC_S15/10x_version3_whitelist.txt kbVelo_debug_EC_S15/tmp/output.s.bus
[2024-05-29 12:44:38,387]    INFO [count_nac] Correcting BUS records in kbVelo_debug_EC_S15/tmp/output.s.bus to kbVelo_debug_EC_S15/tmp/output.s.c.bus with on-list kbVelo_debug_EC_S15/10x_version3_whitelist.txt
[2024-05-29 12:44:38,388]   DEBUG [count_nac] bustools correct -o kbVelo_debug_EC_S15/tmp/output.s.c.bus -w kbVelo_debug_EC_S15/10x_version3_whitelist.txt kbVelo_debug_EC_S15/tmp/output.s.bus
[2024-05-29 12:44:43,801]   DEBUG [count_nac] Found 6794880 barcodes in the on-list
[2024-05-29 12:45:06,925]   DEBUG [count_nac] Processed 397568322 BUS records
[2024-05-29 12:45:06,925]   DEBUG [count_nac] In on-list = 381611698
[2024-05-29 12:45:06,925]   DEBUG [count_nac] Corrected    = 1897623
[2024-05-29 12:45:06,925]   DEBUG [count_nac] Uncorrected  = 14059001
[2024-05-29 12:45:16,934]    INFO [count_nac] Sorting BUS file kbVelo_debug_EC_S15/tmp/output.s.c.bus to kbVelo_debug_EC_S15/output.unfiltered.bus
[2024-05-29 12:45:16,934]   DEBUG [count_nac] bustools sort -o kbVelo_debug_EC_S15/output.unfiltered.bus -T kbVelo_debug_EC_S15/tmp -t 20 -m 10G kbVelo_debug_EC_S15/tmp/output.s.c.bus
[2024-05-29 12:45:58,683]   DEBUG [count_nac] partition time: 3.07438s
[2024-05-29 12:46:28,513]   DEBUG [count_nac] partition time: 0.430518s
[2024-05-29 12:46:32,117]   DEBUG [count_nac] Read in 383509321 BUS records
[2024-05-29 12:46:32,117]   DEBUG [count_nac] reading time 13.6631s
[2024-05-29 12:46:32,117]   DEBUG [count_nac] sorting time 57.1149s
[2024-05-29 12:46:32,117]   DEBUG [count_nac] writing time 0s
[2024-05-29 12:47:37,893]    INFO [count_nac] Generating count matrix kbVelo_debug_EC_S15/counts_unfiltered/cells_x_genes from BUS file kbVelo_debug_EC_S15/output.unfiltered.bus
[2024-05-29 12:47:37,895]   DEBUG [count_nac] bustools count -o kbVelo_debug_EC_S15/counts_unfiltered/cells_x_genes -g /genomes/GRCm38.100/kb/GRCm38.100.t2g -e kbVelo_debug_EC_S15/matrix.ec -t kbVelo_debug_EC_S15/transcripts.txt -s /genomes/GRCm38.100/kb/intron_t2c.txt --genecounts --umi-gene kbVelo_debug_EC_S15/output.unfiltered.bus
Yenaled commented 1 month ago

Exact same issue as here: https://github.com/pachterlab/kb_python/issues/251

If you create your reference with kb ref and use those files directly, it should work. Your t2g file (and -c1 and -c2 files) needs to exactly match what's in your index. If it hangs, that means it doesn't match.

In general, if bustools count takes over 20 minutes, you should probably just kill the process because it's hanging.

dsb66 commented 1 month ago

Thanks, I missed that issue. I had both the standard and nac indexes in the same directory and that must have messed up one of the nac files. I rebuilt the nac index and know everything is fine.