pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
146 stars 23 forks source link

Kb count command stuck on bustools count step #251

Closed gauravgadhvi closed 4 months ago

gauravgadhvi commented 4 months ago

I am trying to quantify RNA reads from SlideSeqV2 data using a custom technology string (-x ) and a kallisto index (version 0.50.1) with help of the kb count (_kbpython 0.28.2) command. The total size of raw FASTQ files for each of my samples are ranging from 70GB to 150GB.

For a few samples, I was able to get results from kb count. Whereas for the majority of samples, it runs out of memory or just gets into a phase at the bustools count command where it never finishes running this step. For these samples, kb ran for 3-4 days and still didn't move ahead of the bustools count step.

Initially, I reckoned it is a memory issue and reduced the -m flag requested memory to half of the requested cluster memory, so as to allow kallisto and bustools binaries to have enough memory available, but no luck there either. Reducing the memory limit for kb allowed to remove any _badalloc errors, although it still won't move ahead of the bustools count step.

Here is an example of the command run and the verbose output log is what follows the command :

kb count --h5ad --verbose --strand unstranded -m 180G -i ./MUS_TEFamily_kb28_index.idx -g ./MUS_TEfam_custom_t2g.txt -x 0,0,8,0,26,32:0,32,41:1,0,60 -t 10 -w ./cleanBarcodes_whitelist.txt -o ./ ./210817_Puck_210817_11.L001/210817_Puck_210817_11.L001.R1.fastq.gz ./210817_Puck_210817_11.L001/210817_Puck_210817_11.L001.R2.fastq.gz ./210817_Puck_210817_11.L002/210817_Puck_210817_11.L002.R1.fastq.gz ./210817_Puck_210817_11.L002/210817_Puck_210817_11.L002.R2.fastq.gz ./210817_Puck_210817_11.L003/210817_Puck_210817_11.L003.R1.fastq.gz ./210817_Puck_210817_11.L003/210817_Puck_210817_11.L003.R2.fastq.gz ./210817_Puck_210817_11.L004/210817_Puck_210817_11.L004.R1.fastq.gz ./210817_Puck_210817_11.L004/210817_Puck_210817_11.L004.R2.fastq.gz

Output Log :


[2024-04-29 16:48:04,822]   DEBUG [main] Printing verbose output
[2024-04-29 16:48:07,288]   DEBUG [main] kallisto binary located at /home/ggadhvi/mambaforge/envs/kb28_env/lib/python3.11/site-packages/kb_python/bins/linux/kallisto/kallisto
[2024-04-29 16:48:07,288]   DEBUG [main] bustools binary located at /home/ggadhvi/mambaforge/envs/kb28_env/lib/python3.11/site-packages/kb_python/bins/linux/bustools/bustools
[2024-04-29 16:48:07,289]   DEBUG [main] Creating `./tmp` directory
[2024-04-29 16:48:07,290]   DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='/nfs/turbo/umms-welchjd/ggadhvi/RefAnno/KBindex_MUS_TEFamily__kb28Latest/MUS_TEFamily_kb28_index.idx', g='/nfs/turbo/umms-welchjd/ggadhvi/RefAnno/KBindex_MUS_TEFamily__kb28Latest/MUS_TEfam_custom_t2g.txt', x='0,0,8,0,26,32:0,32,41:1,0,60', o='./', num=False, w='./cleanBarcodes_whitelist.txt', r=None, t=8, m='105G', strand='unstranded', inleaved=False, genomebam=False, aa=False, gtf=None, chromosomes=None, workflow='standard', em=False, mm=False, tcc=False, filter=None, filter_threshold=None, c1=None, c2=None, overwrite=False, dry_run=False, batch_barcodes=False, loom=False, h5ad=True, loom_names='barcode,target_name', sum='none', cellranger=False, gene_names=False, N=None, report=False, no_inspect=False, kallisto='/home/ggadhvi/mambaforge/envs/kb28_env/lib/python3.11/site-packages/kb_python/bins/linux/kallisto/kallisto', bustools='/home/ggadhvi/mambaforge/envs/kb28_env/lib/python3.11/site-packages/kb_python/bins/linux/bustools/bustools', no_validate=False, no_fragment=False, parity=None, fragment_l=None, fragment_s=None, bootstraps=None, matrix_to_files=False, matrix_to_directories=False, fastqs=['/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210817_11/210817_Puck_210817_11.L001/210817_Puck_210817_11.L001.R1.fastq.gz', '/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210817_11/210817_Puck_210817_11.L001/210817_Puck_210817_11.L001.R2.fastq.gz', '/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210817_11/210817_Puck_210817_11.L002/210817_Puck_210817_11.L002.R1.fastq.gz', '/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210817_11/210817_Puck_210817_11.L002/210817_Puck_210817_11.L002.R2.fastq.gz', '/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210817_11/210817_Puck_210817_11.L003/210817_Puck_210817_11.L003.R1.fastq.gz', '/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210817_11/210817_Puck_210817_11.L003/210817_Puck_210817_11.L003.R2.fastq.gz', '/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210817_11/210817_Puck_210817_11.L004/210817_Puck_210817_11.L004.R1.fastq.gz', '/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210817_11/210817_Puck_210817_11.L004/210817_Puck_210817_11.L004.R2.fastq.gz'])
[2024-04-29 16:48:18,438]    INFO [count] Skipping kallisto bus because output files already exist. Use the --overwrite flag to overwrite.
[2024-04-29 16:48:18,439]    INFO [count] Sorting BUS file ./output.bus to ./tmp/output.s.bus
[2024-04-29 16:48:18,439]   DEBUG [count] bustools sort -o ./tmp/output.s.bus -T ./tmp -t 8 -m 105G ./output.bus
[2024-04-29 16:50:19,451]   DEBUG [count] partition time: 5.10493s
[2024-04-29 16:50:24,957]   DEBUG [count] all fits in buffer
[2024-04-29 16:50:37,769]   DEBUG [count] Read in 218203111 BUS records
[2024-04-29 16:50:37,769]   DEBUG [count] reading time 2.12751s
[2024-04-29 16:50:37,769]   DEBUG [count] sorting time 42.6594s
[2024-04-29 16:50:37,770]   DEBUG [count] writing time 1.39058s
[2024-04-29 16:50:37,770]    INFO [count] Inspecting BUS file ./tmp/output.s.bus
[2024-04-29 16:50:37,770]   DEBUG [count] bustools inspect -o ./inspect.json -w ./cleanBarcodes_whitelist.txt ./tmp/output.s.bus
[2024-04-29 16:50:46,190]    INFO [count] Correcting BUS records in ./tmp/output.s.bus to ./tmp/output.s.c.bus with on-list ./cleanBarcodes_whitelist.txt
[2024-04-29 16:50:46,190]   DEBUG [count] bustools correct -o ./tmp/output.s.c.bus -w ./cleanBarcodes_whitelist.txt ./tmp/output.s.bus
[2024-04-29 16:50:48,094]   DEBUG [count] Found 161600 barcodes in the on-list
[2024-04-29 16:51:01,807]   DEBUG [count] Processed 13010737 BUS records
[2024-04-29 16:51:01,807]   DEBUG [count] In on-list = 1857751
[2024-04-29 16:51:01,807]   DEBUG [count] Corrected    = 1542115
[2024-04-29 16:51:01,807]   DEBUG [count] Uncorrected  = 9610871
[2024-04-29 16:51:01,808]    INFO [count] Sorting BUS file ./tmp/output.s.c.bus to ./output.unfiltered.bus
[2024-04-29 16:51:01,808]   DEBUG [count] bustools sort -o ./output.unfiltered.bus -T ./tmp -t 8 -m 105G ./tmp/output.s.c.bus
[2024-04-29 16:51:58,159]   DEBUG [count] partition time: 0.047635s
[2024-04-29 16:51:58,259]   DEBUG [count] all fits in buffer
[2024-04-29 16:52:09,569]   DEBUG [count] Read in 3399866 BUS records
[2024-04-29 16:52:09,570]   DEBUG [count] reading time 0.033655s
[2024-04-29 16:52:09,570]   DEBUG [count] sorting time 0.451159s
[2024-04-29 16:52:09,570]   DEBUG [count] writing time 0.083456s
[2024-04-29 16:52:09,572]    INFO [count] Generating count matrix ./counts_unfiltered/cells_x_genes from BUS file ./output.unfiltered.bus
[2024-04-29 16:52:09,573]   DEBUG [count] bustools count -o ./counts_unfiltered/cells_x_genes -g /nfs/turbo/umms-welchjd/ggadhvi/RefAnno/KBindex_MUS_TEFamily__kb28Latest/MUS_TEfam_custom_t2g.txt -e ./matrix.ec -t ./transcripts.txt --genecounts --umi-gene ./output.unfiltered.bus
slurmstepd: error: *** JOB 7618192 ON gl0005 CANCELLED AT 2024-05-03T12:32:30 DUE TO TIME LIMIT ***

Here is an output log for one of the samples for which kb count successfully ran without being stuck so we know the command works for this dataset :


[2024-04-22 17:22:03,100]   DEBUG [main] Printing verbose output
[2024-04-22 17:22:05,304]   DEBUG [main] kallisto binary located at /home/ggadhvi/mambaforge/envs/kb28_env/lib/python3.11/site-packages/kb_python/bins/linux/kallisto/kallisto
[2024-04-22 17:22:05,305]   DEBUG [main] bustools binary located at /home/ggadhvi/mambaforge/envs/kb28_env/lib/python3.11/site-packages/kb_python/bins/linux/bustools/bustools
[2024-04-22 17:22:05,305]   DEBUG [main] Creating `./tmp` directory
[2024-04-22 17:22:05,305]   DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='/nfs/turbo/umms-welchjd/ggadhvi/RefAnno/KBindex_MUS_TEFamily__kb28Latest/MUS_TEFamily_kb28_index.idx', g='/nfs/turbo/umms-welchjd/ggadhvi/RefAnno/KBindex_MUS_TEFamily__kb28Latest/MUS_TEfam_custom_t2g.txt', x='0,0,8,0,26,32:0,32,41:1,0,60', o='./', num=False, w='./cleanBarcodes_whitelist.txt', r=None, t=8, m='155G', strand='unstranded', inleaved=False, genomebam=False, aa=False, gtf=None, chromosomes=None, workflow='standard', em=False, mm=False, tcc=False, filter=None, filter_threshold=None, c1=None, c2=None, overwrite=False, dry_run=False, batch_barcodes=False, loom=False, h5ad=True, loom_names='barcode,target_name', sum='none', cellranger=False, gene_names=False, N=None, report=False, no_inspect=False, kallisto='/home/ggadhvi/mambaforge/envs/kb28_env/lib/python3.11/site-packages/kb_python/bins/linux/kallisto/kallisto', bustools='/home/ggadhvi/mambaforge/envs/kb28_env/lib/python3.11/site-packages/kb_python/bins/linux/bustools/bustools', no_validate=False, no_fragment=False, parity=None, fragment_l=None, fragment_s=None, bootstraps=None, matrix_to_files=False, matrix_to_directories=False, fastqs=['/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L001/210224_Puck_210224_05.L001.R1.fastq.gz', '/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L001/210224_Puck_210224_05.L001.R2.fastq.gz', '/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L002/210224_Puck_210224_05.L002.R1.fastq.gz', '/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L002/210224_Puck_210224_05.L002.R2.fastq.gz', '/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L003/210224_Puck_210224_05.L003.R1.fastq.gz', '/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L003/210224_Puck_210224_05.L003.R2.fastq.gz', '/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L004/210224_Puck_210224_05.L004.R1.fastq.gz', '/gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L004/210224_Puck_210224_05.L004.R2.fastq.gz'])
[2024-04-22 17:22:08,629]    INFO [count] Using index /nfs/turbo/umms-welchjd/ggadhvi/RefAnno/KBindex_MUS_TEFamily__kb28Latest/MUS_TEFamily_kb28_index.idx to generate BUS file to ./ from
[2024-04-22 17:22:08,629]    INFO [count]         /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L001/210224_Puck_210224_05.L001.R1.fastq.gz
[2024-04-22 17:22:08,629]    INFO [count]         /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L001/210224_Puck_210224_05.L001.R2.fastq.gz
[2024-04-22 17:22:08,629]    INFO [count]         /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L002/210224_Puck_210224_05.L002.R1.fastq.gz
[2024-04-22 17:22:08,629]    INFO [count]         /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L002/210224_Puck_210224_05.L002.R2.fastq.gz
[2024-04-22 17:22:08,979]    INFO [count]         /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L003/210224_Puck_210224_05.L003.R1.fastq.gz
[2024-04-22 17:22:08,979]    INFO [count]         /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L003/210224_Puck_210224_05.L003.R2.fastq.gz
[2024-04-22 17:22:08,979]    INFO [count]         /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L004/210224_Puck_210224_05.L004.R1.fastq.gz
[2024-04-22 17:22:08,979]    INFO [count]         /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L004/210224_Puck_210224_05.L004.R2.fastq.gz
[2024-04-22 17:22:08,979]   DEBUG [count] kallisto bus -i /nfs/turbo/umms-welchjd/ggadhvi/RefAnno/KBindex_MUS_TEFamily__kb28Latest/MUS_TEFamily_kb28_index.idx -o ./ -x 0,0,8,0,26,32:0,32,41:1,0,60 -t 8 --unstranded /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L001/210224_Puck_210224_05.L001.R1.fastq.gz /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L001/210224_Puck_210224_05.L001.R2.fastq.gz /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L002/210224_Puck_210224_05.L002.R1.fastq.gz /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L002/210224_Puck_210224_05.L002.R2.fastq.gz /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L003/210224_Puck_210224_05.L003.R1.fastq.gz /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L003/210224_Puck_210224_05.L003.R2.fastq.gz /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L004/210224_Puck_210224_05.L004.R1.fastq.gz /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L004/210224_Puck_210224_05.L004.R2.fastq.gz
[2024-04-22 17:22:09,080]   DEBUG [count] 
[2024-04-22 17:22:09,180]   DEBUG [count] [index] k-mer length: 31
[2024-04-22 17:22:09,180]   DEBUG [count] [index] number of targets: 1,374
[2024-04-22 17:22:09,180]   DEBUG [count] [index] number of k-mers: 1,366,866
[2024-04-22 17:22:09,180]   DEBUG [count] [quant] will process sample 1: /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L001/210224_Puck_210224_05.L001.R1.fastq.gz
[2024-04-22 17:22:09,180]   DEBUG [count] /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L001/210224_Puck_210224_05.L001.R2.fastq.gz
[2024-04-22 17:22:09,180]   DEBUG [count] [quant] will process sample 2: /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L002/210224_Puck_210224_05.L002.R1.fastq.gz
[2024-04-22 17:22:09,180]   DEBUG [count] /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L002/210224_Puck_210224_05.L002.R2.fastq.gz
[2024-04-22 17:22:09,180]   DEBUG [count] [quant] will process sample 3: /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L003/210224_Puck_210224_05.L003.R1.fastq.gz
[2024-04-22 17:22:09,180]   DEBUG [count] /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L003/210224_Puck_210224_05.L003.R2.fastq.gz
[2024-04-22 17:22:09,181]   DEBUG [count] [quant] will process sample 4: /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L004/210224_Puck_210224_05.L004.R1.fastq.gz
[2024-04-22 17:22:09,181]   DEBUG [count] /gpfs/accounts/welchjd_root/welchjd0/ggadhvi/SlideSeq_TE_Analysis/SlideSeqv2_Raw_FASTQ/Puck_210224_05/210224_Puck_210224_05.L004/210224_Puck_210224_05.L004.R2.fastq.gz
[2024-04-22 17:22:19,387]   DEBUG [count] [quant] finding pseudoalignments for the reads ...
[2024-04-22 17:22:22,389]   DEBUG [count] [progress] 1M reads processed (12.4% mapped)
[2024-04-22 17:22:26,091]   DEBUG [count] [progress] 2M reads processed (12.5% mapped)
..
..
..
[2024-04-22 17:59:37,745]   DEBUG [count] [progress] 1309M reads processed (12.3% mapped)
[2024-04-22 17:59:39,346]   DEBUG [count] [progress] 1310M reads processed (12.3% mapped)
[2024-04-22 17:59:41,147]   DEBUG [count] [progress] 1311M reads processed (12.3% mapped)              done
[2024-04-22 17:59:41,147]   DEBUG [count] [quant] processed 1,312,229,814 reads, 161,725,045 reads pseudoaligned
[2024-04-22 17:59:41,147]   DEBUG [count] 
[2024-04-22 17:59:41,147]    INFO [count] Sorting BUS file ./output.bus to ./tmp/output.s.bus
[2024-04-22 17:59:41,148]   DEBUG [count] bustools sort -o ./tmp/output.s.bus -T ./tmp -t 8 -m 155G ./output.bus
[2024-04-22 18:06:05,573]   DEBUG [count] partition time: 2.843s
[2024-04-22 18:06:08,608]   DEBUG [count] all fits in buffer
[2024-04-22 18:06:17,113]   DEBUG [count] Read in 161725045 BUS records
[2024-04-22 18:06:17,113]   DEBUG [count] reading time 1.2075s
[2024-04-22 18:06:17,113]   DEBUG [count] sorting time 24.769s
[2024-04-22 18:06:17,113]   DEBUG [count] writing time 0.625065s
[2024-04-22 18:06:17,114]    INFO [count] Inspecting BUS file ./tmp/output.s.bus
[2024-04-22 18:06:17,114]   DEBUG [count] bustools inspect -o ./inspect.json -w ./cleanBarcodes_whitelist.txt ./tmp/output.s.bus
[2024-04-22 18:06:20,922]    INFO [count] Correcting BUS records in ./tmp/output.s.bus to ./tmp/output.s.c.bus with on-list ./cleanBarcodes_whitelist.txt
[2024-04-22 18:06:20,922]   DEBUG [count] bustools correct -o ./tmp/output.s.c.bus -w ./cleanBarcodes_whitelist.txt ./tmp/output.s.bus
[2024-04-22 18:06:21,123]   DEBUG [count] Found 83063 barcodes in the on-list
[2024-04-22 18:06:25,325]   DEBUG [count] Processed 9959532 BUS records
[2024-04-22 18:06:25,325]   DEBUG [count] In on-list = 1200909
[2024-04-22 18:06:25,325]   DEBUG [count] Corrected    = 1327153
[2024-04-22 18:06:25,325]   DEBUG [count] Uncorrected  = 7431470
[2024-04-22 18:06:26,426]    INFO [count] Sorting BUS file ./tmp/output.s.c.bus to ./output.unfiltered.bus
[2024-04-22 18:06:26,426]   DEBUG [count] bustools sort -o ./output.unfiltered.bus -T ./tmp -t 8 -m 155G ./tmp/output.s.c.bus
[2024-04-22 18:07:50,577]   DEBUG [count] partition time: 0.022433s
[2024-04-22 18:07:50,677]   DEBUG [count] all fits in buffer
[2024-04-22 18:07:58,635]   DEBUG [count] Read in 2528062 BUS records
[2024-04-22 18:07:58,635]   DEBUG [count] reading time 0.014686s
[2024-04-22 18:07:58,635]   DEBUG [count] sorting time 0.220296s
[2024-04-22 18:07:58,635]   DEBUG [count] writing time 0.046158s
[2024-04-22 18:07:58,677]    INFO [count] Generating count matrix ./counts_unfiltered/cells_x_genes from BUS file ./output.unfiltered.bus
[2024-04-22 18:07:58,683]   DEBUG [count] bustools count -o ./counts_unfiltered/cells_x_genes -g /nfs/turbo/umms-welchjd/ggadhvi/RefAnno/KBindex_MUS_TEFamily__kb28Latest/MUS_TEfam_custom_t2g.txt -e ./matrix.ec -t ./transcripts.txt --genecounts --umi-gene ./output.unfiltered.bus
[2024-04-22 18:08:02,573]   DEBUG [count] ./counts_unfiltered/cells_x_genes.mtx passed validation
[2024-04-22 18:08:02,573]    INFO [count] Writing gene names to file ./counts_unfiltered/cells_x_genes.genes.names.txt
[2024-04-22 18:08:02,650] WARNING [count] Found duplicate entries for tRNA-Thr-ACY in /nfs/turbo/umms-welchjd/ggadhvi/RefAnno/KBindex_MUS_TEFamily__kb28Latest/MUS_TEfam_custom_t2g.txt. Earlier entries will be ignored.
[2024-04-22 18:08:02,650] WARNING [count] Found duplicate entries for U13 in /nfs/turbo/umms-welchjd/ggadhvi/RefAnno/KBindex_MUS_TEFamily__kb28Latest/MUS_TEfam_custom_t2g.txt. Earlier entries will be ignored.
[2024-04-22 18:08:02,650] WARNING [count] Found duplicate entries for tRNA-His-CAY in /nfs/turbo/umms-welchjd/ggadhvi/RefAnno/KBindex_MUS_TEFamily__kb28Latest/MUS_TEfam_custom_t2g.txt. Earlier entries will be ignored.
[2024-04-22 18:08:02,650] WARNING [count] Found duplicate entries for tRNA-Arg-CGA in /nfs/turbo/umms-welchjd/ggadhvi/RefAnno/KBindex_MUS_TEFamily__kb28Latest/MUS_TEfam_custom_t2g.txt. Earlier entries will be ignored.
[2024-04-22 18:08:02,650] WARNING [count] Found duplicate entries for tRNA-Thr-ACG in /nfs/turbo/umms-welchjd/ggadhvi/RefAnno/KBindex_MUS_TEFamily__kb28Latest/MUS_TEfam_custom_t2g.txt. Earlier entries will be ignored.
[2024-04-22 18:08:02,653] WARNING [count] 1369 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2024-04-22 18:08:02,658]    INFO [count] Reading matrix ./counts_unfiltered/cells_x_genes.mtx
[2024-04-22 18:08:04,402]    INFO [count] Writing matrix to h5ad ./counts_unfiltered/adata.h5ad
[2024-04-22 18:08:04,768]   DEBUG [main] Removing `./tmp` directory
Mouse TE-family Alignment complete for puck :Puck_210224_05

Do you have any suggestions for why bustools count takes so long or if it might've entered a bad state?

I am not sure which part of the kb count pipeline demands the most amount of memory and what is the suggested memory requirement for large FASTQ samples (such as 150G)? Is there any way to speed up the processing by using more threads or processors? Or is it dependent on the total RAM allocated to the job?

Thank you in advance for helping me figure this out so we can optimize our kb-python jobs and runtime!

-Gaurav

Yenaled commented 4 months ago

Memory is not an issue — there’s no need to specify anything for -m (in fact, I recommend that you don’t; kb count usually takes in the single digits worth of memory).

The issue is bustools count hanging (it has indeed entered a bad state). This can sometimes happen if your t2g.txt file is not formatted correctly (how did you create it / how was kb ref run?).

If you email me the output.unfiltered.bus from one the failed runs along with the t2g.txt, matrix.ec, and transcripts.txt file, I can diagnose the issue and tell you exactly what’s wrong (maybe the transcripts.txt is in a different order than the transcripts in the t2g.txt file? The fact that even the successful run says there are “duplicate entries” is cause for concern). Otherwise, re-examine how kb ref was run / how the index+t2g files were created.

gauravgadhvi commented 4 months ago

I created an index using custom sequences in a fasta file and running kallisto index to get the index.

Since I didn't have any GTF file or coordinates for these sequences, I couldn't run kb ref. Upon further discussion Sina had suggested to create a dummy t2g.txt file with names of each of the contigs from the fasta file in two columns. So I have been using the t2g.txt file along with the kallisto created index to run kb count.

I have emailed you the files you asked for along with the t2g.txt and the fasta file for reference. Thank you for helping with the diagnosis of the issue.

Yenaled commented 4 months ago

OK, your problem is indeed with the t2g.txt file (problems with this file are the only reason I have ever seen bustools count hang).

The following transcripts are duplicated in the first line of the t2g.txt file: U13, tRNA-Arg-CGA, tRNA-His-CAY, tRNA-Thr-ACG, tRNA-Thr-ACY

If you look in your transcripts.txt file, they actually have distinct names (i.e. the first U13 is named "U13", the second "U13" is named "U13_1", etc.).

You must be super careful and make absolutely sure that your t2g.txt and the generated transcripts.txt file are perfectly concordant. This means that your transcripts.txt file must be 100% identical to the t2g.txt file (exact same order, exact same names).

Ideally, I'd have the program print out an error (or at least a warning message) when such a mismatch exists, but I never got around to doing so.

gauravgadhvi commented 4 months ago

That is really helpful to know about the exact matches in the t2g.txt and the transcripts.txt. I saw a warning message showing the de-duplication of conflicting transcript IDs and assumed it would be implicitly taken care of in the background. I will create a new index with the updated contig names and retry processing my data.

Thank you so much for your help and suggestion! :)

gauravgadhvi commented 4 months ago

That worked like magic, with memory less than 10GB required even for larger samples! Thank you again!