Effective length normalization for full-length UMI scRNA-seq data

biobenkj commented 12 months ago

Describe the issue Is it possible to perform fragment length normalization for full-length UMI scRNA-seq data? I ask because I usually generate the TCC matrix and follow up with kallisto quant-tcc and see it can take in a fragment length file. Is there an option I need to throw to generate that file? Thanks!

What is the exact command that was run?

/varidata/research/projects/shen/tools/benkj/mambaforge/envs/kb/bin/kb count --verbose -t 8 -i ../ref/ens101_ercc92.idx -g ../ref/ens101_ercc92_t2g.txt -x STORM-seq --tcc --umi-gene --kallisto /varidata/research/projects/shen/tools/benkj/kallisto/build/src/kallisto --bustools /varidata/research/projects/shen/tools/benkj/mambaforge/envs/kb/bin/bustools -o K24_kb_tcc K24_L000_R1_001.100k.fq.gz K24_L000_R2_001.100k.fq.gz

Command output (with --verbose flag)

[2023-10-06 12:41:26,259]   DEBUG [main] Printing verbose output
[2023-10-06 12:41:28,475]   DEBUG [main] kallisto binary located at /varidata/research/projects/shen/tools/benkj/kallisto/build/src/kallisto
[2023-10-06 12:41:28,476]   DEBUG [main] bustools binary located at /varidata/research/projects/shen/tools/benkj/mambaforge/envs/kb/bin/bustools
[2023-10-06 12:41:28,476]   DEBUG [main] Creating `K24_kb_tcc/tmp` directory
[2023-10-06 12:41:28,620]   DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='../ref/ens101_ercc92.idx', g='../ref/ens101_ercc92_t2g.txt', x='STORM-seq', o='K24_kb_tcc', w=None, t=8, m='4G', strand=None, workflow='standard', em=False, umi_gene=True, mm=False, tcc=True, filter=None, filter_threshold=None, c1=None, c2=None, overwrite=False, dry_run=False, loom=False, h5ad=False, cellranger=False, gene_names=False, report=False, no_inspect=False, kallisto='/varidata/research/projects/shen/tools/benkj/kallisto/build/src/kallisto', bustools='/varidata/research/projects/shen/tools/benkj/mambaforge/envs/kb/bin/bustools', no_validate=False, parity=None, fragment_l=None, fragment_s=None, fastqs=['K24_L000_R1_001.100k.fq.gz', 'K24_L000_R2_001.100k.fq.gz'])
[2023-10-06 12:41:36,290]    INFO [count] Using index ../ref/ens101_ercc92.idx to generate BUS file to K24_kb_tcc from
[2023-10-06 12:41:36,290]    INFO [count]         K24_L000_R1_001.100k.fq.gz
[2023-10-06 12:41:36,290]    INFO [count]         K24_L000_R2_001.100k.fq.gz
[2023-10-06 12:41:36,290]   DEBUG [count] kallisto bus -i ../ref/ens101_ercc92.idx -o K24_kb_tcc -x STORM-seq -t 8 K24_L000_R1_001.100k.fq.gz K24_L000_R2_001.100k.fq.gz
[2023-10-06 12:41:36,391]   DEBUG [count]
[2023-10-06 12:41:36,391]   DEBUG [count] [bus] Note: Strand option was not specified; setting it to --rf-stranded for specified technology
[2023-10-06 12:41:39,996]   DEBUG [count] [index] k-mer length: 31
[2023-10-06 12:41:41,499]   DEBUG [count] [index] number of targets: 229,579
[2023-10-06 12:41:41,499]   DEBUG [count] [index] number of k-mers: 141,699,932
[2023-10-06 12:41:41,499]   DEBUG [count] [quant] will process sample 1: K24_L000_R1_001.100k.fq.gz
[2023-10-06 12:41:41,499]   DEBUG [count] K24_L000_R2_001.100k.fq.gz
[2023-10-06 12:41:43,602]   DEBUG [count] [quant] finding pseudoalignments for the reads ... done
[2023-10-06 12:41:43,602]   DEBUG [count] [quant] processed 100,000 reads, 65,976 reads pseudoaligned
[2023-10-06 12:41:43,602]   DEBUG [count] [quant] estimated average fragment length: 308.761
[2023-10-06 12:41:43,702]   DEBUG [count]
[2023-10-06 12:41:46,606]   DEBUG [count] K24_kb_tcc/output.bus passed validation
[2023-10-06 12:41:46,606]    INFO [count] Sorting BUS file K24_kb_tcc/output.bus to K24_kb_tcc/tmp/output.s.bus
[2023-10-06 12:41:46,606]   DEBUG [count] bustools sort -o K24_kb_tcc/tmp/output.s.bus -T K24_kb_tcc/tmp -t 8 -m 4G K24_kb_tcc/output.bus
[2023-10-06 12:41:48,909]   DEBUG [count] all fits in buffer
[2023-10-06 12:41:48,909]   DEBUG [count] Read in 65976 BUS records
[2023-10-06 12:41:48,909]   DEBUG [count] reading time 0.000524s
[2023-10-06 12:41:48,909]   DEBUG [count] sorting time 0.005998s
[2023-10-06 12:41:48,909]   DEBUG [count] writing time 0.001978s
[2023-10-06 12:41:50,011]   DEBUG [count] K24_kb_tcc/tmp/output.s.bus passed validation
[2023-10-06 12:41:50,012]    INFO [count] Whitelist not provided
[2023-10-06 12:41:50,012]    INFO [count] Generating whitelist K24_kb_tcc/whitelist.txt from BUS file K24_kb_tcc/tmp/output.s.bus
[2023-10-06 12:41:50,012]   DEBUG [count] bustools whitelist -o K24_kb_tcc/whitelist.txt K24_kb_tcc/tmp/output.s.bus
[2023-10-06 12:41:51,113]   DEBUG [count] Read in 49259 BUS records, wrote 2 barcodes to whitelist with threshold 979
[2023-10-06 12:41:51,113]    INFO [count] Inspecting BUS file K24_kb_tcc/tmp/output.s.bus
[2023-10-06 12:41:51,113]   DEBUG [count] bustools inspect -o K24_kb_tcc/inspect.json -w K24_kb_tcc/whitelist.txt K24_kb_tcc/tmp/output.s.bus
[2023-10-06 12:41:52,215]    INFO [count] Correcting BUS records in K24_kb_tcc/tmp/output.s.bus to K24_kb_tcc/tmp/output.s.c.bus with whitelist K24_kb_tcc/whitelist.txt
[2023-10-06 12:41:52,215]   DEBUG [count] bustools correct -o K24_kb_tcc/tmp/output.s.c.bus -w K24_kb_tcc/whitelist.txt K24_kb_tcc/tmp/output.s.bus
[2023-10-06 12:41:53,316]   DEBUG [count] Found 1 barcodes in the on-list
[2023-10-06 12:41:53,316]   DEBUG [count] Processed 49259 BUS records
[2023-10-06 12:41:53,316]   DEBUG [count] In on-list = 49259
[2023-10-06 12:41:53,316]   DEBUG [count] Corrected    = 0
[2023-10-06 12:41:53,316]   DEBUG [count] Uncorrected  = 0
[2023-10-06 12:41:54,418]   DEBUG [count] K24_kb_tcc/tmp/output.s.c.bus passed validation
[2023-10-06 12:41:54,418]    INFO [count] Sorting BUS file K24_kb_tcc/tmp/output.s.c.bus to K24_kb_tcc/output.unfiltered.bus
[2023-10-06 12:41:54,418]   DEBUG [count] bustools sort -o K24_kb_tcc/output.unfiltered.bus -T K24_kb_tcc/tmp -t 8 -m 4G K24_kb_tcc/tmp/output.s.c.bus
[2023-10-06 12:41:56,621]   DEBUG [count] all fits in buffer
[2023-10-06 12:41:56,621]   DEBUG [count] Read in 49259 BUS records
[2023-10-06 12:41:56,621]   DEBUG [count] reading time 0.000387s
[2023-10-06 12:41:56,622]   DEBUG [count] sorting time 0.002164s
[2023-10-06 12:41:56,622]   DEBUG [count] writing time 0.00185s
[2023-10-06 12:41:57,723]   DEBUG [count] K24_kb_tcc/output.unfiltered.bus passed validation
[2023-10-06 12:41:57,724]    INFO [count] Generating count matrix K24_kb_tcc/counts_unfiltered/cells_x_tcc from BUS file K24_kb_tcc/output.unfiltered.bus
[2023-10-06 12:41:57,725]   DEBUG [count] bustools count -o K24_kb_tcc/counts_unfiltered/cells_x_tcc -g ../ref/ens101_ercc92_t2g.txt -e K24_kb_tcc/matrix.ec -t K24_kb_tcc/transcripts.txt --multimapping --umi-gene K24_kb_tcc/output.unfiltered.bus
[2023-10-06 12:41:59,141]   DEBUG [count] K24_kb_tcc/counts_unfiltered/cells_x_tcc.mtx passed validation
[2023-10-06 12:41:59,142]   DEBUG [main] Removing `K24_kb_tcc/tmp` directory

Yenaled commented 12 months ago

You could just supply the mean and standard deviation of fragment length, e.g. -l 200 -s 20

biobenkj commented 12 months ago

Since this is paired end data, would I want kallisto to estimate these?

Yenaled commented 12 months ago

Oh yes, ideally yes. An flens.txt file should have been placed in the output directory from the kb count run. Are you able to find it?

biobenkj commented 12 months ago

Yep!

Example output:

0 4 1 0 1 1 1 2 1 3 4 3 4 0 2 2 3 4 1 2 13 3 0 0 2 1 0 2 0 1 2 5 2 1 8 0 2 1 3 3 2 5 3 3 4 2 1 0 0 0 1 2 2 3 1 0 0 1 1 3 2 1 0 1 1 1 1 3 7 1 1 1 1 0 0 1 2 1 2 0 3 2 1 0 0 2 1 1 3 2 0 3 4 2 0 4 1 2 3 0 2 0 3 4 0 4 2 1 5 4 3 6 2 1 8 7 3 4 3 5 5 2 6 6 7 6 11 6 8 7 7 18 4 9 6 10 12 28 10 14 9 8 13 9 12 16 17 9 13 15 10 15 8 12 19 19 22 12 24 19 21 19 28 20 16 23 16 10 26 24 22 23 21 32 23 11 28 25 14 35 20 15 37 27 25 29 27 20 20 18 35 24 31 24 34 17 18 37 36 17 30 32 30 24 30 32 33 45 34 20 38 26 35 45 32 21 24 30 18 18 25 29 33 24 31 39 46 35 30 34 29 35 23 19 28 18 23 29 47 19 16 27 30 17 13 35 17 19 29 45 42 15 17 15 30 29 17 17 16 24 36 52 20 25 26 23 23 23 14 13 21 24 31 18 10 11 31 22 10 12 9 17 13 14 15 14 15 18 23 23 29 30 14 24 18 16 25 60 21 26 40 16 24 34 39 29 23 17 35 22 11 22 9 14 29 12 20 30 8 26 14 18 25 12 27 8 18 9 17 12 8 21 8 17 11 11 9 11 15 12 13 15 4 10 11 13 7 10 14 12 16 21 7 23 21 8 31 2 15 13 13 11 15 18 10 4 13 15 6 6 15 32 22 14 14 8 10 5 9 12 19 9 22 14 6 11 7 10 5 12 13 15 8 8 4 10 14 9 12 6 9 9 12 4 5 15 14 13 2 10 7 8 10 18 5 6 26 12 5 11 10 11 17 6 10 5 2 8 3 6 11 20 12 2 8 6 6 12 14 9 3 6 15 7 3 6 4 10 1 13 13 0 7 8 10 2 8 17 7 6 4 5 11 4 15 5 8 3 6 9 15 0 4 4 3 2 2 3 0 6 29 15 35 16 11 8 8 12 8 11 4 3 8 7 5 17 9 17 10 8 9 1 5 8 10 0 2 4 3 1 9 1 4 1 3 1 22 9 3 2 5 8 6 6 5 1 1 4 3 4 7 0 3 3 2 6 5 11 2 3 2 4 2 0 11 3 2 0 1 4 4 1 3 0 6 0 2 0 2 4 3 5 6 6 4 4 3 0 2 2 1 7 3 0 4 2 0 5 1 12 7 3 6 1 2 0 4 1 4 6 1 0 4 1 6 3 6 5 1 0 14 3 3 5 3 0 2 1 0 5 1 8 7 6 3 6 1 0 0 3 3 2 5 2 1 0 5 5 0 3 0 1 6 2 1 1 1 3 2 9 4 3 1 6 3 5 9 1 1 1 2 2 1 4 1 4 3 5 2 0 2 0 1 3 5 1 0 5 2 3 3 0 0 2 3 0 1 0 1 4 1 2 1 0 1 2 6 2 1 7 2 0 3 0 3 1 3 3 1 1 1 2 2 0 0 1 2 0 4 2 0 2 0 2 1 0 1 3 1 1 0 3 0 2 1 1 2 0 1 2 0 2 7 15 7 0 1 1 1 0 0 1 0 2 0 4 0 0 0 0 3 1 0 1 1 0 0 0 0 0 2 1 0 3 1 1 1 0 0 3 0 1 0 1 0 0 1 1 1 1 1 0 1 0 2 1 1 3 4 0 0 0 0 0 1 0 5 1 0 0 0 2 0 0 1 0 1 0 0 1 0 3 1 0 3 0 0 2 0 0 0 4 1 3 0 0 1 2 0 3 0 0 0 0 0 0 0 2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 2 2 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 2 0 0 0 1 1 8 1 0 0 0 0 0 0 5 1 1 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 2 0 0 1 0 0 0 0 1 1 2 1 0 0 0 1 0 1 1 0 0 0 0 1 3 0 0 0 0 0 0 0 0 0 0 0 0 3 2 2 3 0 1 0 0 0 0 2 0 3 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

Yenaled commented 12 months ago

Cool! That's exactly what you need to provide as the fragment length file to quant-tcc.

biobenkj commented 12 months ago

Thanks! I'll give it a whirl. kb python is fantastic and really enjoy using it.

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

biobenkj commented 11 months ago

This works perfectly, closing the issue. Thanks again!

pachterlab / kb_python

Effective length normalization for full-length UMI scRNA-seq data #218