pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
141 stars 23 forks source link

Using kb -count with SMART-Seq mRNA LP (with UMIs) #229

Closed franziskadenk closed 6 months ago

franziskadenk commented 6 months ago

Dear Pachter team - first of all - thanks so much for providing such great tools for the community!

I have been using kb bustools on Google Colabs for several years now to align plate-based scRNA-seq data. It always works wonderfully. So far, we have either used SMARTseq2 or SMARTseq3, but now a collaborator has sent me some data from an even newer kit (apparently 'almost version 4') called SMART-Seq mRNA LP (with UMIs).

My issue is - is there a way I can align this with kb - count? I only have two fastq files for each sample (r1, r2) and no index files?

kb count gives me an output if I run it as normal with '-SMARTSEQ2', but then presumably I lose the UMI information? When I try and run it with '-SMARTSEQ3' it complains, because it also expects two index files each...

Any advice would be much appreciated. And apologies if it is a dumb question - I'm self-taught, so not the most versatile when I need to deviate from 'standard' code.

Thank you so much!

Yenaled commented 6 months ago

It’s doable but I have a question: Where are the UMIs in the read and how long are they? Also, I’m assuming you have a bunch of fastq files and each one represents a unique cell?

I tried looking it up but couldn’t find much detail online about this new kit.

franziskadenk commented 6 months ago

Thanks so much, Yenaled, for your fast reply!!

Yes, I have a pair of FastQ files for each cell in a 96 well plate.

The UMIs are 8bp long. As to where they are, that is harder to determine; this is the info on the kit: https://www.takarabio.com/documents/User%20Manual/SMART/SMART-Seq%20mRNA%20LP%20%28with%20UMIs%29%20User%20Manual.pdf

Figure 3 suggests that they are 5' UMIs, wedged between the adapter and the read sequence.

Thanks again for any advice!

Yenaled commented 6 months ago

Hmm ok, I'm not sure -- e.g. not sure the sequence of the TSO that was used, etc.

I guess you could run FASTQC on the files and a few hundred reads manually to see what the library structure looks like.

Btw, you could always reformat the 96 files into a single large FASTQ file and try processing it like smart-seq3 data. An example of a tool is https://splitcode.readthedocs.io/en/latest/remultiplex_guide.html (and you can simply supply the resulting barcodes.fastq.gz twice -- for the two index files). I'm just not sure if mRNA LP is the same read structure as smart-seq3.

franziskadenk commented 6 months ago

This works like a charm until an issue with the onlist. I presume to solve this, I should provide an onlist? Is that simply the barcodes fast file again? So sorry for all the dumb questions and thanks again for your help!

This was the command I ran after remultiplexing: !kb count --verbose --overwrite --h5ad -i '/content/drive/MyDrive/Ewan_2023/fastq/index.idx' -g '/content/drive/MyDrive/Ewan_2023/fastq/t2g.txt' -x SMARTSEQ3 -o '/content/drive/MyDrive/Ewan_2023/output' barcodes.fastq.gz barcodes.fastq.gz out_R1.fastq.gz out_R2.fastq.gz

This is the error I get: [2023-12-14 08:28:48,681] DEBUG [main] Printing verbose output [2023-12-14 08:28:50,805] DEBUG [main] kallisto binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/kallisto/kallisto [2023-12-14 08:28:50,805] DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools [2023-12-14 08:28:50,806] DEBUG [main] Creating /content/drive/MyDrive/Ewan_2023/output/tmp directory [2023-12-14 08:28:50,811] DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='/content/drive/MyDrive/Ewan_2023/fastq/index.idx', g='/content/drive/MyDrive/Ewan_2023/fastq/t2g.txt', x='SMARTSEQ3', o='/content/drive/MyDrive/Ewan_2023/output', num=False, w=None, r=None, t=8, m='2G', strand=None, inleaved=False, genomebam=False, aa=False, gtf=None, chromosomes=None, workflow='standard', em=False, mm=False, tcc=False, filter=None, filter_threshold=None, c1=None, c2=None, overwrite=True, dry_run=False, batch_barcodes=False, loom=False, h5ad=True, loom_names='barcode,target_name', sum='none', cellranger=False, gene_names=False, N=None, report=False, no_inspect=False, kallisto='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/kallisto/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools', no_validate=False, no_fragment=False, parity=None, fragment_l=None, fragment_s=None, bootstraps=None, matrix_to_files=False, matrix_to_directories=False, fastqs=['barcodes.fastq.gz', 'barcodes.fastq.gz', 'out_R1.fastq.gz', 'out_R2.fastq.gz']) [2023-12-14 08:28:53,329] INFO [count] Using index /content/drive/MyDrive/Ewan_2023/fastq/index.idx to generate BUS file to /content/drive/MyDrive/Ewan_2023/output from [2023-12-14 08:28:53,329] INFO [count] barcodes.fastq.gz [2023-12-14 08:28:53,329] INFO [count] barcodes.fastq.gz [2023-12-14 08:28:53,329] INFO [count] out_R1.fastq.gz [2023-12-14 08:28:53,329] INFO [count] out_R2.fastq.gz [2023-12-14 08:28:53,329] DEBUG [count] kallisto bus -i /content/drive/MyDrive/Ewan_2023/fastq/index.idx -o /content/drive/MyDrive/Ewan_2023/output -x SMARTSEQ3 -t 8 --paired barcodes.fastq.gz barcodes.fastq.gz out_R1.fastq.gz out_R2.fastq.gz [2023-12-14 08:28:53,431] DEBUG [count] [2023-12-14 08:28:53,431] DEBUG [count] Warning: you asked for 8, but only 2 cores on the machine [2023-12-14 08:28:53,431] DEBUG [count] [bus] Using ATTGCGCAATG as UMI tag sequence [2023-12-14 08:28:53,431] DEBUG [count] [bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology [2023-12-14 08:29:09,501] DEBUG [count] [index] k-mer length: 31 [2023-12-14 08:29:13,314] DEBUG [count] [index] number of targets: 126,126 [2023-12-14 08:29:13,315] DEBUG [count] [index] number of k-mers: 110,241,893 [2023-12-14 08:29:13,315] DEBUG [count] [index] number of D-list k-mers: 2,765,905 [2023-12-14 08:29:13,415] DEBUG [count] [quant] will process sample 1: barcodes.fastq.gz [2023-12-14 08:29:13,415] DEBUG [count] barcodes.fastq.gz [2023-12-14 08:29:13,415] DEBUG [count] out_R1.fastq.gz [2023-12-14 08:29:13,415] DEBUG [count] out_R2.fastq.gz [2023-12-14 08:29:39,968] DEBUG [count] [quant] finding pseudoalignments for the reads ... done [2023-12-14 08:29:39,968] DEBUG [count] [quant] processed 941,680 reads, 640,062 reads pseudoaligned [2023-12-14 08:29:39,968] DEBUG [count] [quant] estimated average fragment length: 197.905 [2023-12-14 08:29:40,470] DEBUG [count] [2023-12-14 08:29:42,173] INFO [count] Sorting BUS file /content/drive/MyDrive/Ewan_2023/output/output.bus to /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.bus [2023-12-14 08:29:42,174] DEBUG [count] bustools sort -o /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.bus -T /content/drive/MyDrive/Ewan_2023/output/tmp -t 8 -m 2G /content/drive/MyDrive/Ewan_2023/output/output.bus [2023-12-14 08:29:42,275] DEBUG [count] Warning: Number of threads cannot be greater than or equal to 2. Setting number of threads to 2 [2023-12-14 08:29:44,885] DEBUG [count] partition time: 0.003512s [2023-12-14 08:29:44,986] DEBUG [count] all fits in buffer [2023-12-14 08:29:46,392] DEBUG [count] Read in 640062 BUS records [2023-12-14 08:29:46,393] DEBUG [count] reading time 0.024383s [2023-12-14 08:29:46,393] DEBUG [count] sorting time 0.133096s [2023-12-14 08:29:46,393] DEBUG [count] writing time 0.093886s [2023-12-14 08:29:46,393] INFO [count] On-list not provided [2023-12-14 08:29:46,393] INFO [count] Generating on-list /content/drive/MyDrive/Ewan_2023/output/whitelist.txt from BUS file /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.bus [2023-12-14 08:29:46,393] DEBUG [count] bustools allowlist -o /content/drive/MyDrive/Ewan_2023/output/whitelist.txt /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.bus [2023-12-14 08:29:47,797] DEBUG [count] Read in 354597 BUS records, wrote 20 barcodes to on-list with threshold 0 [2023-12-14 08:29:47,799] INFO [count] Inspecting BUS file /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.bus [2023-12-14 08:29:47,799] DEBUG [count] bustools inspect -o /content/drive/MyDrive/Ewan_2023/output/inspect.json -w /content/drive/MyDrive/Ewan_2023/output/whitelist.txt /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.bus [2023-12-14 08:29:49,203] INFO [count] Correcting BUS records in /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.bus to /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.c.bus with on-list /content/drive/MyDrive/Ewan_2023/output/whitelist.txt [2023-12-14 08:29:49,204] DEBUG [count] bustools correct -o /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.c.bus -w /content/drive/MyDrive/Ewan_2023/output/whitelist.txt /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.bus [2023-12-14 08:29:50,307] DEBUG [count] Error: on-list file malformed; no barcodes found [2023-12-14 08:29:50,307] ERROR [count] Error: on-list file malformed; no barcodes found [2023-12-14 08:29:50,308] ERROR [main] An exception occurred

Yenaled commented 6 months ago

In your kb count command, you can just put in: -w None

franziskadenk commented 6 months ago

There may be a more fundamental issue, as you suggested - below is what I'm now getting. Perhaps you have another easy solution? If not, I may run it without UMI info in SMARTSEQ2 for now, just to get an idea of what the data look like. This is only a pilot.

!kb count --verbose --overwrite --h5ad -i '/content/drive/MyDrive/Ewan_2023/fastq/index.idx' -g '/content/drive/MyDrive/Ewan_2023/fastq/t2g.txt' -x SMARTSEQ3 -w none -o '/content/drive/MyDrive/Ewan_2023/output' barcodes.fastq.gz barcodes.fastq.gz out_R1.fastq.gz out_R2.fastq.gz

[2023-12-14 08:47:14,559] DEBUG [main] Printing verbose output [2023-12-14 08:47:16,765] DEBUG [main] kallisto binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/kallisto/kallisto [2023-12-14 08:47:16,765] DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools [2023-12-14 08:47:16,766] DEBUG [main] Creating /content/drive/MyDrive/Ewan_2023/output/tmp directory [2023-12-14 08:47:16,769] DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='/content/drive/MyDrive/Ewan_2023/fastq/index.idx', g='/content/drive/MyDrive/Ewan_2023/fastq/t2g.txt', x='SMARTSEQ3', o='/content/drive/MyDrive/Ewan_2023/output', num=False, w='none', r=None, t=8, m='2G', strand=None, inleaved=False, genomebam=False, aa=False, gtf=None, chromosomes=None, workflow='standard', em=False, mm=False, tcc=False, filter=None, filter_threshold=None, c1=None, c2=None, overwrite=True, dry_run=False, batch_barcodes=False, loom=False, h5ad=True, loom_names='barcode,target_name', sum='none', cellranger=False, gene_names=False, N=None, report=False, no_inspect=False, kallisto='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/kallisto/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools', no_validate=False, no_fragment=False, parity=None, fragment_l=None, fragment_s=None, bootstraps=None, matrix_to_files=False, matrix_to_directories=False, fastqs=['barcodes.fastq.gz', 'barcodes.fastq.gz', 'out_R1.fastq.gz', 'out_R2.fastq.gz']) [2023-12-14 08:47:19,305] INFO [count] Using index /content/drive/MyDrive/Ewan_2023/fastq/index.idx to generate BUS file to /content/drive/MyDrive/Ewan_2023/output from [2023-12-14 08:47:19,305] INFO [count] barcodes.fastq.gz [2023-12-14 08:47:19,305] INFO [count] barcodes.fastq.gz [2023-12-14 08:47:19,305] INFO [count] out_R1.fastq.gz [2023-12-14 08:47:19,305] INFO [count] out_R2.fastq.gz [2023-12-14 08:47:19,305] DEBUG [count] kallisto bus -i /content/drive/MyDrive/Ewan_2023/fastq/index.idx -o /content/drive/MyDrive/Ewan_2023/output -x SMARTSEQ3 -t 8 --paired barcodes.fastq.gz barcodes.fastq.gz out_R1.fastq.gz out_R2.fastq.gz [2023-12-14 08:47:19,407] DEBUG [count] [2023-12-14 08:47:19,407] DEBUG [count] Warning: you asked for 8, but only 2 cores on the machine [2023-12-14 08:47:19,407] DEBUG [count] [bus] Using ATTGCGCAATG as UMI tag sequence [2023-12-14 08:47:19,407] DEBUG [count] [bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology [2023-12-14 08:47:36,195] DEBUG [count] [index] k-mer length: 31 [2023-12-14 08:47:41,022] DEBUG [count] [index] number of targets: 126,126 [2023-12-14 08:47:41,022] DEBUG [count] [index] number of k-mers: 110,241,893 [2023-12-14 08:47:41,022] DEBUG [count] [index] number of D-list k-mers: 2,765,905 [2023-12-14 08:47:41,023] DEBUG [count] [quant] will process sample 1: barcodes.fastq.gz [2023-12-14 08:47:41,023] DEBUG [count] barcodes.fastq.gz [2023-12-14 08:47:41,023] DEBUG [count] out_R1.fastq.gz [2023-12-14 08:47:41,023] DEBUG [count] out_R2.fastq.gz [2023-12-14 08:48:07,173] DEBUG [count] [quant] finding pseudoalignments for the reads ... done [2023-12-14 08:48:07,173] DEBUG [count] [quant] processed 941,680 reads, 640,062 reads pseudoaligned [2023-12-14 08:48:07,173] DEBUG [count] [quant] estimated average fragment length: 197.905 [2023-12-14 08:48:07,574] DEBUG [count] [2023-12-14 08:48:09,177] INFO [count] Sorting BUS file /content/drive/MyDrive/Ewan_2023/output/output.bus to /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.bus [2023-12-14 08:48:09,177] DEBUG [count] bustools sort -o /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.bus -T /content/drive/MyDrive/Ewan_2023/output/tmp -t 8 -m 2G /content/drive/MyDrive/Ewan_2023/output/output.bus [2023-12-14 08:48:09,279] DEBUG [count] Warning: Number of threads cannot be greater than or equal to 2. Setting number of threads to 2 [2023-12-14 08:48:10,983] DEBUG [count] partition time: 0.002701s [2023-12-14 08:48:10,984] DEBUG [count] all fits in buffer [2023-12-14 08:48:12,287] DEBUG [count] Read in 640062 BUS records [2023-12-14 08:48:12,287] DEBUG [count] reading time 0.011352s [2023-12-14 08:48:12,287] DEBUG [count] sorting time 0.140314s [2023-12-14 08:48:12,287] DEBUG [count] writing time 0.03965s [2023-12-14 08:48:12,288] INFO [count] Inspecting BUS file /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.bus [2023-12-14 08:48:12,288] DEBUG [count] bustools inspect -o /content/drive/MyDrive/Ewan_2023/output/inspect.json /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.bus [2023-12-14 08:48:13,394] INFO [count] Capturing records from BUS file /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.bus to /content/drive/MyDrive/Ewan_2023/output/output_internal.bus with capture list /content/drive/MyDrive/Ewan_2023/output/capture_nonUMI.txt [2023-12-14 08:48:13,394] DEBUG [count] bustools capture -o /content/drive/MyDrive/Ewan_2023/output/output_internal.bus -c /content/drive/MyDrive/Ewan_2023/output/capture_nonUMI.txt --umis /content/drive/MyDrive/Ewan_2023/output/tmp/output.s.bus [2023-12-14 08:48:14,512] DEBUG [count] Read in 354597 BUS records, wrote 154865 BUS records [2023-12-14 08:48:15,618] DEBUG [count] /content/drive/MyDrive/Ewan_2023/output/output_internal.bus passed validation [2023-12-14 08:48:15,619] INFO [count] Inspecting BUS file /content/drive/MyDrive/Ewan_2023/output/output_internal.bus [2023-12-14 08:48:15,619] DEBUG [count] bustools inspect -o /content/drive/MyDrive/Ewan_2023/output/inspect_internal.json /content/drive/MyDrive/Ewan_2023/output/output_internal.bus [2023-12-14 08:48:16,724] INFO [count] Sorting BUS file /content/drive/MyDrive/Ewan_2023/output/output_internal.bus to /content/drive/MyDrive/Ewan_2023/output/output_internal.unfiltered.bus [2023-12-14 08:48:16,724] DEBUG [count] bustools sort -o /content/drive/MyDrive/Ewan_2023/output/output_internal.unfiltered.bus -T /content/drive/MyDrive/Ewan_2023/output/tmp -t 8 -m 2G /content/drive/MyDrive/Ewan_2023/output/output_internal.bus [2023-12-14 08:48:16,826] DEBUG [count] Warning: Number of threads cannot be greater than or equal to 2. Setting number of threads to 2 [2023-12-14 08:48:18,129] DEBUG [count] partition time: 0.000603s [2023-12-14 08:48:18,129] DEBUG [count] all fits in buffer [2023-12-14 08:48:19,331] DEBUG [count] Read in 154865 BUS records [2023-12-14 08:48:19,331] DEBUG [count] reading time 0.002599s [2023-12-14 08:48:19,331] DEBUG [count] sorting time 0.022913s [2023-12-14 08:48:19,331] DEBUG [count] writing time 0.010183s [2023-12-14 08:48:19,332] INFO [count] Generating count matrix /content/drive/MyDrive/Ewan_2023/output/counts_unfiltered_internal/cells_x_genes from BUS file /content/drive/MyDrive/Ewan_2023/output/output_internal.unfiltered.bus [2023-12-14 08:48:19,333] DEBUG [count] bustools count -o /content/drive/MyDrive/Ewan_2023/output/counts_unfiltered_internal/cells_x_genes -g /content/drive/MyDrive/Ewan_2023/fastq/t2g.txt -e /content/drive/MyDrive/Ewan_2023/output/matrix.ec -t /content/drive/MyDrive/Ewan_2023/output/transcripts.txt --genecounts --cm /content/drive/MyDrive/Ewan_2023/output/output_internal.unfiltered.bus [2023-12-14 08:48:21,165] DEBUG [count] /content/drive/MyDrive/Ewan_2023/output/counts_unfiltered_internal/cells_x_genes.mtx passed validation [2023-12-14 08:48:21,166] INFO [count] Writing gene names to file /content/drive/MyDrive/Ewan_2023/output/counts_unfiltered_internal/cells_x_genes.genes.names.txt [2023-12-14 08:48:21,454] WARNING [count] 1599 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead. [2023-12-14 08:48:21,516] INFO [count] Reading matrix /content/drive/MyDrive/Ewan_2023/output/counts_unfiltered_internal/cells_x_genes.mtx [2023-12-14 08:48:21,694] ERROR [main] An exception occurred Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/kb_python/main.py", line 1618, in main COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir) File "/usr/local/lib/python3.10/dist-packages/kb_python/main.py", line 703, in parse_count count( File "/usr/local/lib/python3.10/dist-packages/ngs_tools/logging.py", line 62, in inner return func(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/kb_python/count.py", line 1561, in count convert_matrix( File "/usr/local/lib/python3.10/dist-packages/kb_python/count.py", line 717, in convert_matrix ) if tcc else import_matrix_as_anndata( File "/usr/local/lib/python3.10/dist-packages/kb_python/utils.py", line 673, in import_matrix_as_anndata anndata.AnnData(X=mtx.tocsr(), obs=df_barcodes, var=df_genes) File "/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py", line 362, in init self._init_as_actual( File "/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py", line 545, in _init_as_actual self._obs = _gen_dataframe( File "/usr/lib/python3.10/functools.py", line 889, in wrapper return dispatch(args[0].class)(args, **kw) File "/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py", line 180, in _gen_dataframe_df raise _mk_df_error(source, attr, length, len(anno)) ValueError: Observations annot. obs must have as many rows as X has rows (16), but has 0 rows. [2023-12-14 08:48:21,706] DEBUG [main] Removing /content/drive/MyDrive/Ewan_2023/output/tmp directory

Yenaled commented 6 months ago

Hmm, what do the first few lines of /content/drive/MyDrive/Ewan_2023/output/counts_unfiltered_internal/cells_x_genes.mtx look like?

Because it looks like it's working (just the anndata is giving some problems).

Yenaled commented 6 months ago

I'm also trying a few things on my end (with smartseq2/3-like data) and seeing if I can reproduce the problems. Hold on a sec.

franziskadenk commented 6 months ago

Thanks!! This is what the first few lines look like:

%%MatrixMarket matrix coordinate real general % 16 34183 82873
1 15 1 1 19 4 1 24 1

Yenaled commented 6 months ago

Ah, yes, it is actually working!

I'll tell you what the problem is: Each entry in the barcodes.fastq.gz is 16-bp in length (and you're supplying two, so that makes 32-bp length -- for BUS files, barcodes should be <32-bp length).

Easy fix? Just make the barcodes in barcodes.fastq.gz shorter by trimming off the first nucleotide!

awk '{if(NR%2){print $0}else{print substr($0,2)}}' <(zcat barcodes.fastq.gz) > barcodes_new.fastq

And then supply barcodes_new.fastq twice.

franziskadenk commented 6 months ago

Genius! My collaborators will be very happy. Thank you SO much for all your great help - and in the middle in the night no less.

Have a wonderful holiday season!

biobenkj commented 6 months ago

Just to chime in - if it's a similar read structure of the takara pico V3, read 2 is 8 bp UMI - 6 bp linker - cDNA and read 1 is all cDNA (reverse stranded), you could throw the -x STORM-seq preset in kb count and be sure to add --umi-gene