pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
141 stars 24 forks source link

No Reads With 5' Strand Alignment #108

Closed kanefos closed 3 years ago

kanefos commented 3 years ago

Describe the issue Firstly I just want to say I'm a huge fan of this tool and really appreciate how easy it is to run and the extremely clarity of the documentation. Decidedly one of the best tools I've worked with. For my project I am running alignment of 10x genomics Single Cell V(D)J 5’ Gene Expression library fastq files and acquiring practically zero aligned reads in my output file. My input are fastq files output by cellranger mkfastq to a cellranger_fastq directory. I am running 5' R2-only, but get same result on all fastq files. This is consistent across my samples.

I have aligned to a mouse index constructed as per this tutorial and am using the 10xv3 whitelist from here. When I run using the automatic 10xv3 and v2 whitelists I get the same result. The index from the Python intro tutorial also yields same low-read output result.

What is the exact command that was run?

kb count -i index_dir/transcriptome.idx -g index_dir/transcripts_to_genes.txt -x 10xv3 --h5ad -o output_dir  input_dir/sample_1_L001_R2_001.fastq.gz b  input_dir/sample_1_L002_R2_001.fastq.gz

When I analyse the output: Python:

d = ad.read_h5ad('adata.h5ad')
d.shape # 236494 × 55421
d.X.sum() # 137308, almost none aligned reads

Closing Is this because of a the 5' library I am using? I have read that while historically kallisto did not support stranded libraries (if I am interpreting the term correctly), it does now. I could not find any references to this in the documentation so was hoping I could get some support. For instance, do I need to construct a 5'-specific index?

Lioscro commented 3 years ago

Hi, @kanefos,

It seems like V(D)J libraries are not for measuring gene expression but novel rearrangements in specific genes, which isn't something kallisto | bustools can handle.

(Please feel free to correct me on this, @pmelsted)

kanefos commented 3 years ago

It seems like V(D)J libraries are not for measuring gene expression but novel rearrangements in specific genes, which isn't something kallisto | bustools can handle.

Sorry I may have worded my initial question oddly. This is gene expression using the 10x 5' chemistry kit (which is compatible with their immune receptor sequencing protocols)

pmelsted commented 3 years ago

For the invocation

kb count -i index_dir/transcriptome.idx -g index_dir/transcripts_to_genes.txt -x 10xv3 --h5ad -o output_dir input_dir/sample_1_L001_R2_001.fastq.gz b input_dir/sample_1_L002_R2_001.fastq.gz

first, the files should be R1 and R2, in that order. The I1 file is not needed. The R1 contains the umi+barcode sequences and the R2 contains the cdna sequence. In your input it has an extra b, this will confuse the program as it doesn't exist.

Can you show the output from running kb (or kallisto directly) with these modifications

kanefos commented 3 years ago

Yes thank you, running R1 then R2 in order resolved my issue. And thanks for extra helpful info about FASTQ files