pachterlab / kallistobustools

kallisto | bustools workflow for pre-processing single-cell RNA-seq data
https://kallistobus.tools/
MIT License

Problem running kb count w/ multiple intron files #29

Closed: stmartineau closed this 2 years ago

stmartineau commented 2 years ago

I am trying to run the RNA velocity pipeline in Google Colab using my own data. Here is what I ran to build the velocity index:

!kb ref -i index.idx -g t2g.txt -f1 cdna.fa -f2 intron.fa -c1 cdna_t2c.txt -c2 intron_t2c.txt --workflow lamanno -n 4 \
Mus_musculus.GRCm38.dna.primary_assembly.fa.gz \
Mus_musculus.GRCm38.98.gtf.gz

This results in 4 different intron index files (index.idx_intron.0, index.idx_intron.1, ...) plus an index.idx_cdna file. It seems that kb count doesn't accept more than 1 index file as input, because when I run the following line I get an error:

!kb count --h5ad -i index.idx_cdna index.idx_intron0 index.idx_intron1 index.idx_intron2 index.idx_intron3 -g t2g.txt -x 10xv3 -o P17 \
-c1 spliced.txt -c2 unspliced.txt --workflow lamanno --filter bustools \
K37G9/7388-P1_S1_L001_I1_001.fastq.gz \
K37G9/7388-P1_S1_L001_I2_001.fastq.gz \
K37G9/7388-P1_S1_L001_R1_001.fastq.gz \
K37G9/7388-P1_S1_L001_R2_001.fastq.gz

Here is the error:

kb: error: unrecognized arguments: K37G9/7388-P1_S1_L001_I1_001.fastq.gz K37G9/7388-P1_S1_L001_I2_001.fastq.gz K37G9/7388-P1_S1_L001_R1_001.fastq.gz K37G9/7388-P1_S1_L001_R2_001.fastq.gz

It runs when I have only 1 index file as input. Please help me figure out what I am doing wrong here.

Thanks.

amcdavid commented 2 years ago

You need to comma-separate the index files, i.e.:

!kb count --h5ad -i index.idx_cdna,index.idx_intron0,index.idx_intron1,index.idx_intron2,index.idx_intron3 -g t2g.txt -x 10xv3 -o P17 \
-c1 spliced.txt -c2 unspliced.txt --workflow lamanno --filter bustools \
K37G9/7388-P1_S1_L001_I1_001.fastq.gz \
K37G9/7388-P1_S1_L001_I2_001.fastq.gz \
K37G9/7388-P1_S1_L001_R1_001.fastq.gz \
K37G9/7388-P1_S1_L001_R2_001.fastq.gz
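
One plausible reading of the error message, assuming kb parses its command line with Python's argparse (the "kb: error: unrecognized arguments:" wording matches argparse's format, but nothing in this thread confirms it): -i takes a single value, so the space-separated intron indices are consumed as the positional FASTQ arguments, and the real FASTQ files that come after the later options are left over. The toy parser below is hypothetical and only mimics the shape of the command; it is not kb's actual parser, and the FASTQ names are placeholders.

```python
import argparse

# Hypothetical stand-in for the kb count parser: only the pieces needed
# to illustrate the behaviour are modelled here.
parser = argparse.ArgumentParser(prog="toy-count")
parser.add_argument("-i")                 # accepts exactly one value
parser.add_argument("-g")
parser.add_argument("fastqs", nargs="+")  # positional FASTQ files

# Space-separated indices: only the first one is bound to -i.
args, leftover = parser.parse_known_args(
    ["-i", "index.idx_cdna", "index.idx_intron0", "index.idx_intron1",
     "-g", "t2g.txt", "R1.fastq.gz", "R2.fastq.gz"]
)
print("value of -i:", args.i)
print("taken as FASTQs:", args.fastqs)   # the extra index files
print("left over:", leftover)            # what parse_args() would reject

# Comma-separated indices: -i receives the whole list as one string.
args_ok, leftover_ok = parser.parse_known_args(
    ["-i", "index.idx_cdna,index.idx_intron0,index.idx_intron1",
     "-g", "t2g.txt", "R1.fastq.gz", "R2.fastq.gz"]
)
print("value of -i:", args_ok.i)
print("taken as FASTQs:", args_ok.fastqs)
print("left over:", leftover_ok)
```

With parse_args() instead of parse_known_args(), the first call would exit with an "unrecognized arguments" error listing the FASTQ files, which is the same shape as the error reported above.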

Lioscro commented 2 years ago

Hi, @stmartineau, @amcdavid is correct -- you need to provide them as a comma-delimited list without spaces.
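
Since the commands here are run from a Colab notebook, the comma-delimited value can also be assembled in Python rather than typed by hand. This is a minimal sketch: the helper name comma_join_indices is made up for illustration, the flags are only those already used in this thread, and the index file names are copied from the commands above and may need adjusting to match what kb ref actually wrote to disk.

```python
import shlex

def comma_join_indices(index_paths):
    """Join index paths into one comma-delimited string with no spaces."""
    cleaned = [p.strip() for p in index_paths]
    for p in cleaned:
        # A comma or space inside a path would break the delimited list.
        if not p or "," in p or " " in p:
            raise ValueError(f"unusable index path: {p!r}")
    return ",".join(cleaned)

# File names as written in the kb count command in this thread (assumption:
# check the actual names of the files produced by kb ref).
indices = ["index.idx_cdna"] + [f"index.idx_intron{i}" for i in range(4)]
index_arg = comma_join_indices(indices)

fastqs = [
    "K37G9/7388-P1_S1_L001_I1_001.fastq.gz",
    "K37G9/7388-P1_S1_L001_I2_001.fastq.gz",
    "K37G9/7388-P1_S1_L001_R1_001.fastq.gz",
    "K37G9/7388-P1_S1_L001_R2_001.fastq.gz",
]

cmd = [
    "kb", "count", "--h5ad", "-i", index_arg, "-g", "t2g.txt",
    "-x", "10xv3", "-o", "P17", "-c1", "spliced.txt", "-c2", "unspliced.txt",
    "--workflow", "lamanno", "--filter", "bustools", *fastqs,
]

# Print the command for inspection; it is not executed here.
print(shlex.join(cmd))
```

The printed line can be pasted after a ! in a notebook cell, or the cmd list can be passed to subprocess.run once the file names have been checked.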

However, we don't recommend using split indices, as they cause read loss. We will be deprecating this feature in the next major release. Hopefully by then, the new indexing scheme we are currently working on (which drastically reduces memory usage) will be available.

In the meantime, we recommend building a "light" index (memory usage will drop by approximately half), as described in this comment: https://github.com/pachterlab/kb_python/issues/117#issuecomment-1027201134

Note that if you really must use Google Colab, you unfortunately don't have much choice other than to accept possible read loss.