pachterlab / kallisto-D

BSD 2-Clause "Simplified" License
5 stars 3 forks source link

Running lamanno workflow after kallisto-D index in nuclei data #9

Open mcortes-lopez opened 1 year ago

mcortes-lopez commented 1 year ago

Hi, I am trying to do velocity analysis in nuclear data, implementing some of the ideas discussed in your paper. I followed some of the steps contained in the scripts from HSHMP_2022. Then I realized that to work with the kb count output in scVelo I would need to run the lamanno workflow (I read in some kallisto issue that the nuclei workflow is not advised, this after I tried it and no spliced/unspliced matrices were found in my loom object). However, something seems to fail during the quantification, as I get empty unspliced matrices. The spliced matrices seem to be fine.

Here are the basic steps I follow:

SCRIPTSPATH="${MAINDIR}/tools/HSHMP_2022/"
GTF="${MAINDIR}/tools/genomes/Homo_sapiens/genes.gtf.gz"
GENOMEFA="${MAINDIR}/tools/genomes/Homo_sapiens/genome.fa.gz"
OUTPATH="${MAINDIR}/glioma_project/velocity/KallistoOFF_nuclei/"
OUTPATH_ORIG="${MAINDIR}/glioma_project/velocity/KallistoOFF/"

mkdir -p ${OUTPATH}

# Current kallisto
kallisto_path="/gpfs/commons/home/mcorteslopez/.conda/envs/user_mamba/envs/scrna-ambiguity/bin/kallisto"

# Most recent (26.04.2023) kallisto-D version
kallistod="/gpfs/commons/home/mcorteslopez/opt/bin/kallisto"

# Usual kallisto cDNA index

out_dir="${OUTPATH}/kallisto_index_cdna/"
mkdir -p $out_dir

kb ref -i $out_dir/index.idx -g $out_dir/t2g.txt -f1 $out_dir/cdna.fa -f2 $out_dir/intron.fa \
 -c1 $out_dir/cdna_t2c.txt -c2 $out_dir/intron_t2c.txt \
 --verbose --workflow nucleus $GENOMEFA $GTF

kallisto index -i $out_dir/index.idx $out_dir/cdna.fa $out_dir/intron.fa

# Index nascent transcripts index

prev_dir="$out_dir"
out_dir="${OUTPATH}/kallisto_index_offlist/"
mkdir -p $out_dir
cp $prev_dir/* $out_dir

$kallistod index -t 44 --d-list ${OUTPATH}/hg38nascent_tx.fa -i $out_dir/index.idx $out_dir/cdna.fa $out_dir/intron.fa

# 10x snRNA files
data_files_nuc="TKU3851_S1_L004_R1_001.fastq.gz TKU3851_S1_L004_R2_001.fastq.gz TKU3851_S2_L004_R1_001.fastq.gz TKU3851_S2_L004_R2_001.fastq.gz TKU3851_S3_L004_R1_001.fastq.gz TKU3851_S3_L004_R2_001.fastq.gz TKU3851_S4_L004_R1_001.fastq.gz TKU3851_S4_L004_R2_001.fastq.gz"

# Run standard kallisto
#/usr/bin/time -v kb count --overwrite --loom --gene-names \
    -i ${OUTPATH}/kallisto_index_cdna/index.idx \
    -g ${OUTPATH}/t2g.txt \
    -c1 ${OUTPATH}/kallisto_index_cdna/cdna_t2c.txt \
    -c2 ${OUTPATH}/kallisto_index_cdna/intron_t2c.txt \
    -o ${OUTPATH}/results/kallisto_index_cdna/ -t 30 -x 10XV3 \
    --workflow lamanno ${data_files_nuc} 1> ${OUTPATH}/logs/sn_TKU3851_kallisto_standard_stdout.txt 2> ${OUTPATH}/logs/sn_TKU3851_kallisto_standard_stderr.txt

# Run kallisto offlist
/usr/bin/time -v kb count --kallisto $kallistod --overwrite --loom --gene-names \
    -i ${OUTPATH}/kallisto_index_offlist/index.idx \
    -g ${OUTPATH}/t2g.txt \
    -c1 ${OUTPATH}/kallisto_index_offlist/cdna_t2c.txt \
    -c2 ${OUTPATH}/kallisto_index_offlist/intron_t2c.txt \
    -o ${OUTPATH}/results/kallisto_index_offlist/ -t 40 -x 10XV3 \
    --workflow lamanno ${data_files_nuc} 1> ${OUTPATH}/logs/sn_TKU3851_kallisto_offlist_stdout.txt 2> ${OUTPATH}/logs/sn_TKU3851_kallisto_offlist_stderr.txt

I get a unspliced bus file, and the json inspect does not look so different from the spliced, however, the counts matrix that is contained in the loom object is empty. I suspect the issue is in the -c1 and -c2 files, because I do not think kallisto-D has an option to generate these files. Is there a way to obtain these files? Or would you recommend another strategy to be able to work with the kallisto-D quantifications in scVelo. Also, on these lines, would potentially kallisto-D be useful to mask regions in the genome that might alter the quantifications? Like regions that tend to be mis-primed?

Yenaled commented 1 year ago

Yeah, there's really no need for a nucleus-specific workflow for kb count; it makes sense to simply produce the three matrices (via what we currently call the lamanno workflow) and sum up the ambiguous and nascent (unspliced) matrices for nucleus data since we're indexing nascent+mature transcripts anyway.

First, I recommend installing the latest version of kallisto-D, the kb_python dlist branch, and the bustools dlist branch. Supply the bustools path to kb count via --bustools $bustools .

Second, kallisto-D cannot generate the c1 and c2 files (kb ref is responsible for doing all the fasta/gtf manipulation+extraction stuff).

At the time of the preprint, we were unable to create an RNA velocity workflow (because that required modifications to multiple different software, including both kb-python and bustools as well as one of kb-python's dependencies), therefore we only focused on generating a single count matrix which was enough to illustrate our point about accurately classifying what's "spliced", what's "unspliced", what's in-between, and how ambiguous things are allocated in cellular vs. nuclei workflows (by current convention using strong albeit bad assumptions) and how to filter out false positive mappings in all cases. I had wanted to create an RNA velocity workflow at that time but I had some personal issues (unfortunately, some rather serious medical issues) come up at the time.

Some time later, I finally completed the RNA velocity workflow (which, as you correctly point out and as alluded to in the preprint, involves indexing nascent transcripts in addition to cDNA).

Install the newest version of kb-python and do the following:

kb ref --kallisto=$kallisto --d-list=genome.fasta --workflow=lamanno -t 44 -i index.idx -g t2g.txt -c1 c1.txt -c2 c2.txt -f1 cdna.fa -f2 nascent.fa genome.fasta genome.gtf

Note that the --d-list option above is optional. It basically creates relevant distinguish k-mers across the entire genome to prevent random crap in intergenic regions from decreasing mapping specificity (not really necessary unless you're dealing with a species with crappy annotation and it will make "kb ref" take longer and will make mapping consume a couple gigabytes more memory).

Anyway, proceed to run kb count:

kb count --sum=cell --workflow=lamanno ...

You'll get a cells_x_genes.cell.mtx (these are your "spliced counts" -- the ambiguous counts are allocated into this matrix) and you'll get a cells_x_genes.2.mtx (these are your "unspliced counts"). You feed these two matrices into your RNA velocity workflow.

You can also use --loom and --h5ad to get those two matrices in those file formats.

P.S. We're still updating the preprint (the first version presented some good ideas but needs some serious polishing and few significant corrections).

P.P.S. For future issues, just use the main kallisto repo since I think we're going to migrate there soon for this new release (which should be coming in a matter of weeks if not days).