pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License
649 stars 171 forks source link

Kallisto total UMI difference between velocity and novelocity modes. (single cell) #419

Open mortunco opened 9 months ago

mortunco commented 9 months ago

Hello,

Thank you for keeping kallisto alive. These days, more and more I realise how its actually important and hard maintain and update tools across many years.

Background:

I am processing same 10x 5' v2 chemistry sample fastqs with kallisto. I found great number of total UMI difference between velocity and nonvelocity runs. I shared the commands i used to run the scripts

Velocity mode

conda activate ~/mambaforge/envs/kallisto
KALLISTO_INDEX_DIR="foo/bar/kallisto-velocity-index" ### downloaded with kb ref -d human --workflow nac
kb count --h5ad -i $KALLISTO_INDEX_DIR/index.idx -g $KALLISTO_INDEX_DIR/t2g.txt -x 10xv2 --workflow nac --h5ad -o kallisto \
        -c1 $KALLISTO_INDEX_DIR/spliced_t2c.txt -c2 $KALLISTO_INDEX_DIR/unspliced_t2c.txt --filter bustools \
            SRR13806026_S1_R1_001.fastq \
                SRR13806026_S1_R2_001.fastq

velocity ref

$ ls kallisto-velocity-index
index.idx  spliced_t2c.txt  t2g.txt  unspliced_t2c.txt

Nonvelocity mode

KALLISTO_INDEX_DIR="foo/bar/kallisto-prebuilt-index" ### downloaded with kb ref -d human --workflow standard
kb count --h5ad -i $KALLISTO_INDEX_DIR/index.idx -g $KALLISTO_INDEX_DIR/t2g.txt -x 10xv2 --workflow standard --filter bustools -o kallisto_novelo_prebuilt SRR13806026_S1_R1_001.fastq SRR13806026_S1_R2_001.fastq

novelocity ref

$ ls kallisto-prebuilt-index
human_index_standard.tar.xz  index.idx  t2g.txt

Attempts

I calculated total UMI counts using the following lines. For this question please ignore the other kallisto runs. Although I used them trying to figure out whats going on, its not related to this question. Here you can see the total UMI difference between velocity and nonvelocity_prebuilt runs. One important note here is that I added ambigous + spliced and unspliced to velocity.

kallisto_novelocity_prebuilt=ad.read_h5ad("foo/bar/sample/kallisto_novelo_prebuilt/counts_filtered/adata.h5ad")
kallisto_novelocity_prebuilt.var_names_make_unique()

kallisto_velocity=ad.read_h5ad("foo/bar/sample/kallisto_velocity/counts_filtered/adata.h5ad")
kallisto_velocity.var_names_make_unique()

total_umi["kallisto_velocity"]= kallisto_velocity.layers["spliced"].sum().sum() + kallisto_velocity.layers["ambiguous"].sum().sum() + kallisto_velocity.layers["unspliced"].sum().sum()
total_umi["kallisto_novelocity_prebuilt"]= kallisto_novelocity_prebuilt.X.sum().sum()

image

In other comparison, you may see that even we add spliced+ambigous it still does not add up to velocity.

print(f' kalisto velocty spliced: {kallisto_velocity[list(common_barcodes),:].layers["spliced"].sum().sum()}')
print(f' kalisto velocty ambigous + spliced: {kallisto_velocity[list(common_barcodes),:].layers["spliced"].sum().sum() + kallisto_velocity[list(common_barcodes),:].layers["ambiguous"].sum().sum()}')
print(f' kalisto noveloctiy: {kallisto_novelocity_prebuilt[list(common_barcodes),:].X.sum().sum()}')

kalisto velocty spliced: 216012.0
kalisto velocty ambigous + spliced: 781301.0
kalisto noveloctiy: 905870.0

Questions

1) It is clearly the impact of index + workflow. But in terms of good practice could you please show me guidance. I would like to run kallisto only once and use spliced+unspliced counts for standard sc analysis calculation and their ratio for the velocity/trajectory analysis.

Thank you very much for your time and patience,

Best regards,

Tunc.

Yenaled commented 9 months ago

Hello! I just looked at the dataset -- for some reason, there seems to be strand-bias [causing mapping rate < 30%] under the default 10x settings (forward-stranded) so set --strand=unstranded.

As for your actual question: Why do mature+ambiguous gives fewer UMIs than just standardly mapping to the transcriptome (while discarding out-of-transcriptome reads)? This is because of UMI deduplication. In standard index, the same UMI sequences may simply map to a single gene, but in nac index, some of the same UMI sequences may map to multiple places (because there are more things in the index) -- and hence get filtered out. tl;dr it's because of UMI counting.

And yes, you want to sum ambiguous+mature.

mortunco commented 9 months ago

Dear Yanaled, I am gonna give --strand=unstranded a try. Thank you very much for the hasty reply. I really appreciate it.

NikTuzov commented 9 months ago

Hello Tunc:

I am using a similar velocity workflow and I was wondering if you could please answer my question here: https://github.com/pachterlab/kb_python/issues/228#issuecomment-1852895595

Yenaled commented 9 months ago

@mortunco I just released a new version of kb-python (version 0.28.1) -- it should cause less UMI differences between the standard index and the nac index.

Previously, if there was an exon/intron overlap of two different genes, the intron was being prioritized (which caused issues since a UMI could map to many introns). Now, the exon is being prioritized and, from what I tested, there is very high agreement between the standard index and the nac index.