pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License

kb count - joint vs individual sample processing for RNA velocity analysis #244

Closed NikTuzov closed 3 months ago

NikTuzov commented 3 months ago

RNA velocity analysis consists of two steps:

1) Applying “kb count” to FASTQ files
2) Applying scvelo to the .h5ad file(s) obtained from 1).

Question: does one have to enforce consistency between 1) and 2)? That is, if “kb count” is run separately for each sample in 1), should each sample’s .h5ad file also be processed separately in 2), and vice versa?

In particular, in this La Manno example with two samples:

https://www.kallistobus.tools/tutorials/kb_velocity/python/kb_velocity/

“kb count” is run separately for each sample, but then the two .h5ad files are merged in scvelo and a single velocity visualization is produced. Perhaps they should have produced two separate visualizations instead.

In other words, when one applies kb count to a given sample, does the result depend on which other samples are (or are not) included in the same kb command? If the alignment or quantification algorithm depends, even in theory, on the presence of other samples, then one has to enforce consistency between 1) and 2).

Yenaled commented 3 months ago

If you have two completely different single-cell sequencing samples, then you should run kb count separately on them.

The reason is that some cells from sample 1 may have the same barcode as some cells from sample 2.

And, if the two samples are truly different, you should perform downstream analysis of them separately.

I don't remember that example too well -- but I think there were probably negligible batch effects, and that's why the samples were merged together in the downstream analysis. But they needed to be processed by kb count separately anyway because of the barcode issue.

NikTuzov commented 3 months ago

Thank you for replying, Delaney.

I looked into that example and found the following. There are about 400K barcodes in each of the two h5ad files generated by kb count. The barcodes are unique within each sample, but about 220K are found in both samples.
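For anyone who wants to repeat this kind of check, here is a minimal sketch of the overlap count. It uses short toy barcode lists rather than the real data; in practice the barcodes would presumably come from the `obs_names` of each h5ad file, and the helper names `is_unique` and `shared_barcodes` are my own, not part of kb_python or scvelo.

```python
def is_unique(barcodes):
    """True if no barcode occurs more than once within a sample."""
    return len(set(barcodes)) == len(barcodes)


def shared_barcodes(barcodes_a, barcodes_b):
    """Return the set of barcodes that occur in both samples."""
    return set(barcodes_a) & set(barcodes_b)


# Toy values standing in for the barcodes of each sample (hypothetical data).
sample1 = ["AAAC", "AAAG", "AACT"]
sample2 = ["AAAG", "AACT", "TTTG"]

# Barcodes are unique within each sample ...
unique_within = is_unique(sample1) and is_unique(sample2)

# ... but some of them are found in both samples.
overlap = shared_barcodes(sample1, sample2)
print(f"unique within samples: {unique_within}, shared: {len(overlap)}")
```

A shared barcode here only means the same barcode sequence appears in two independent samples; it does not mean the two observations are the same cell.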

After the tutorial loads the data into Python for scvelo, its first step is to merge the samples. It essentially creates a new cell id equal to "sample id + barcode", so these new ids are unique across samples. Therefore, in the final UMAP visualization there may be some points that are plotted separately, despite having identical original barcodes.
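The "sample id + barcode" idea can be sketched as follows. This is an illustration with toy barcodes, not the tutorial's actual code; the helper `make_cell_ids`, the `_` separator, and the sample ids `s1` and `s2` are assumptions of mine. As far as I know, `anndata.concat` can do something similar through its `keys` and `index_unique` arguments, but I have not checked which mechanism the tutorial uses.

```python
def make_cell_ids(sample_id, barcodes, sep="_"):
    """Prefix each barcode with its sample id so ids are unique across samples."""
    return [f"{sample_id}{sep}{barcode}" for barcode in barcodes]


# Toy barcodes (hypothetical); "AAAG" and "AACT" occur in both samples.
sample1 = ["AAAC", "AAAG", "AACT"]
sample2 = ["AAAG", "AACT", "TTTG"]

# Merge the two samples under the new, sample-aware cell ids.
merged_ids = make_cell_ids("s1", sample1) + make_cell_ids("s2", sample2)

# Every cell keeps its own entry, even when the raw barcode is shared.
assert len(set(merged_ids)) == len(merged_ids)
print(merged_ids)
```

With ids built this way, a barcode found in both samples yields two distinct cells after merging, which is why they can show up as separate points in the UMAP.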