pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
141 stars 24 forks source link

Kb-python version differences (0.26.4 vs 0.27.2) resulting in different count numbers #170

Closed AlinaKurjan closed 2 years ago

AlinaKurjan commented 2 years ago

Hi, I am getting different count outputs (overall, spliced and unspliced) depending on the versions of kb python used. For example if counts were calculated with the following command:

kb count -i geneset.dir/index.idx -g geneset.dir/t2g.txt -x 10xv3 --h5ad -o scflow/DEV15983 --workflow nucleus -t 30 -m 30G -c1 geneset.dir/cdna_t2c.txt -c2 geneset.dir/intron_t2c.txt ACH-MB-H_R1_001.fastq.gz ACH-MB-H_R2_001.fastq.gz --verbose

the output files in an older environment with kb python 0.26.4 would look like this:

h5ad: AnnData object with n_obs × n_vars = 285094 × 61552
    var: 'gene_name'
spliced (object loaded as anndata):  AnnData object with n_obs × n_vars = 888690 × 61552
unspliced (object loaded as anndata): AnnData object with n_obs × n_vars = 601602 × 61552

while the output files in an environment with kb python 0.27.2 would look like this:

h5ad: AnnData object with n_obs × n_vars = 496157 × 61552
    var: 'gene_name'
spliced (object loaded as anndata): AnnData object with n_obs × n_vars = 890347 × 61552
unspliced (object loaded as anndata): AnnData object with n_obs × n_vars = 1265619 × 61552

The reference files used as input for the counts function were the same for both versions. The output files of kb ref are the same when run with either version of kb-python.

Python version is 3.8.13

Please advise 1) what could be causing this difference in alignment, and 2) why the spliced/unspliced counts are not loaded into the output h5ad file as layers with the nucleus workflow? What would be the best way to add them in as layers manually?

Thank you for your time.

Yenaled commented 2 years ago

Can you post the number of reads aligned (or ideally the full output) in both cases? One update made to kallisto between the two versions is that strandedness is now taken into account when mapping 10X v3 reads (since the reads are stranded). You can remove strand-aware mapping via the --strand unstranded option in kb count.

AlinaKurjan commented 2 years ago

I'll check the output and get back here with details.

Could you also please advise on the second point - why the spliced/unspliced counts are not loaded into the output h5ad file as layers with the nucleus workflow? What would be the best way to add them in as layers manually?

AlinaKurjan commented 2 years ago

Hi @Yenaled, thank you for your explanation earlier. Using --strand unstranded option indeed reproduces the results that are seen with 0.26.4. It is interesting because we would expect to see more unspliced counts in our data by default since it is single-nucleus data, however taking the strandedness into account leads to twice as few unspliced reads being detected. Would you advise using unstranded alignment in this case?

Yenaled commented 2 years ago

I'd probably still recommend running it in the default stranded alignment mode because 10X reads are stranded so that's the more correct way to map the reads. It is possible that you see more unspliced reads in stranded mode because some reads that couldn't be mapped before can now be mapped. Although stranded mapping is more strict, it is possible to get more reads with that more strict mapping due to how pseudoalignment works (finding compatible transcripts from k-mers and taking the intersection of them).

In any case, I don't expect your downstream analysis to be that drastically different either way (if it is, then I'd definitely like to know).

AlinaKurjan commented 2 years ago

It is more strict, I get twice as many reads when strandedness is not taken into account. So this makes sense, thank you.