thelovelab / tximport

Transcript quantification import for modular pipelines

tximport of kallisto h5s generated with multiple threads is not deterministic #58

Closed. ning-y closed this issue 5 months ago

ning-y commented 5 months ago

I quantified bulk RNA-seq data with kallisto quant, using default arguments (so no bootstraps) and ten threads.

kallisto quant --threads 10 --index {input.idx} --output-dir {output.kout} {input.fqs}

I then imported the h5 files via tximport.

tximport("abundance.h5", tx2gene=tx2gene, type="kallisto", countsFromAbundance="scaledTPM")

I save the tximport results to a TSV. If I repeat the whole process with no changes to produce a second TSV, the two files differ, even after sorting their lines.

$ md5sum scounts.tsv.gz old.scounts.tsv.gz
35c170054847cd6af28d35262022fc85  scounts.tsv.gz
1c5b5477e03c64f298363a48baecacdb  old.scounts.tsv.gz
$ zcat scounts.tsv.gz | sort | md5sum
c661db690d5efd537745312044f334df  -
$ zcat old.scounts.tsv.gz | sort | md5sum
4eb4196d95072574eb8ab5c6fb04106e  -
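
A checksum only shows that the two files differ, not by how much. As a rough sketch, the comparison could instead report the largest absolute difference between matching cells. The helper name `max_abs_diff` is hypothetical, and the sketch assumes each TSV has a header line, row identifiers in the first column, and the same rows and columns in both files; none of this is confirmed by the thread.

```shell
# Hypothetical helper (not part of tximport or kallisto): print the largest
# absolute difference between matching numeric cells of two gzipped TSVs.
# Assumes a header line, row identifiers in column 1, and the same set of
# rows and columns in both files.
max_abs_diff() {
  tab=$(printf '\t')
  dir=$(mktemp -d) || return 1
  # Drop the header and sort by identifier so row order does not matter.
  gzip -dc "$1" | tail -n +2 | LC_ALL=C sort -t "$tab" -k1,1 > "$dir/a"
  gzip -dc "$2" | tail -n +2 | LC_ALL=C sort -t "$tab" -k1,1 > "$dir/b"
  # paste puts the two files side by side; awk compares the two halves.
  paste "$dir/a" "$dir/b" | awk -F'\t' '
    {
      half = NF / 2
      if ($1 != $(half + 1)) {
        print "row mismatch: " $1 > "/dev/stderr"
        bad = 1
        exit
      }
      for (i = 2; i <= half; i++) {
        d = $i - $(i + half)
        if (d < 0) d = -d
        if (d > max) max = d
      }
    }
    END { if (bad) exit 2; printf "%.6g\n", max + 0 }'
  status=$?
  rm -rf "$dir"
  return "$status"
}
```

It could be called as `max_abs_diff scounts.tsv.gz old.scounts.tsv.gz`. The printed value shows whether the two runs disagree in the last few digits or by a substantive amount.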

If I set kallisto to single-threaded execution, making no other changes, I get a deterministic result: the same TSV every run.

I am reporting this issue here rather than with kallisto because the kallisto authors have already addressed threading and determinism here: https://github.com/pachterlab/kallisto/issues/236#issuecomment-565616059. They say that multi-threaded kallisto output only looks different because of randomness in which threads finish first.

mikelove commented 5 months ago

So I don't think this is an issue. They are saying that the column ordering of the bootstrap matrices is not deterministic. This doesn't affect downstream tools which look at marginal stats like variance, or even covariance, which don't care about bootstrap ordering.
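
The point about ordering can be shown with a toy example. The numbers below are made up and are not kallisto output, and `sample_variance` is a hypothetical helper written for this illustration. It computes the sample variance of its arguments, which comes out the same however the values are ordered.

```shell
# Toy illustration (made-up numbers, not kallisto output): the sample
# variance of a set of values does not depend on the order they arrive in.
sample_variance() {
  # Arguments are the values; two passes: mean first, then squared deviations.
  printf '%s\n' "$@" | awk '
    { x[NR] = $1; sum += $1 }
    END {
      mean = sum / NR
      for (i = 1; i <= NR; i++) ss += (x[i] - mean) ^ 2
      printf "%.6g\n", ss / (NR - 1)
    }'
}

sample_variance 2 4 4 4 5 5 7 9   # one ordering of the values
sample_variance 9 5 4 7 2 4 5 4   # the same values, shuffled
```

Both calls print the same value. With real floating-point data the two results could still differ in the last digits, because the order of summation changes.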