Different runs upon the same input generated different results

kieranrcampbell / clonealign

Bayesian inference of clone-specific gene expression estimates by integrating single-cell RNA-seq and single-cell DNA-seq data

Apache License 2.0

Different runs upon the same input generated different results #3

Closed Puriney closed 5 years ago

Puriney commented 5 years ago

Hi Kieran, I noticed that despite the same input, different runs of clonealign generated different results. Setting the seed number does not help.

I was wondering what you think of this type of fluctuation.

I am not surprised that the same cell was assigned differently in different runs, partly because it is a probabilistic inference framework. In general, the assignments were consistent with each other, which is good. But would it be possible to make the results reproducible, the way tSNE output can be fixed by fixing the seed, for example?

library(SingleCellExperiment)
library(clonealign)
data(example_sce)
copy_number_data <- rowData(example_sce)[,c("A", "B", "C")]
set.seed(42)
cal <- clonealign(example_sce, copy_number_data)
print(cal)
table(cal$clone)

kieranrcampbell commented 5 years ago

Thanks for your message. This is super interesting as it was one of the reviewer comments, and we tested for it and found fairly stable inference between runs on real data (I think 8 / 1000 cells changed assignments).

set.seed(...) won't work as the underlying tensorflow seed needs to be set. I will implement this and get back to you!

Out of interest, are you seeing this on real data or on the example_sce included? In the latter case it's such a small noisy dataset I wouldn't be surprised. And roughly what proportion of cells do you see changing assignment?

Thanks

Kieran

Puriney commented 5 years ago

Looking forward to a set.seed()-like function to control the randomness at the code level. In case it is not possible to control it at the tensorflow layer, how about additionally outputting the probability of assignment per cell, e.g., A: 0.41, B: 0.39, C: 0.20?

My real 10X data involves ~1.3K cells. I ran clonealign 5 times with the same code and the same input, and the results varied by 2.7%. But this is data-specific, with its own level of noise, as you mentioned.
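
For anyone wanting to quantify this kind of run-to-run variation, here is a minimal base R sketch. It is one possible definition, not necessarily how the 2.7% above was computed: a cell counts as unstable if any run disagrees with the first run. The function name `assignment_disagreement` and the toy data are made up for illustration.

```r
# Fraction of cells whose clone assignment is not identical across all runs.
# `runs` is a list of character vectors, one per run, all in the same cell order.
# (Hypothetical helper; not part of clonealign.)
assignment_disagreement <- function(runs) {
  stopifnot(length(runs) >= 2)
  n_cells <- length(runs[[1]])
  stopifnot(all(lengths(runs) == n_cells))
  # One row per cell, one column per run
  m <- do.call(cbind, lapply(runs, as.character))
  # A cell is unstable if any run disagrees with the first run
  unstable <- rowSums(m != m[, 1]) > 0
  mean(unstable)
}

# Toy data (invented for illustration): 5 cells, 3 runs
runs <- list(
  c("A", "B", "C", "A", "B"),
  c("A", "B", "C", "A", "C"),
  c("A", "B", "C", "A", "B")
)
assignment_disagreement(runs)
```

With real fits, the input could be built from the `clone` element of each result, e.g. `lapply(fits, function(cal) cal$clone)`, assuming `fits` is a list of clonealign outputs.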

It is a coincidence 😀 I am just a PhD student doing a rotation. I hope the peer review goes well.

kieranrcampbell commented 5 years ago

You can now find a seed argument in the clonealign function call (fixed in 28dce1c7fae1b4597271cd49f1a33ff7237e7d09).

Note that setting the seed disables multithreading, as multithreading apparently introduces randomness (see https://tensorflow.rstudio.com/tensorflow/reference/use_session_with_seed.html). From testing the seed option, it appears to give consistent outputs.
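
A quick way to check that a seeded fit really is reproducible is to run it several times and compare the outputs. The sketch below uses a hypothetical helper, `runs_identical`, and a stand-in fitting function so that it runs without clonealign or tensorflow installed; the commented-out line shows how the same check might be applied to clonealign with the `seed` argument described above.

```r
# Call `fit()` several times and report whether every run returns the same result.
# (Hypothetical helper; not part of clonealign.)
runs_identical <- function(fit, n_runs = 3) {
  results <- lapply(seq_len(n_runs), function(i) fit())
  # Compare every later run against the first one
  all(vapply(results[-1], identical, logical(1), results[[1]]))
}

# Stand-in for a seeded model fit: the seed is reset before each draw
seeded_fit <- function() {
  set.seed(42)
  sample(c("A", "B", "C"), size = 20, replace = TRUE)
}
runs_identical(seeded_fit)

# With clonealign installed, the same check might look like this (not run here):
# runs_identical(function() clonealign(example_sce, copy_number_data, seed = 42)$clone)
```

Passing the fit as a function keeps the check independent of the model, so the same helper works for any stochastic procedure.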

You can get the output of the clone probabilities per cell by looking in cal$ml_params$clone_probs
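
As a rough illustration of how per-cell probabilities could be used to spot borderline assignments, the sketch below takes the most probable clone per cell and flags cells whose top probability falls under a threshold. It assumes the probabilities form a cells x clones matrix with clone names as column names, which is an assumption about the layout of `clone_probs`, not something confirmed in this thread. The helper name `summarise_clone_probs`, the 0.5 threshold, and the toy values are all illustrative.

```r
# Summarise a cells x clones probability matrix (assumed layout).
# (Hypothetical helper; not part of clonealign.)
summarise_clone_probs <- function(probs, min_prob = 0.5) {
  probs <- as.matrix(probs)
  if (is.null(colnames(probs))) {
    colnames(probs) <- paste0("clone_", seq_len(ncol(probs)))
  }
  # Index of the most probable clone for each cell
  best <- max.col(probs, ties.method = "first")
  # Probability of that clone, picked out with matrix indexing
  top_prob <- probs[cbind(seq_len(nrow(probs)), best)]
  data.frame(
    clone = colnames(probs)[best],
    prob = top_prob,
    confident = top_prob >= min_prob,
    stringsAsFactors = FALSE
  )
}

# Toy matrix (invented): the first cell is a near tie between A and B
probs <- matrix(
  c(0.41, 0.39, 0.20,
    0.02, 0.95, 0.03),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("cell_1", "cell_2"), c("A", "B", "C"))
)
summarise_clone_probs(probs)
```

Cells flagged as not confident are the ones most likely to switch clones between unseeded runs, so they are a natural place to look first when assignments fluctuate.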

And thanks, it's been accepted in Genome Biology and should appear in publication in the next month or so.

Let me know if you have any more questions and if this fixes your seed issues; if so I'll close the issue.

Puriney commented 5 years ago

Many thanks for the quick update.

It is great news.

My issue is resolved. Thanks again.