morris-lab / CellOracle

This is the alpha version of the CellOracle package

Fixed inconsistent results of `oracle.get_links` across runs #196

Open PauBadiaM opened 6 months ago

PauBadiaM commented 6 months ago

Hi @KenjiKamimoto-wustl122

I have observed that the method oracle.get_links returns different results across runs (celloracle == 0.18.0). The differences are not huge (mean Jaccard index of 0.9 between runs), but the results should be exactly reproducible when a fixed seed is used.

Even though you correctly use BaggingRegressor with a fixed seed, the problem comes upstream, because the TF gene symbols in oracle.TFdict are stored in sets. The iteration order of a set of strings depends on the string hashes, which Python randomizes per process by default, so the order can be slightly different at each run. The TFs then reach BaggingRegressor in a different order, which makes it sample differently even though it uses the same seed every time. The solution is simple: fix the order of the selected TFs by sorting them alphabetically:

# Sort so the TF order is deterministic across runs
reg_all = sorted(reg_all)

With this simple change results are always the same.
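
To illustrate why sorting helps, here is a minimal sketch that uses only the standard library. It is not CellOracle or scikit-learn code: `sample_features` is a hypothetical stand-in for any seeded sampler that picks items by position, and the gene symbols are example values. The same seed picks the same positions, so a different input order gives different picks, while sorted input gives identical picks.

```python
import random


def sample_features(features, k, seed=0):
    """Hypothetical stand-in for a seeded sampler that picks k items by position."""
    rng = random.Random(seed)
    return rng.sample(list(features), k)


# Same TFs, but in two different orders, as could happen when a set is
# iterated in two separate Python processes (example values only).
tfs_run_a = ["Gata1", "Spi1", "Cebpa", "Klf1", "Runx1"]
tfs_run_b = ["Runx1", "Klf1", "Gata1", "Cebpa", "Spi1"]

# Without sorting: same seed, same positions, different items.
raw_a = sample_features(tfs_run_a, 3)
raw_b = sample_features(tfs_run_b, 3)

# With sorting: the input order is fixed, so the picks are identical.
fixed_a = sample_features(sorted(tfs_run_a), 3)
fixed_b = sample_features(sorted(tfs_run_b), 3)

print("unsorted picks equal:", raw_a == raw_b)
print("sorted picks equal:  ", fixed_a == fixed_b)
```

The two example orders share no element at any position, so the unsorted picks necessarily differ here. With real set orderings the picks may or may not differ between any two runs.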

Note that to see different results with the previous version you need to restart the kernel or run the script again, so that a new Python process starts with a new hash seed. Re-running the same code within one JupyterLab session yields the same results; restarting the notebook does not. Hope this is helpful!
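
The restart behaviour can be checked outside a notebook with a small sketch like the one below, which is independent of CellOracle. It launches fresh interpreters with an explicit `PYTHONHASHSEED` and prints the iteration order of a set of strings. The helper `run_fresh` and the gene symbols are hypothetical and only for illustration. The same hash seed always gives the same order, different seeds may give different orders, and sorted output does not depend on the seed.

```python
import os
import subprocess
import sys

# Code executed in each fresh interpreter (gene symbols are example values).
SET_SNIPPET = "print(list({'Gata1', 'Spi1', 'Cebpa', 'Klf1', 'Runx1', 'Tal1'}))"
SORTED_SNIPPET = "print(sorted({'Gata1', 'Spi1', 'Cebpa', 'Klf1', 'Runx1', 'Tal1'}))"


def run_fresh(code, hash_seed):
    """Run `code` in a new Python process with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Set order under two different hash seeds (these may differ).
    print("seed 1:", run_fresh(SET_SNIPPET, 1))
    print("seed 2:", run_fresh(SET_SNIPPET, 2))
    # Sorted order is independent of the hash seed.
    print("sorted:", run_fresh(SORTED_SNIPPET, 1))
```

Setting `PYTHONHASHSEED` in the environment is another way to pin set order for a whole run, but sorting in the code as proposed here does not depend on how the user launches Python.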