satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0
206 stars 33 forks source link

Consistency across computers/environments #68

Open Baboon61 opened 3 years ago

Baboon61 commented 3 years ago

Hi, due to Covid19 situation I am working part time at the lab and at home. At the lab I am working under an HP workstation linux environment and at home on MSI with windows 10 and Rstudio.

As input to the SCTransform function I use the same RDS object, same parameters, same R version, same Seurat and SCTransform versions, but I get slightly different residual (mean and variance) results after a SCTransform which lead to different UMAPs shapes.

As I detailed it in this biostars post, my hint is that this unconsistancy comes from how each machine is handling digits. Do you have more information on how SCTransform is dealing with a long list of digits ? As I mentioned in my post I was not able to get into the C_ksmooth function to continue my tracking as the function calls a C feature I have no access to.

Thanks a lot !

ChristophH commented 3 years ago

This is the first time I hear about this. We can try to figure out what is going on.

From your last post on biostars it looks like the model_pars matrices are already different (this comes before ksmooth is called). It also looks like only the theta column is different, which suggests that MASS::theta.ml is giving different results. That function has as a default epsilon parameter eps = .Machine$double.eps^0.25 - perhaps that leads to a different number of iterations between machines and as a result you get slightly different theta estimates.

The only solutions I can think of right now require modifications to sctransform (hard-code eps or make it a user-exposed parameter)

Baboon61 commented 3 years ago

I cannot find the MASS::theta.mlfunction you are talking about in the get_model_pars function or before in the vst function

In that one maybe :

if (method == 'nb_theta_given') {
          theta_given_bin_worker <- theta_given_bin[indices]
          return(fit_nb_theta_given(umi = umi_bin_worker, model_str = model_str, data = data_step1, theta_given = theta_given_bin_worker))
        }
ruiheesi commented 1 year ago

Hi, I'm testing reproducibility across platforms too, and we also see this inconsistency across platforms.

I tested a transform on BIOWULF, FRCE, an AWS instance with Ubuntu 20.02, and on RStudio Workbench. Interestingly the results from BIOWULF and FRCE are identical, but different from RSW and AWS. RSW and AWS results have same TSNE plot, but different UMAP. We also observe some sort of 180 degree rotation of UMAP plot between cluster results and others.

I checked the ".Machine$double.eps^0.25" just now, and RSW and AWS both reports 0.0001220703, so this may not be the case for me.

Can you let me know if there are other potential causes? Thanks.