jaydu1 / scVAEIT

Variational autoencoder for single-cell integration and transfer learning.
MIT License

Questions about data preprocessing #11

Closed: pondandmoon closed this issue 1 week ago

pondandmoon commented 1 week ago

Hi @jaydu1 ,

I noticed that scVAEIT uses a negative binomial distribution to model the scRNA-seq and protein data. In that case, it would make sense to feed the raw RNA or ADT count matrix to the model. However, the data appear to be log-normalized before being passed in, and I would like to ask about the reasoning behind this. I look forward to your response.

jaydu1 commented 1 week ago

Hi, thanks for the great question.

As long as the neural network has sufficient approximation capacity, it should provide a good estimate of the conditional mean of each output gene/protein. So, with log1p-transformed data, it can still give a reasonable estimate of the conditional mean of the log1p counts (although the variance may not be meaningful if the NB model is misspecified). From an optimization point of view, the log1p transformation puts the inputs and outputs on a common scale (typically 0-10), which may benefit the training of the neural networks.
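To make the scale point concrete, here is a minimal sketch of the preprocessing being discussed, written with plain numpy; the random example matrix and the target library size of 1e4 are illustrative assumptions, not values taken from the scVAEIT tutorials.

```python
import numpy as np

# Hypothetical cells x genes raw count matrix, only for illustration.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(100, 2000)).astype(np.float32)

# Library-size adjustment: rescale each cell to a common total (here 1e4,
# an arbitrary but common choice), then apply log1p. After this step the
# values typically fall roughly in the 0-10 range mentioned above, which
# keeps the network inputs and outputs on a comparable scale.
lib_size = counts.sum(axis=1, keepdims=True)
log1p_counts = np.log1p(counts / lib_size * 1e4)
```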

I agree that using log1p-transformed data with an NB likelihood may not be strictly justified, as you suggest. But I expect the performance to be similar with the usual count data (after library-size adjustment), because, in the end, it is the conditional mean that matters.

Though we used log1p-transformed counts in our paper and tutorials, the current model also accepts data before the log1p transformation, i.e., counts of the kind commonly modeled by an NB distribution. I can give the count data a try and update the example later when I get a chance.
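For completeness, a sketch of the alternative route mentioned above: library-size-adjusted counts without the log1p step, so the values stay on the count scale that the NB likelihood models directly. Scaling to the median library size is an assumption made here for illustration, not a recommendation from the scVAEIT documentation.

```python
import numpy as np

# Same hypothetical counts matrix as in the previous sketch.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(100, 2000)).astype(np.float32)

# Library-size adjustment only: rescale each cell so totals are comparable
# across cells, but keep the values on the count scale (no log1p), which
# matches the NB model more directly.
lib_size = counts.sum(axis=1, keepdims=True)
adjusted_counts = counts / lib_size * np.median(lib_size)
```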

pondandmoon commented 1 week ago

Hi @jaydu1 ,

Thank you for your detailed explanation. I now know how to preprocess my data. Looking forward to more of your excellent work!