Confusion about the data preprocessing

sheetalgiri commented 3 years ago

As far as I understand, the preprocessing steps for SNARE-seq are:

ATAC-seq dataset -> cisTopic -> unit normalization
RNA-seq dataset -> unit normalization -> PCA (10 components)

Is that correct?

I know this is a general machine learning question, but what did you use to choose the number of components when doing PCA for a different dataset? Which tool/settings do you recommend?

pinardemetci commented 3 years ago

Hi!

For epigenomic datasets, such as scATAC-seq and scMethyl-seq, we use cisTopic when preprocessing our data. You can try a range of dimensions, and cisTopic has functionality to choose the one with the highest likelihood.

For PCA, we checked the percentage of variance explained by the chosen number of dimensions and made a decision based on that (e.g. >=75%).
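To illustrate the variance-explained check, here is a minimal sketch that picks the smallest number of principal components whose cumulative explained variance reaches a threshold. It is not taken from the SCOT code base: the function name `n_components_for_variance`, the toy data, and the use of a plain NumPy SVD are all assumptions made for illustration, and the 0.75 threshold simply mirrors the example figure above.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.75):
    """Return (k, cumulative), where k is the smallest number of principal
    components whose cumulative explained-variance ratio reaches `threshold`.
    Hypothetical helper, not part of SCOT."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # PCA works on mean-centered features
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, in descending order
    ratio = s**2 / np.sum(s**2)              # explained-variance ratio per component
    cumulative = np.cumsum(ratio)
    # First position where the cumulative ratio is >= threshold, as a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative

# Toy data (assumption): 200 "cells" x 50 "genes" with 5 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(200, 50))

k, cumulative = n_components_for_variance(X, threshold=0.75)
print(f"{k} components explain {cumulative[k - 1]:.1%} of the variance")
```

On real data it may be worth also looking at the full cumulative curve, since a fixed threshold is only one reasonable heuristic.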

For both, our process was as follows: data --> cisTopic/PCA --> unit normalization, so we applied unit normalization after dimensionality reduction for the datasets in our paper.
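The ordering described above (reduce first, normalize second) can be sketched as follows. This is a rough illustration only: it assumes "unit normalization" means scaling each cell's reduced feature vector to unit L2 norm, and the helper names `pca_reduce` and `unit_normalize` as well as the random input are hypothetical stand-ins, not the SCOT implementation. For ATAC-seq data, the cisTopic output would take the place of the PCA step.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project mean-centered data onto its top principal components
    (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def unit_normalize(Z, eps=1e-12):
    """Scale each row (cell) to unit L2 norm. Assumption: this is what
    'unit normalization' refers to in the thread."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, eps)        # eps guards against all-zero rows

# Toy expression matrix (assumption): 100 cells x 30 genes.
rng = np.random.default_rng(1)
rna = rng.normal(size=(100, 30))

# data -> PCA -> unit normalization, in that order.
rna_reduced = pca_reduce(rna, n_components=10)
rna_ready = unit_normalize(rna_reduced)
print(rna_ready.shape)
```

Normalizing after the reduction means every cell ends up on the unit sphere in the reduced space; normalizing before PCA would not guarantee that, since the projection changes the row norms.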

I hope this was helpful. Let me know if you have any questions :)

sheetalgiri commented 3 years ago

thanks, that makes it clear :)

rsinghlab / SCOT