Open ajlee21 opened 3 years ago
Some analyses were performed here: https://github.com/greenelab/generic-expression-patterns/tree/master/explore_RNAseq_only_generic_genes
It looks like the VAE is artificially boosting lowly expressed genes in RNA-seq data, which allows them to be detected as DE. We think this VAE boosting is less visible in array data because array data has lower variance than RNA-seq data. Further tests would need to be performed to examine the effect of data type (array vs. RNA-seq).
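One way to check the boosting hypothesis is to compare per-gene mean expression in the real data against the VAE-simulated data, split by how highly each gene is expressed in the real data. The sketch below is only an illustration: the function name `boost_by_expression`, the `low_cutoff` threshold, and the toy values are all made up here and are not taken from the repository.

```python
from statistics import mean, median


def boost_by_expression(real, simulated, low_cutoff):
    """Summarize the simulated-minus-real shift in mean expression.

    real, simulated: dict mapping gene name -> list of expression values.
    low_cutoff: hypothetical threshold on the real mean that defines
        "lowly expressed" (an assumption, pick one that suits the data).
    Returns the median shift for lowly expressed genes and for all others.
    """
    low, other = [], []
    for gene in real.keys() & simulated.keys():
        real_mean = mean(real[gene])
        shift = mean(simulated[gene]) - real_mean
        (low if real_mean < low_cutoff else other).append(shift)
    return {
        "low": median(low) if low else None,
        "other": median(other) if other else None,
    }


# Toy values, invented for illustration only.
real = {"g1": [1, 2, 3], "g2": [0, 1, 2], "g3": [100, 110, 120], "g4": [200, 210, 190]}
simulated = {"g1": [6, 7, 8], "g2": [4, 5, 6], "g3": [101, 109, 120], "g4": [199, 211, 190]}

result = boost_by_expression(real, simulated, low_cutoff=10)
print(result)
```

A clearly positive shift for the low group alongside a shift near zero for the rest would be consistent with the boosting described above. Running the same summary on an array compendium would give a direct comparison between data types.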
When we compared the gene percentiles generated by SOPHIE against those from the manually curated dataset, we noticed a group of genes that SOPHIE identified as generic but that were not generic in the manually curated dataset. In this case, SOPHIE was trained on the recount2 (RNA-seq) dataset, while the manually curated dataset was generated on an array platform.
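The comparison described above can be sketched roughly as follows: compute a rank correlation between the two sets of percentiles, then list the genes that rank high in SOPHIE but low in the curated set. This is a minimal sketch, not the code used in the repository. The function names, the `high` and `low` percentile cutoffs, and the toy percentiles are all assumptions made for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def discordant_genes(sophie_pct, curated_pct, high=80.0, low=20.0):
    """Genes with a high SOPHIE percentile but a low curated percentile.

    The cutoffs are hypothetical defaults, not values from the analysis.
    """
    shared = sorted(set(sophie_pct) & set(curated_pct))
    return [g for g in shared if sophie_pct[g] >= high and curated_pct[g] <= low]


# Toy percentiles, invented for illustration only.
sophie_pct = {"geneA": 95.0, "geneB": 90.0, "geneC": 10.0, "geneD": 50.0}
curated_pct = {"geneA": 92.0, "geneB": 5.0, "geneC": 15.0, "geneD": 55.0}

shared = sorted(set(sophie_pct) & set(curated_pct))
rho = spearman([sophie_pct[g] for g in shared], [curated_pct[g] for g in shared])
flagged = discordant_genes(sophie_pct, curated_pct)
print(rho, flagged)
```

The flagged list is the group of interest here: genes called generic by the RNA-seq-trained model but not by the array-based reference.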
See https://github.com/greenelab/generic-expression-patterns/pull/75 for details.
Why is this compression not seen in the array data?
Possible solutions to consider: