Simulation of domain-specific data with shared signal

Following on from previous feature selection/domain adaptation PRs like #48 and #59, we're planning to implement some additional methods for domain adaptation and domain generalization. It makes sense to validate them first on simulated data that reasonably approximates different domains (e.g. different datasets or cancer types), so this PR lays the groundwork for those simulations.

To simulate the data, we're using the generative model described in this paper. The general idea is that there is a latent variable that's shared across domains, plus latent variables that are specific to each domain. These are multiplied by the label, and randomly sampled noise is added, to generate data points. The hope is that our models can extract the shared signal and use it to predict across domains.
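As a rough illustration of that idea, here is a minimal sketch of one way such a simulation could look. It is not the code in this PR and not necessarily the exact model from the paper: the function name `simulate_domains`, its parameters, the Gaussian latents, the additive combination of shared and domain-specific latents, and the {-1, +1} label coding are all assumptions made for the example.

```python
import random


def simulate_domains(n_domains=3, n_per_domain=100, n_features=20,
                     noise_scale=1.0, domain_scale=1.0, seed=0):
    """Hypothetical sketch: simulate labeled samples from several domains.

    Assumed form for each sample: x = y * (shared + specific_d) + noise,
    where y is a label in {-1, +1}.
    """
    rng = random.Random(seed)

    # Latent variable shared by every domain (assumed standard normal).
    shared = [rng.gauss(0, 1) for _ in range(n_features)]

    data = []
    for domain in range(n_domains):
        # Latent variable specific to this domain.
        specific = [rng.gauss(0, domain_scale) for _ in range(n_features)]
        for _ in range(n_per_domain):
            y = rng.choice([-1, 1])
            # Label times (shared + domain-specific latent), plus noise.
            x = [y * (s + z) + rng.gauss(0, noise_scale)
                 for s, z in zip(shared, specific)]
            data.append((domain, y, x))
    return shared, data


if __name__ == "__main__":
    shared, data = simulate_domains(n_domains=2, n_per_domain=5, n_features=4)
    print(len(data), "samples,", len(shared), "features")
```

In this sketch, `domain_scale` controls how strongly the domain-specific signal competes with the shared one, which is the knob that would make cross-domain prediction easier or harder.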

Generally, standard (non-domain-aware) models perform much better within simulated domains than across them, which is what we'd expect. Next we'll try adding some domain adaptation methods.
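The within-domain versus cross-domain comparison could be set up roughly as follows. This is a self-contained toy sketch, not the evaluation code or the models used in this PR: the helper names, the label-times-latent data form, and the simple mean-direction classifier standing in for a "standard" model are all assumptions for illustration, and no particular accuracy values are implied.

```python
import random


def make_domain(shared, rng, n_samples, noise_scale=1.0):
    """Hypothetical helper: one domain's samples as (label, features) pairs."""
    # Domain-specific latent, drawn once per domain.
    specific = [rng.gauss(0, 1) for _ in shared]
    samples = []
    for _ in range(n_samples):
        y = rng.choice([-1, 1])
        x = [y * (s + z) + rng.gauss(0, noise_scale)
             for s, z in zip(shared, specific)]
        samples.append((y, x))
    return samples


def fit_mean_direction(samples):
    """Domain-unaware baseline: average of y * x over the training samples."""
    n_features = len(samples[0][1])
    return [sum(y * x[j] for y, x in samples) / len(samples)
            for j in range(n_features)]


def accuracy(weights, samples):
    """Fraction of samples whose label matches the sign of weights . x."""
    correct = 0
    for y, x in samples:
        score = sum(w * v for w, v in zip(weights, x))
        pred = 1 if score >= 0 else -1
        correct += pred == y
    return correct / len(samples)


def within_vs_across(n_features=20, n_samples=200, seed=0):
    """Train on half of domain A; test on the rest of A and on all of B."""
    rng = random.Random(seed)
    shared = [rng.gauss(0, 1) for _ in range(n_features)]
    domain_a = make_domain(shared, rng, n_samples)
    domain_b = make_domain(shared, rng, n_samples)
    half = n_samples // 2
    weights = fit_mean_direction(domain_a[:half])
    return accuracy(weights, domain_a[half:]), accuracy(weights, domain_b)


if __name__ == "__main__":
    within, across = within_vs_across()
    print(f"within-domain accuracy: {within:.2f}")
    print(f"cross-domain accuracy:  {across:.2f}")
```

Because the baseline learns the sum of the shared and domain-specific signal from its training domain, how well it transfers depends on how the other domain's specific latent happens to align with it.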

greenelab/pancancer-evaluation#60