greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Simulation of domain-specific data with shared signal #60

jjc2718 closed 1 year ago

jjc2718 commented 1 year ago

Following on from previous feature selection/domain adaptation PRs like #48 and #59, we're thinking about implementing some additional methods for domain adaptation and domain generalization. It makes sense to validate them first on simulated data that reasonably approximates different domains (e.g. different datasets or cancer types), so this PR lays the groundwork for those simulations.

To simulate the data, we're using the generative model described in this paper. The general idea is that there is a latent variable shared across domains, plus latent variables specific to each domain. These are multiplied by the label, and randomly sampled noise is added, to generate data points. The hope is that our models can extract the shared signal and use it to predict across domains.
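As a rough illustration of the idea above, here is a minimal sketch of that kind of simulation. It is not the paper's exact model or the code in this PR. It assumes labels in {-1, +1}, that the shared and domain-specific latent vectors are simply summed before being multiplied by the label, and that the noise is Gaussian. The function name `simulate_domains` and all of its parameters are hypothetical.

```python
import numpy as np


def simulate_domains(n_domains=3, n_per_domain=200, n_features=20,
                     shared_scale=1.0, domain_scale=1.0, noise_scale=1.0,
                     seed=0):
    """Simulate labeled data from several domains with a shared signal.

    Assumed model (an illustration, not the paper's exact formulation):
        x = y * (z_shared + z_domain[d]) + noise
    """
    rng = np.random.default_rng(seed)

    # One latent vector shared by every domain.
    z_shared = rng.normal(0.0, shared_scale, size=n_features)
    # One additional latent vector per domain.
    z_domain = rng.normal(0.0, domain_scale, size=(n_domains, n_features))

    # Domain membership and binary labels in {-1, +1} for each sample.
    domains = np.repeat(np.arange(n_domains), n_per_domain)
    y = rng.choice([-1, 1], size=domains.size)

    # Label-dependent signal plus independent Gaussian noise.
    signal = y[:, None] * (z_shared + z_domain[domains])
    X = signal + rng.normal(0.0, noise_scale, size=signal.shape)
    return X, y, domains


X, y, domains = simulate_domains()
print(X.shape, y.shape, domains.shape)
```

Under these assumptions, raising `domain_scale` relative to `shared_scale` makes the domains less alike, which should make cross-domain prediction harder.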

With standard (non-domain-aware) models, we generally see much better performance within simulated domains than across them, which is what we'd expect. Next we'll try adding some domain adaptation methods.
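For reference, a within-domain vs. cross-domain comparison like this can be sketched as follows. This is a hedged example, not the evaluation code from this PR. It swaps in a simple mean-difference linear classifier in place of whatever models the notebooks use, and it generates toy data under the same assumed model (labels in {-1, +1}, shared plus domain-specific latent vectors, Gaussian noise). The name `domain_accuracy_matrix` is hypothetical.

```python
import numpy as np


def domain_accuracy_matrix(X, y, domains, seed=0):
    """Return acc[i, j]: train on domain i, test on held-out samples of domain j.

    Uses a mean-difference linear classifier as a stand-in for a standard,
    non-domain-aware model. Labels are assumed to be in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    domain_ids = np.unique(domains)
    # Random ~50/50 train/test split, shared by all train/test domain pairs.
    train_mask = rng.random(len(y)) < 0.5

    acc = np.zeros((len(domain_ids), len(domain_ids)))
    for i, d_train in enumerate(domain_ids):
        tr = train_mask & (domains == d_train)
        X_tr, y_tr = X[tr], y[tr]
        # Weight vector: difference between the two class means.
        w = X_tr[y_tr == 1].mean(axis=0) - X_tr[y_tr == -1].mean(axis=0)
        for j, d_test in enumerate(domain_ids):
            te = ~train_mask & (domains == d_test)
            pred = np.where(X[te] @ w >= 0, 1, -1)
            acc[i, j] = (pred == y[te]).mean()
    return acc


# Toy data under the assumed generative model (illustrative only).
rng = np.random.default_rng(0)
n_domains, n_per_domain, n_features = 3, 200, 20
domains = np.repeat(np.arange(n_domains), n_per_domain)
y = rng.choice([-1, 1], size=domains.size)
z_shared = rng.normal(size=n_features)
z_domain = rng.normal(size=(n_domains, n_features))
X = y[:, None] * (z_shared + z_domain[domains]) + rng.normal(size=(domains.size, n_features))

acc = domain_accuracy_matrix(X, y, domains)
print(np.round(acc, 2))  # diagonal: within-domain, off-diagonal: cross-domain
```

The diagonal entries are within-domain accuracies on held-out samples. The off-diagonal entries show how well a model trained on one domain transfers to another.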


review-notebook-app[bot] commented 1 year ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

