Pathway separation

Closes #40

Here, I'm adding a notebook (32-explore_pathway_separation.Rmd) that looks into the notion of "pathway separation." The logic in CheckPathwaySeparation is the part where another set of :eyes: would be most appreciated/helpful!

I define pathway separation as follows: given two sets of related gene sets (e.g., all the monocyte/macrophage gene sets would form one set of related gene sets) that are somewhat similar to each other (e.g., the neutrophil set and the monocyte/macrophage set), does the model: 1) have, for each set, at least one latent variable significantly associated with at least one pathway in that set (i.e., both sets are captured)? 2) have each set uniquely or separately represented (i.e., for each of the two sets, there exists at least one latent variable that is significantly associated with that set but not with the other)?

To be a bit more concrete: a model has at least one neutrophil-associated LV that is not associated with any of the monocyte/macrophage gene sets and at least one monocyte/macrophage-associated LV that is not associated with any of the neutrophil gene sets.
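
To make the definition above easier to review, here is a minimal sketch of the check in Python. This is not the R code in CheckPathwaySeparation; the function name `check_pathway_separation`, the input layout (a mapping from each LV to the pathways it is significantly associated with), and the gene set names are all hypothetical and only meant to illustrate the logic.

```python
from typing import Dict, Iterable, Set


def check_pathway_separation(
    lv_pathways: Dict[str, Set[str]],
    set_a: Iterable[str],
    set_b: Iterable[str],
) -> bool:
    """Return True if the two sets of gene sets are separated.

    lv_pathways maps each latent variable name to the pathways it is
    significantly associated with (assumed to be computed elsewhere).
    """
    set_a, set_b = set(set_a), set(set_b)
    a_only = False  # some LV is associated with set A but not set B
    b_only = False  # some LV is associated with set B but not set A
    for pathways in lv_pathways.values():
        hits_a = bool(pathways & set_a)
        hits_b = bool(pathways & set_b)
        if hits_a and not hits_b:
            a_only = True
        if hits_b and not hits_a:
            b_only = True
    # An "only" LV for each set also means both sets are captured,
    # so condition 1 follows from condition 2 here.
    return a_only and b_only


# Hypothetical gene set names, for illustration only.
neutrophil_sets = {"neutrophil_1", "neutrophil_2"}
mono_macro_sets = {"monocyte_1", "macrophage_1"}

separated_model = {
    "LV1": {"neutrophil_1"},
    "LV2": {"monocyte_1", "macrophage_1"},
    "LV3": {"neutrophil_2", "monocyte_1"},  # mixed LVs are allowed
}
mixed_model = {
    "LV1": {"neutrophil_1", "monocyte_1"},  # only a mixed LV exists
}
```

With these toy inputs, `check_pathway_separation(separated_model, neutrophil_sets, mono_macro_sets)` is true, while the same call on `mixed_model` is false because no LV is specific to either set.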

I'm checking for the separation of three sets of pathway pairs:

IFN - Type I and Type II interferon signaling pathways
MYELOID - neutrophil and monocyte/macrophage
PROLIFERATION - the G1 and G2 phases of the cell cycle (this is the one that I'd expect to be pretty difficult)

Figure, edited from the "raw" PDFs in this PR:

pathway_separation_better_contrast

Summary:

For random subsampling, the larger the sample size, the better the pathway separation.
The biological conditions included in the training set matter! I would not expect the cell line training set to perform particularly well, as cell line samples are unlikely to be mixtures of immune cells (generally speaking), and outside of a handful of experiments we would not expect the IFN signalling pathways to be probed.

greenelab / multi-plier

Pathway separation #52