greenelab / mpmp

Multimodal Pan-cancer Mutation Prediction

Batch correction experiments #71

Closed · jjc2718 closed this 2 years ago

jjc2718 commented 2 years ago

This PR does a few things (sorry):

Results were pretty surprising when correcting out the mutation status labels:

[figure: model performance after correcting out the mutation status labels, linear vs. non-linear models]

So the linear models drop to essentially random performance after batch correction (as expected), but the non-linear models perform much better than before batch correction, achieving essentially perfect classification most of the time.

My guess is that this is some sort of bug - still thinking about how to investigate it, open to ideas if you have any.

jjc2718 commented 2 years ago

> Are you batch effect correcting for one label at a time, or all mutation labels? If it's all labels, I could see there being residual signal left over after BE correction, because the expression itself might be indicative of which dataset the sample came from?

Just one label at a time (e.g. if I'm predicting mutation status for TP53, I batch correct using the binary TP53 mutated/not mutated labels).
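For concreteness, here's a minimal numpy sketch of the kind of per-label linear correction I mean (illustrative only, not the actual mpmp code; it regresses the binary label out of each expression feature and keeps the residuals):

```python
import numpy as np

def regress_out_label(X, y):
    """Remove a binary label's linear effect from each feature (column) of X.

    X: (n_samples, n_features) expression matrix
    y: (n_samples,) binary mutation status labels (0/1)
    """
    # design matrix: intercept + label
    design = np.column_stack([np.ones(len(y)), y.astype(float)])
    # per-feature least-squares coefficients, all features at once: (2, n_features)
    coefs, *_ = np.linalg.lstsq(design, X, rcond=None)
    # residuals = expression with the label's linear signal projected out
    return X - design @ coefs

# e.g. for TP53: X_corrected = regress_out_label(X, tp53_labels)
```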

> Have you tried running UMAP on the corrected data to see if anything falls out?

No, but that's a good idea. I may try that as part of my next PR.
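If I do try it, the check would be something like the sketch below (using the umap-learn package; `X_corrected` and `y` are the corrected matrix and binary labels from above). If the non-linear models are exploiting leaked signal, the two classes might separate visibly in the embedding:

```python
import umap  # umap-learn
import matplotlib.pyplot as plt

# 2D embedding of the corrected expression data
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X_corrected)

# color points by mutation status to look for residual label structure
plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=5, cmap='coolwarm')
plt.title('UMAP of corrected expression, colored by mutation status')
plt.show()
```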

> I don't think the linear models themselves are random after correction, are they? I don't know what your baseline AUPR is, but your means are all the way from .8 to .1. That may be a clue.

I also looked at the difference in AUPR compared with a model where the labels are shuffled:

[figure: difference in AUPR between batch-corrected models and shuffled-label baselines]

(green is the linear model after batch correction). All of them seem to be around 0, in many cases less than 0, so I would say "basically random or worse than random" is fairly accurate.
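For reference, here's roughly what the shuffled-label comparison is doing (a simplified scikit-learn sketch; the real pipeline has cross-validation and more careful model selection):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(42)

def test_aupr(X_train, y_train, X_test, y_test):
    """Fit a logistic regression and return AUPR on the held-out set."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return average_precision_score(y_test, model.predict_proba(X_test)[:, 1])

# difference between the real model and a shuffled-label baseline;
# values near (or below) zero mean random-or-worse performance
true_aupr = test_aupr(X_train, y_train, X_test, y_test)
shuffled_aupr = test_aupr(X_train, rng.permutation(y_train), X_test, y_test)
delta_aupr = true_aupr - shuffled_aupr
```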

This is actually a bit similar to your result where the linear models had balanced accuracy < 0.5 after batch correction (if I'm remembering correctly).

> Other Hypotheses:
>
> 1. Program has a bug (everything looks alright to me implementation-wise, but given that your BE correction implementation uses mine as a starting point, I may be missing a shared bug)

When I met with Casey yesterday, we were talking about doing batch correction and then splitting into train/test sets (which is what we're both currently doing, as far as I can tell from your code) vs. doing batch correction on the train and test sets separately. He said you were thinking about switching to batch correcting train and test separately - does that sound right?

My plan for my next PR is to try something like this (either fitting two separate correction models, or batch correcting the training data and then using the learned linear regression coefficients to batch correct the test data); a sketch of the second option is below. Interested to see if that affects your results as well (if that is something you're working on).
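Something like this for the second option, reusing the per-label regression idea from the earlier sketch (again, illustrative names only; fitting happens on the training split, so the test set never influences the correction):

```python
import numpy as np

def fit_correction(X_train, y_train):
    """Fit per-feature OLS coefficients for the label on training data only."""
    design = np.column_stack([np.ones(len(y_train)), y_train.astype(float)])
    coefs, *_ = np.linalg.lstsq(design, X_train, rcond=None)
    return coefs  # (2, n_features)

def apply_correction(X, y, coefs):
    """Remove the label effect using coefficients learned on the training set."""
    design = np.column_stack([np.ones(len(y)), y.astype(float)])
    return X - design @ coefs

# fit on train only, then correct both splits with the same coefficients
coefs = fit_correction(X_train, y_train)
X_train_corrected = apply_correction(X_train, y_train, coefs)
X_test_corrected = apply_correction(X_test, y_test, coefs)
```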

> 2. All technical artifacts are linear, all important biology is nonlinear (but if there were perfect nonlinear signal, wouldn't the nonlinear models just use it in the first place?)
> 3. There is something nonlinear and correlated with the labels extremely strongly
> 4. Nonlinear correction artifacts as mentioned before, but in a binary case

Yeah, I think one of these would be my guess if there isn't a bug. I could maybe buy a bit of improvement with the nonlinear models, but the fact that the prediction is perfect or almost perfect in almost every case makes me think there's some sort of data leakage going on. Hopefully the next PR will shed some light - fingers crossed.