greenelab / mpmp

Multimodal Pan-cancer Mutation Prediction

Batch correction experiments #71

Closed · jjc2718 closed this 2 years ago

jjc2718 commented 2 years ago

This PR does a few things (sorry):

Results were pretty surprising when correcting out the mutation status labels:

[figure: model performance after correcting out the mutation status labels, linear vs. non-linear models]

So the linear models drop to essentially random performance after batch correction (as expected), but the non-linear models perform much better than before batch correction, achieving essentially perfect classification most of the time.

My guess is that this is some sort of bug - still thinking about how to investigate it, open to ideas if you have any.

jjc2718 commented 2 years ago

> Are you batch effect correcting for one label at a time, or all mutation labels? If it's all labels, I could see there being residual signal left over after BE correction, because the expression itself might be indicative of which dataset the sample came from?

Just one label at a time (e.g. if I'm predicting mutation status for TP53, I batch correct using the binary TP53 mutated/not mutated labels).
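For concreteness, here's a minimal numpy sketch of the kind of per-label linear correction I mean (illustrative only, not the actual mpmp code; it regresses the binary label out of each expression feature and keeps the residuals):

```python
import numpy as np

def regress_out_label(X, y):
    """Remove a binary label's linear effect from each feature (column) of X.

    X: (n_samples, n_features) expression matrix
    y: (n_samples,) binary mutation status labels (0/1)
    """
    # design matrix: intercept + label
    design = np.column_stack([np.ones(len(y)), y.astype(float)])
    # per-feature least-squares coefficients, all features at once: (2, n_features)
    coefs, *_ = np.linalg.lstsq(design, X, rcond=None)
    # residuals = expression with the label's linear signal projected out
    return X - design @ coefs

# e.g. for TP53: X_corrected = regress_out_label(X, tp53_labels)
```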

> Have you tried running UMAP on the corrected data to see if anything falls out?

No, but that's a good idea. I may try that as part of my next PR.
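If I do try it, the check would be something like the sketch below (using the umap-learn package; `X_corrected` and `y` are the corrected matrix and binary labels from above). If the non-linear models are exploiting leaked signal, the two classes might separate visibly in the embedding:

```python
import umap  # umap-learn
import matplotlib.pyplot as plt

# 2D embedding of the corrected expression data
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X_corrected)

# color points by mutation status to look for residual label structure
plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=5, cmap='coolwarm')
plt.title('UMAP of corrected expression, colored by mutation status')
plt.show()
```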

> I don't think the linear models themselves are random after correction, are they? I don't know what your baseline AUPR is, but your means are all the way from .8 to .1. That may be a clue.

I also looked at the difference in AUPR compared with a model where the labels are shuffled:

[figure: difference in AUPR between batch-corrected models and shuffled-label baselines]

(green is the linear model after batch correction). All of them seem to be around 0, in many cases less than 0, so I would say "basically random or worse than random" is fairly accurate.
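For reference, here's roughly what the shuffled-label comparison is doing (a simplified scikit-learn sketch; the real pipeline has cross-validation and more careful model selection):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(42)

def test_aupr(X_train, y_train, X_test, y_test):
    """Fit a logistic regression and return AUPR on the held-out set."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return average_precision_score(y_test, model.predict_proba(X_test)[:, 1])

# difference between the real model and a shuffled-label baseline;
# values near (or below) zero mean random-or-worse performance
true_aupr = test_aupr(X_train, y_train, X_test, y_test)
shuffled_aupr = test_aupr(X_train, rng.permutation(y_train), X_test, y_test)
delta_aupr = true_aupr - shuffled_aupr
```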

This is actually a bit similar to your result where the linear models had balanced accuracy < 0.5 after batch correction (if I'm remembering correctly).

> Other Hypotheses:
>
> 1. Program has a bug (everything looks alright to me implementation-wise, but given that your BE correction implementation uses mine as a starting point, I may be missing a shared bug)

When I met with Casey yesterday, we were talking about doing batch correction and then splitting into train/test sets (which is what we're both currently doing, as far as I can tell from your code) vs. doing batch correction on the train and test sets separately. He said you were thinking about switching to batch correcting train and test separately - does that sound right?

My plan for my next PR is to try something like this (either fitting two separate correction models, or batch correcting the training data and then using the learned linear regression coefficients to batch correct the test data); a sketch of the second option is below. Interested to see if that affects your results as well (if that is something you're working on).
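Something like this for the second option, reusing the per-label regression idea from the earlier sketch (again, illustrative names only; fitting happens on the training split, so the test set never influences the correction):

```python
import numpy as np

def fit_correction(X_train, y_train):
    """Fit per-feature OLS coefficients for the label on training data only."""
    design = np.column_stack([np.ones(len(y_train)), y_train.astype(float)])
    coefs, *_ = np.linalg.lstsq(design, X_train, rcond=None)
    return coefs  # (2, n_features)

def apply_correction(X, y, coefs):
    """Remove the label effect using coefficients learned on the training set."""
    design = np.column_stack([np.ones(len(y)), y.astype(float)])
    return X - design @ coefs

# fit on train only, then correct both splits with the same coefficients
coefs = fit_correction(X_train, y_train)
X_train_corrected = apply_correction(X_train, y_train, coefs)
X_test_corrected = apply_correction(X_test, y_test, coefs)
```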

> 2. All technical artifacts are linear, all important biology is nonlinear (but if there were perfect nonlinear signal, wouldn't the nonlinear models just use it in the first place?)
> 3. There is something nonlinear and correlated with the labels extremely strongly
> 4. Nonlinear correction artifacts as mentioned before, but in a binary case

Yeah, I think one of these would be my guess if there isn't a bug. I could maybe buy a bit of improvement with the nonlinear models, but the fact that the prediction is perfect or almost perfect in almost every case makes me think there's some sort of data leakage going on. Hopefully the next PR will shed some light - fingers crossed.