kundajelab / gecco-variants


Summary of classification model performance on Task V576 DNAse #6

Open annashcherbina opened 6 years ago

annashcherbina commented 6 years ago

Google spreadsheet with all information:

https://docs.google.com/spreadsheets/d/1gfbolLoB1o_oRHjGdV6ht5eTIjFHtTjIivlxsLugvp4/edit?usp=sharing

Performance matrix of models:

Yellow highlighting indicates performance on the training dataset; absence of yellow highlighting indicates performance on the test dataset.

Blue text indicates performance on "version 1" of the data labels (i.e. negatives from peaks present in other CRC samples but absent in the current CRC sample).

Green text indicates performance on "version 2" of the data labels (i.e. negatives from ENCODE DNase summits minus colon-specific data):

[image: performance matrix]

Loss curves for the most promising models in the performance matrix:

[image]

Interestingly, it appears the baseline models are actually outperforming the models with GC-balanced negative sets, dinucleotide-balanced negative sets, or with reverse-complement sequences added.
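(For reference, a minimal sketch of one way to build a GC-matched negative set. This is illustrative, not necessarily the exact pipeline used here; `positives` and `candidates` are hypothetical lists of DNA sequence strings.)

```python
import numpy as np

def gc_content(seq):
    """Fraction of G/C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def sample_gc_matched_negatives(positives, candidates, neg_per_pos=1, n_bins=20, seed=0):
    """Sample candidate negatives so their GC distribution matches the positives."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    pos_bins = np.digitize([gc_content(s) for s in positives], edges)
    cand_bins = np.digitize([gc_content(s) for s in candidates], edges)

    negatives = []
    for b in np.unique(pos_bins):
        # For each GC bin, draw neg_per_pos negatives per positive in that bin.
        n_needed = neg_per_pos * int(np.sum(pos_bins == b))
        pool = [c for c, cb in zip(candidates, cand_bins) if cb == b]
        if pool:
            picks = rng.choice(len(pool), size=min(n_needed, len(pool)), replace=False)
            negatives.extend(pool[i] for i in picks)
    return negatives
```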

Next steps: Try regression models

akundaje commented 6 years ago

These loss curves make no sense. Something is wrong with the training procedure or model spec.

-Anshul.


annashcherbina commented 6 years ago

Yes, it seems to basically be overfitting to the training data. Hence I'm hoping that some of the other architectures (i.e. those from Surag and Jacob) will be less prone to overfitting.

I will move away from Basset for this dataset.

akundaje commented 6 years ago

I'm pretty sure it's not the architecture. There seems to be something more fundamental going wrong here (an initialization or learning rate issue, or one of those silent row/column transposition errors). Can you triple-check how you are constructing the training and test sets? The model is either not learning in some of the plots or it's just overfitting like crazy. This dataset is really not that different from any other DNase dataset. There is no reason an architecture that works well almost universally should be failing miserably here.

We should ask Surag to train a model from scratch on your data to see what he gets so we have an independent sanity check.

-Anshul.


annashcherbina commented 6 years ago

I've triple-checked by running the same code on the DMSO and het data and reproducing the top-scoring performance on those datasets with this workflow and the basic Basset architecture. Hence, I'm inclined to think it's not a bug.

I can use Surag's code to train the model; that is the next thing on my to-do list.

akundaje commented 6 years ago

Can you post the learning curves for those datasets?


annashcherbina commented 6 years ago

The weird thing is that the baseline performance is actually quite good. And the dataset for the baseline model was generated with the same approach to determining negatives as the het and DMSO data -- those performance values actually match quite closely.

The new negative set is what's primarily driving the huge drop in auPRC and recallAtFDR50. Is it possible the new negative set is actually more difficult to learn than the old one?
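(For reference, a sketch of how these metrics can be computed with scikit-learn; recall at FDR 50% is taken here as the highest recall achieved at precision >= 0.5, which may differ in detail from the evaluation code used in this pipeline.)

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

def recall_at_fdr(y_true, y_score, fdr=0.5):
    """Highest recall achieved at precision >= 1 - fdr."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    ok = precision >= (1.0 - fdr)
    return float(recall[ok].max()) if ok.any() else 0.0

# Toy labels and scores, just to show the calls:
y_true = np.array([0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
print("auPRC:", average_precision_score(y_true, y_score))
print("recallAtFDR50:", recall_at_fdr(y_true, y_score, fdr=0.5))
```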

akundaje commented 6 years ago

It can't be. The original negatives were from other CRC samples, whereas the new negatives are from totally different cell types. There is no way it can be harder than the original set, because the negatives will contain totally different motif patterns. Also, dinucleotide- and GC-matched negatives should be trivial to classify against. You should be getting very high auPRCs for those runs. That's why I am convinced there is something fundamentally wrong - it's not the architecture, for sure.
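(A quick sanity check on the matching itself: compare pooled dinucleotide frequencies between positives and negatives. A minimal sketch; `positive_seqs` and `negative_seqs` are hypothetical lists of sequence strings.)

```python
from collections import Counter

def dinuc_freqs(seqs):
    """Pooled dinucleotide frequencies across a list of DNA sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - 1):
            counts[seq[i:i + 2]] += 1
    total = sum(counts.values())
    return {k: v / total for k, v in sorted(counts.items())}

# If the negatives are properly dinucleotide-matched, these should be
# nearly identical (positive_seqs / negative_seqs are hypothetical):
# print(dinuc_freqs(positive_seqs))
# print(dinuc_freqs(negative_seqs))
```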


akundaje commented 6 years ago

Let's simplify this. Train the following models for a single task:

1a. GC-matched negatives, 1 neg: 1 pos
1b. GC-matched negatives, 5 neg: 1 pos
2a. Dinucleotide-matched negatives, 1 neg: 1 pos
2b. Dinucleotide-matched negatives, 5 neg: 1 pos

1a and 2a should be trivial to beat; 1b and 2b should be a little harder, but still easy to beat.


annashcherbina commented 6 years ago

Here's what I'm getting for the het model losses with the same code and basic Basset. I used ENCODE initializations for the model, and it was a multi-task model:

[image]

And for DMSO with the same code and basic Basset. I also used ENCODE initializations and a multi-task model:

[image]

The average performance values across tasks were:

Hets

[images]

DMSO

[images]

akundaje commented 6 years ago

It's the same problem. The validation loss curves are really bad.


annashcherbina commented 6 years ago

Yes, but it's the best performance we've gotten on these two datasets... (Also, for het, training stopped at epoch 6, i.e. early stopping meant epoch 6 was the last one used.)

akundaje commented 6 years ago

Let's debug using the GC- and dinucleotide-matched negative experiments as I suggested before. If the validation curves on those are not well behaved, we know for sure there is something wrong.


annashcherbina commented 6 years ago

By best performance, I mean the best performance from all hyperparameter searches and applications of Basset to the data. This is why I am a fan of moving on to other architectures.

akundaje commented 6 years ago

I can assure you the issue is not the architecture here. It's something more fundamental. There is a bug somewhere in the model spec or evaluation code or something else. Loss curves should never look like this. It means there is something fundamentally wrong.


annashcherbina commented 6 years ago

The models trained on the "easy" datasets achieve near-perfect performance (auPRC values in the range 0.93-0.97, recall at FDR50 in the range 0.97-1.00):

[image]

The corresponding loss curves are:

[image]

The loss curves have similar behavior to those we observed previously.

What this is telling me is that a single epoch of training is sufficient to learn the data -- especially with ENCODE initializations. The ENCODE initializations prove to be very beneficial on these toy datasets, as they increase the auPRC for the 1 neg: 1 pos GC-balanced dataset from 0.84 to 0.97.

annashcherbina commented 6 years ago

These are the loss curves when I plot the loss with fewer examples per epoch of training (i.e. one epoch = 5 batches of size 1000).

Are these closer to what we'd expect to see?

[image]

If so, then the reason the previous curves look different is that I use large epochs (i.e. 1 epoch = 700 batches of size 1000). With the large epochs, one pass is usually enough to learn the data, and anything further just leads to overfitting.
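(Roughly, the smaller-epoch setup in Keras; a sketch with a toy stand-in model and a hypothetical data generator, not the actual training code.)

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Dense

def batches(batch_size=1000, seq_len=1000):
    """Hypothetical generator of one-hot sequence batches with binary labels."""
    while True:
        x = np.random.rand(batch_size, seq_len, 4)        # stand-in for one-hot DNA
        y = np.random.randint(0, 2, size=(batch_size, 1))
        yield x, y

# Toy stand-in for the Basset-style model:
model = Sequential([
    Conv1D(32, 19, activation="relu", input_shape=(1000, 4)),
    GlobalMaxPooling1D(),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# steps_per_epoch=5 makes one "epoch" = 5 batches of 1000 examples, so the
# loss curve is logged much more finely than with steps_per_epoch=700
# (a full pass through 700,000 training examples).
history = model.fit_generator(batches(), steps_per_epoch=5, epochs=20,
                              validation_data=batches(), validation_steps=5)
```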

And just for completeness, the test-set accuracy of the model:

[image]

annashcherbina commented 6 years ago

This would explain why the performance (auPRC/recallAtFDR50) was the best we've attained to date, with the loss curves showing no improvement beyond the first epoch.

akundaje commented 6 years ago

This looks like a classic case of the learning rate being too high.

http://cs231n.github.io/neural-networks-3/

If the network has a strong initialization, you should be using a low learning rate to fine-tune the model.
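(In Keras that would look something like this; a sketch, with a hypothetical pretrained-weights file.)

```python
from keras.models import load_model
from keras.optimizers import Adam

# Hypothetical: start from weights pretrained on ENCODE DNase data, then
# recompile with a learning rate well below the Adam default of 1e-3 so
# the first updates don't wipe out the strong initialization.
model = load_model("encode_pretrained_basset.h5")   # hypothetical filename
model.compile(optimizer=Adam(lr=1e-5), loss="binary_crossentropy")
```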

Also, do you use early stopping, and if so, what metric are you using for early stopping? Is the performance you are reporting from the end of all epochs in the plots, or is it based on the best epoch (e.g. the end of epoch 1)?


annashcherbina commented 6 years ago

I thought that the model achieving auPRC ~1 on all four toy datasets was an indication that everything is working fine, so I guess I'm confused about what problem we are trying to solve. What do we expect the loss curves to look like? I'm not sure I understand why the ones we are observing are different from what usually shows up in the literature.

Early stopping is triggered by no drop in validation loss for 5 consecutive epochs.

The performance I am reporting is based on early stopping (generally after the first full epoch, i.e. a pass through 700,000 training examples).

The learning rate used for the "toy" models was 0.001. I have tried learning rates ranging from 0.00001 to 0.01, with no major change in auPRC for the GECCO datasets. I can post performance values for those, but they are within a few percent of the ones above.
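(For concreteness, the early-stopping setup as a Keras sketch; the checkpoint filename is illustrative.)

```python
from keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop when validation loss has not improved for 5 consecutive epochs.
    EarlyStopping(monitor="val_loss", patience=5),
    # Keep the weights from the best epoch, so reported performance reflects
    # the best model rather than the final (possibly overfit) one.
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
]
# ...then pass callbacks=callbacks to model.fit / model.fit_generator.
```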

annashcherbina commented 6 years ago

For example, with all parameters kept constant but the learning rate varied:

[image]

The graph on the left is for LR = 0.001; the graph on the right is for LR = 0.00001.

(This is on the full dataset, not the toy datasets.) The curves don't improve after the first epoch because 700,000 examples are enough to train the model. If we use smaller epochs, we get more traditional-looking learning curves with the exponential decay pattern.

annashcherbina commented 6 years ago

Updated learning curves for the negative datasets. The blue box on the learning curve graphs indicates the stopping epoch for which "Trained" accuracy is reported:

GC-balanced, 1 neg: 1 pos

[image]

GC-balanced, 5 neg: 1 pos

I tried lr = 0.001 and lr = 0.0001 for this negative set; though the curve for the lower lr looks more like what we'd like to see, the auPRC is higher for the higher lr.

[image]

Dinucleotide-balanced, 1 neg: 1 pos

[image]

Dinucleotide-balanced, 5 neg: 1 pos

[image]

Negatives from (ENCODE - CRC cell types), 10 neg: 1 pos

[image]

Negatives for V576 DNase from the Scacheri sample matrix, 10 neg: 1 pos

[image]

akundaje commented 6 years ago

Ok great. That looks good.

Now you can try switching to the residual architecture from Surag. Mahfuza said it gave her a boost on her data as well.

Then we can switch to regression.

-A
