kundajelab / gecco-variants


Negative sets for model training #5

Open · annashcherbina opened this issue 6 years ago

annashcherbina commented 6 years ago

The current negative sets are causing the model to overfit the training data and generalize poorly to validation data. These negatives are generated by identifying peaks that are absent in a specific CRC sample but present in one or more other CRC samples; in addition, genomic regions that are not accessible in any ENCODE cell type are used as negatives. This is not working, most likely because the CRC samples are too similar to one another, so peaks present in one sample and absent in another do not answer our question of interest.

[image]

Next Steps:

All to be repeated in 3 folds to get variance and model-stability information:

  1. GC-matched negatives from the genome (per Akshay's code; see the sketch after this list)

  2. Dinucleotide-matched negatives from the genome (also per Akshay's code)

  3. Universal negatives -- these are peaks that are present in non-CRC tissues in ENCODE/Roadmap, but absent in all CRC samples

  4. Whole genome training w/ GenomeLake

  5. Balanced batches for each negative set

  6. Combine (1-4) above via curriculum learning https://ronan.collobert.com/pub/matos/2009_curriculum_icml.pdf
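For reference, here is a minimal sketch of what the GC-matched sampling in (1) could look like. This is illustrative only, not Akshay's actual code; the function names and defaults are hypothetical. Dinucleotide matching in (2) works analogously, binning on 2-mer frequencies instead of GC fraction.

```python
# Illustrative only: bin candidate background regions by GC fraction and,
# for each positive, sample negatives from the matching bin.
import random
from collections import defaultdict

def gc_fraction(seq):
    """GC content of a DNA string, ignoring Ns."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / max(acgt, 1)

def gc_matched_negatives(pos_seqs, candidate_seqs, neg_per_pos=10, bins=20):
    """Sample negatives whose GC distribution matches the positives.

    pos_seqs / candidate_seqs are lists of DNA strings (e.g. 1 kb regions);
    returns a list of sampled negative sequences.
    """
    by_bin = defaultdict(list)
    for seq in candidate_seqs:
        by_bin[int(gc_fraction(seq) * bins)].append(seq)

    sampled = []
    for seq in pos_seqs:
        b = int(gc_fraction(seq) * bins)
        # Fall back to neighboring bins if the exact bin has no candidates.
        for offset in range(bins + 1):
            pool = (by_bin.get(b, []) if offset == 0
                    else by_bin.get(b - offset, []) + by_bin.get(b + offset, []))
            if pool:
                sampled.extend(random.sample(pool, min(neg_per_pos, len(pool))))
                break
    return sampled
```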

annashcherbina commented 6 years ago

I have trained a model on V576 DNase with the following approach:

  1. Begin with Daniel's DNase peak set from all ENCODE cell types. Remove peaks from colorectal cancer, intestine, and colorectal cancer cell line samples.

  2. Extract 1 kb regions centered at the peak summits.

  3. Extract 1 kb regions centered at the peak summits in the V576 DNase dataset. Label these as positive, and label the adjoining 1 kb regions as ambiguous (i.e., removed from training).

  4. Subtract the 3 kb regions (1 kb positive + 2 kb ambiguous) in (3) from the combined ENCODE regions in (2). The "difference" regions form the negative set.

  5. Sample the negative set from (4) to achieve a 10:1 negative-to-positive ratio; see the region-arithmetic sketch below. (Note to self: the dataset was generated using utils functions from this repository: https://github.com/kundajelab/anna_utils/tree/master/seq_utils)

  6. Train Basset architecture, initializing from ENCODE weights:

[image]

The validation loss did not decrease.
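As an aside, the region arithmetic in steps 2-5 could be expressed with pybedtools roughly as follows. File names and the hg19 assembly here are placeholder assumptions; the actual dataset was built with the anna_utils seq_utils helpers linked above.

```python
# Rough sketch of steps 2-5 (placeholder file names; not the actual pipeline).
import random
import pybedtools

FLANK = 500  # 1 kb windows = summit +/- 500 bp

# Step 2: 1 kb windows around ENCODE peak summits (CRC/intestine samples already removed).
encode = pybedtools.BedTool("encode_non_crc_summits.bed").slop(b=FLANK, genome="hg19")

# Step 3: 1 kb positives around V576 summits, widened by 1 kb per side to mark
# the adjoining regions as ambiguous (3 kb total).
v576_pos = pybedtools.BedTool("v576_summits.bed").slop(b=FLANK, genome="hg19")
v576_ambig = v576_pos.slop(b=1000, genome="hg19")

# Step 4: drop any ENCODE window that touches the positive/ambiguous regions;
# what remains is the candidate negative set.
candidates = encode.subtract(v576_ambig, A=True)

# Step 5: downsample to roughly 10 negatives per positive.
n_neg = min(10 * v576_pos.count(), candidates.count())
negatives = pybedtools.BedTool(random.sample(list(candidates), n_neg))
negatives.saveas("v576_negatives.bed")
```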

  7. Train Basset architecture, initializing from random weights:

[image]

The validation loss again did not decrease.

  8. Apply dropout of 0.8 after both dense layers and lower the learning rate from 0.001 to 0.0001 (https://github.com/kundajelab/deeplearning/blob/annashch-branch/gecco/basset_architecture_single_task_RegularizedDropout.py); a rough sketch follows below.

[image]
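For concreteness, a hedged Keras sketch of step 8; layer sizes are placeholders rather than the exact architecture, which lives in the linked basset_architecture_single_task_RegularizedDropout.py.

```python
# Rough sketch only: Basset-like model with dropout 0.8 after both dense
# layers and the learning rate lowered from 1e-3 to 1e-4.
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout
from keras.optimizers import Adam

model = Sequential([
    Conv1D(300, 19, activation="relu", input_shape=(1000, 4)),
    MaxPooling1D(3),
    Conv1D(200, 11, activation="relu"),
    MaxPooling1D(4),
    Flatten(),
    Dense(1000, activation="relu"),
    Dropout(0.8),   # strong dropout after the first dense layer
    Dense(1000, activation="relu"),
    Dropout(0.8),   # and after the second dense layer
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer=Adam(lr=1e-4), loss="binary_crossentropy",
              metrics=["accuracy"])
```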

  9. Add strong L1 = 0.01 and L2 = 0.01 regularization on the first and last convolution layers and the first dense layer (https://github.com/kundajelab/deeplearning/blob/annashch-branch/gecco/basset_architecture_single_task_RegularizedL1L2.py)

[image]

The model no longer overfits the training data, but performance on validation data is worse than the original baseline (see table below).

  10. Combine dropout of 0.8 after the first and last convolution layers with L1 = 0.0001, L2 = 0.0001 regularization on the first and last convolution layers; use ELU activations between the convolution layers (rough sketch below):

[image]

Model performance was: [image]
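A minimal Keras sketch of the step 10 configuration, again with placeholder layer sizes rather than the exact architecture from the repo:

```python
# Rough sketch only: dropout 0.8 after the first and last convolution layers,
# L1 = L2 = 1e-4 on those same layers, and ELU activations between the
# convolution layers.
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout
from keras.regularizers import l1_l2

reg = l1_l2(l1=1e-4, l2=1e-4)

model = Sequential([
    Conv1D(300, 19, activation="elu", kernel_regularizer=reg,
           input_shape=(1000, 4)),
    Dropout(0.8),            # after the first convolution layer
    MaxPooling1D(3),
    Conv1D(200, 11, activation="elu"),
    MaxPooling1D(4),
    Conv1D(200, 7, activation="elu", kernel_regularizer=reg),
    Dropout(0.8),            # after the last convolution layer
    MaxPooling1D(4),
    Flatten(),
    Dense(1000, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```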

Next steps: use GC- and dinucleotide-balanced negatives, and run a hyperparameter search over dropout and regularization values.

akundaje commented 6 years ago

Feels like there's a bug somewhere or some parameter is really off. It shouldn't be this hard to prevent overfitting, and the performance is also quite abysmal. What's the data quality of this DNase sample? E.g., what's the FRiP score from the pipeline?

Anshul.


akundaje commented 6 years ago

It's mostly classifying everything as negative. So it's not really learning the positives.


akundaje commented 6 years ago

Actually, scratch that. That's just the second model in the table that isn't learning anything. Model 3 is learning, but something seems off. The GC and dinuc negatives should reveal any serious issues with the learning, because it should be trivial to conquer them.
