kundajelab / gecco-variants


Investigation of difference in performance between randomly initialized and ENCODE-initialized models (gc-balanced negatives, 1 neg: 1 pos ratio) #8

Open annashcherbina opened 6 years ago

annashcherbina commented 6 years ago

**1.**

[image]

It looks like the randomly initialized model is calling a large number of false positives: [image]

**2.**

I performed several HOMER analyses to determine which motifs show up as false positives in the randomly initialized model but not in the ENCODE-initialized model (a sketch of the HOMER invocation follows the list below). The columns in the resulting Excel file (https://drive.google.com/open?id=1BBA1USo87gmauo3W9b93mwzQLHoZKJnZ) are labeled as:

  1. Motifs enriched in the V576 DNase peaks compared to hg19 background
  2. Motifs enriched in the gc-balanced 1:1 positive training set compared to a background of the gc-balanced 1:1 negative training set
  3. Motifs enriched in the gc-balanced 1:1 positive test set compared to a background of the gc-balanced 1:1 negative test set
  4. Motifs enriched in ENCODE-init model positive predictions vs ENCODE-init negative predictions
  5. Motifs enriched in random-init model positive predictions vs random-init negative predictions
  6. Motifs enriched in false positives in the random-init model vs background of true positives
  7. Motifs enriched in false positives in the random-init model vs background of false positives in ENCODE-init model.
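Each of these comparisons is a HOMER foreground-vs-background motif enrichment run. As a minimal sketch, here is roughly what one such invocation looks like (column 6: random-init false positives vs. a background of true positives), assuming HOMER is on the PATH; the BED file names below are hypothetical placeholders, not the actual files used in this analysis:

```python
# Sketch of one HOMER enrichment comparison via findMotifsGenome.pl,
# using a custom background set instead of random genomic regions.
import subprocess

foreground = "random_init.false_positives.bed"  # hypothetical file name
background = "random_init.true_positives.bed"   # hypothetical file name

subprocess.run(
    [
        "findMotifsGenome.pl",
        foreground,
        "hg19",               # genome used throughout this issue
        "homer_fp_vs_tp",     # output directory
        "-bg", background,    # custom background set
        "-size", "given",     # score the full region, not a fixed window
    ],
    check=True,
)
```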

Tab 1 (green): The most significant motifs enriched in V576 vs the hg19 background; learned perfectly by both the random-init & ENCODE-init models: [image]

Tab 2 (blue): Motifs not enriched in positive examples in the training and test sets, but enriched in the model predictions of positives vs negatives: [image] [image]

Tab 3 (gray): Motifs enriched in V576 vs hg19, but not enriched in our train/test sets -- unclear importance: [image]

Tab 4 (yellow): Motifs enriched in the train/test sets, but not learned by the ENCODE-init model, the random-init model, or both: [image]

Tab 5 (orange): Motifs enriched in both false positive and true positive examples in the randomly initialized model: [image]

Tab 6 (red): Motifs enriched only in the false positive predictions of the randomly initialized model -- this is the group we primarily want to target to improve agreement between the ENCODE-init & random-init models. There are a number of Sp motifs & zinc fingers in this group: [image] [image] [image]

**3.**

Running gradient x input on the trained ENCODE-initialized and randomly initialized models yields: https://github.com/kundajelab/deeplearning/blob/annashch-branch/gecco/gc.1pos.1neg/gradient_inputs.ipynb

The main takeaway is the much stronger motif patterns in the ENCODE-initialized model. There is also considerable overlap between columns 6 & 7 above and the grad x input tracks. For example:

[image]
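For reference, a minimal sketch of the gradient x input computation, assuming a tf.keras model and one-hot encoded sequences of shape (batch, length, 4); the names `model` and `onehot_seqs` are placeholders, and the notebook linked above is the actual implementation:

```python
# Gradient x input attribution: d(output)/d(input) * input.
import tensorflow as tf

def gradient_x_input(model, onehot_seqs):
    """Return per-base attribution scores for a batch of sequences."""
    x = tf.convert_to_tensor(onehot_seqs, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        preds = model(x)  # single-task output: (batch, 1)
    grads = tape.gradient(preds, x)
    # Multiplying by the one-hot input zeroes out non-observed bases,
    # leaving one score per position for the base actually present.
    return (grads * x).numpy()
```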

**4.**

I added 2000 negatives for the following motifs: TCF3, KLF4, SPZ1, OBOX5, PAX8. The reasoning is that these are enriched in the false positive predictions of the randomly initialized model, so augmenting the training set with negative regions that contain these motifs might help (a sketch of one way to select such negatives is below).
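One way to pick such negatives is to scan candidate negative sequences with each motif's PWM and keep those with a strong match. A minimal sketch under stated assumptions: the log-odds PWM values, the match threshold, and the candidate-sequence source are all placeholders (the real analysis presumably used the HOMER matrices for TCF3, KLF4, SPZ1, OBOX5, and PAX8):

```python
# Select candidate negatives that contain a strong PWM match.
import numpy as np

BASES = "ACGT"

def max_pwm_score(seq, pwm):
    """Best match score of a log-odds PWM (shape motif_len x 4) over a
    sequence of A/C/G/T longer than the motif."""
    idx = [BASES.index(b) for b in seq]
    L = pwm.shape[0]
    return max(pwm[np.arange(L), idx[i:i + L]].sum()
               for i in range(len(idx) - L + 1))

def pick_motif_negatives(candidates, pwm, threshold, n=2000):
    """Keep up to n candidate negative sequences whose best PWM hit
    exceeds the threshold."""
    hits = [s for s in candidates if max_pwm_score(s, pwm) > threshold]
    return hits[:n]
```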

The resulting performance of the models: [image]

So while there was no overall change in auPRC for either the ENCODE-init or random-init model, the false positive rate of the randomly initialized model dropped by 5%.

We observe 1101 fewer false positive examples specific to the randomly initialized model: [image] However, we also observe 598 additional false negatives specific to the randomly initialized model.

So overall, this approach did help with the false positive problem.
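As a side note on the evaluation: auPRC is threshold-free, so it can stay flat even while the false positive count at a fixed threshold drops, which matches what we see here. A minimal sketch of this comparison, assuming label and predicted-probability arrays (all names are placeholders):

```python
# Compare auPRC and thresholded FP/FN counts between two models.
import numpy as np
from sklearn.metrics import average_precision_score

def fp_fn_counts(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob > threshold).astype(int)
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return fp, fn

def compare(y_true, probs_encode, probs_random, threshold=0.5):
    for name, p in [("ENCODE-init", probs_encode),
                    ("random-init", probs_random)]:
        fp, fn = fp_fn_counts(y_true, p, threshold)
        print(name,
              "auPRC=", average_precision_score(y_true, p),
              "FP=", fp, "FN=", fn)
```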

**5.** Future steps:

- Additional data augmentation of the training set with hard-to-learn motifs (i.e. those that show up as false positives in the randomly initialized model)

- Add specific tasks to the model focused on learning the difficult motifs (currently the model is single-task and just learns peak presence/absence in the V576.bed file); see the sketch after this list.
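A minimal sketch of what such a multi-task setup could look like: alongside the main peak-presence output, add one auxiliary sigmoid output per difficult motif (does the region contain that motif?). The architecture and hyperparameters below are placeholder assumptions; only the multi-output idea comes from this issue:

```python
# Multi-task model: main peak task + auxiliary motif-presence tasks.
import tensorflow as tf
from tensorflow.keras import layers

def build_multitask_model(seq_len=1000, n_motif_tasks=5):
    inputs = tf.keras.Input(shape=(seq_len, 4))
    x = layers.Conv1D(64, 21, activation="relu")(inputs)
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(64, activation="relu")(x)
    peak = layers.Dense(1, activation="sigmoid", name="peak")(x)
    motifs = layers.Dense(n_motif_tasks, activation="sigmoid",
                          name="motif_presence")(x)
    model = tf.keras.Model(inputs, [peak, motifs])
    # Down-weight the auxiliary motif tasks relative to the main task.
    model.compile(optimizer="adam",
                  loss={"peak": "binary_crossentropy",
                        "motif_presence": "binary_crossentropy"},
                  loss_weights={"peak": 1.0, "motif_presence": 0.2})
    return model
```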

annashcherbina commented 6 years ago

Adding tasks to the model to learn the motifs enriched in the randomly initialized model's false positive set did not lead to an improvement in auPRC on the main task:

[image]