It looks like the randomly-initialized model is calling a high number of false positives:
I performed several HOMER analyses to determine which motifs are enriched in the false positives of the randomly initialized model but not of the ENCODE-initialized model. The columns in the resulting Excel file (https://drive.google.com/open?id=1BBA1USo87gmauo3W9b93mwzQLHoZKJnZ) are labeled as follows:
1. Motifs enriched in the V576 DNase peaks vs. the hg19 background
2. Motifs enriched in the GC-balanced 1:1 positive training set vs. a background of the GC-balanced 1:1 negative training set
3. Motifs enriched in the GC-balanced 1:1 positive test set vs. a background of the GC-balanced 1:1 negative test set
4. Motifs enriched in the ENCODE-init model's positive predictions vs. its negative predictions
5. Motifs enriched in the random-init model's positive predictions vs. its negative predictions
6. Motifs enriched in the random-init model's false positives vs. a background of true positives
7. Motifs enriched in the random-init model's false positives vs. a background of the ENCODE-init model's false positives
Tab 1 (green): The most significant motifs enriched in V576 vs. the hg19 background; learned perfectly by both the random-init and ENCODE-init models.
Tab 2 (blue): Motifs not enriched in the positive examples of the training and test sets, but enriched in the models' positive vs. negative predictions.
Tab 3 (gray): Motifs enriched in V576 vs. hg19, but not enriched in our train/test sets; unclear importance.
Tab 4 (yellow): Motifs enriched in the train/test sets, but not learned by the ENCODE-init model, the random-init model, or both.
Tab 5 (orange): Motifs enriched in both the false-positive and true-positive examples of the randomly-initialized model.
Tab 6 (red): Motifs enriched only in the false-positive predictions of the randomly-initialized model. This is the group we primarily want to target to improve agreement between the ENCODE-init and random-init models; it contains a number of Sp motifs and zinc fingers.
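The pairwise comparisons above map directly onto HOMER `findMotifsGenome.pl` runs with an explicit background peak set. A minimal sketch of scripting them (the BED file names are hypothetical placeholders):

```python
# Sketch of scripting the HOMER enrichment comparisons listed above.
# File names (fp_random.bed, tp_random.bed) are hypothetical placeholders.
import subprocess

def build_homer_cmd(target_bed, genome, out_dir, background_bed=None):
    """Build a HOMER findMotifsGenome.pl command for a target-vs-background
    motif enrichment run; with no -bg, HOMER uses a genomic background."""
    cmd = ["findMotifsGenome.pl", target_bed, genome, out_dir, "-size", "given"]
    if background_bed is not None:
        cmd += ["-bg", background_bed]
    return cmd

# Column 6: false positives of the random-init model vs. its true positives.
cmd = build_homer_cmd("fp_random.bed", "hg19", "homer_fp_vs_tp",
                      background_bed="tp_random.bed")
# subprocess.run(cmd, check=True)  # uncomment on a machine with HOMER installed
print(" ".join(cmd))
```

The same helper covers the other columns by swapping the target and background BED files.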
I added 2000 negatives for the following motifs: TCF3, KLF4, SPZ1, OBOX5, and PAX8. The reasoning is that these motifs are enriched in the false-positive predictions of the randomly initialized model, so augmenting the training set with negative regions that contain them might help.
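A sketch of one way to implement this augmentation, assuming candidate negative regions are available as sequences. The consensus strings below are toy placeholders rather than curated motif models, and a real version would also scan the reverse complement:

```python
# Illustrative sketch of the augmentation step: sample negative regions that
# contain a motif match and add them to the training negatives. The consensus
# strings are toy placeholders, not curated motif models.
import random

TOY_CONSENSUS = {"KLF4": "GGGGTGGGG", "TCF3": "CAGCTG"}  # hypothetical

def contains_motif(seq, consensus):
    """Naive exact-match scan on the forward strand only."""
    return consensus in seq.upper()

def pick_motif_negatives(candidates, consensus, n, seed=0):
    """Sample up to n candidate negative sequences containing the motif."""
    hits = [s for s in candidates if contains_motif(s, consensus)]
    random.Random(seed).shuffle(hits)
    return hits[:n]

candidates = ["ACAGCTGTTT", "AAAAAAAAAA", "TTCAGCTGAA"]
extra_negs = pick_motif_negatives(candidates, TOY_CONSENSUS["TCF3"], n=2000)
print(len(extra_negs))  # only 2 toy candidates match here; in practice, 2000
```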
The resulting performance of the models:
While there was no overall change in auPRC for either the ENCODE-init or random-init model, the false positive rate of the randomly-initialized model dropped by 5%.
We observe 1101 fewer false positive examples specific to the randomly-initialized model, though also 598 additional false negatives specific to it.
So overall, this approach did help with the false positive problem.
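The model-specific false-positive and false-negative counts above reduce to set arithmetic over region identifiers. A self-contained sketch with toy data (the region names and labels are invented for illustration):

```python
# Minimal sketch of computing model-specific false positives: compare each
# model's predicted-positive set against the labels, then take set
# differences between the models. Region IDs and labels are toy data.
def confusion_sets(predicted_pos, true_pos, universe):
    """Return (false positives, false negatives) for one model."""
    fp = predicted_pos - true_pos
    fn = (true_pos & universe) - predicted_pos
    return fp, fn

universe = {f"region{i}" for i in range(6)}
true_pos = {"region0", "region1", "region2"}
pred_random = {"region0", "region3", "region4"}   # random-init predictions
pred_encode = {"region0", "region1", "region3"}   # ENCODE-init predictions

fp_rand, fn_rand = confusion_sets(pred_random, true_pos, universe)
fp_enc, fn_enc = confusion_sets(pred_encode, true_pos, universe)

# Errors specific to the random-init model:
fp_specific = fp_rand - fp_enc
fn_specific = fn_rand - fn_enc
print(sorted(fp_specific), sorted(fn_specific))  # → ['region4'] ['region1']
```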
Future steps:
- Additional data augmentation of the training set with hard-to-learn motifs (i.e., those enriched in the randomly-initialized model's false positives).
- Add specific tasks to the model focused on learning the difficult motifs. (Currently the model is single-task and just learns peak presence/absence in the V576.bed file.)
Adding tasks to the model to learn the motifs enriched in the randomly-initialized model's false-positive set did not lead to an improvement in auPRC on the main task.
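One way such auxiliary tasks can be wired in is as extra binary labels per example, one per difficult motif, alongside the main peak label. A minimal sketch of the label construction (the motif consensus strings are hypothetical placeholders):

```python
# Sketch of the auxiliary-task setup: alongside the main peak/no-peak label,
# attach one binary label per difficult motif so a multi-task model can be
# trained on all of them. Consensus strings are toy placeholders.
AUX_MOTIFS = {"TCF3": "CAGCTG", "KLF4": "GGGGTGGGG"}  # hypothetical

def make_labels(seq, is_peak):
    """Return [main_label, motif_label_1, ..., motif_label_k],
    with motif labels in sorted motif-name order."""
    labels = [int(is_peak)]
    for name, consensus in sorted(AUX_MOTIFS.items()):
        labels.append(int(consensus in seq.upper()))
    return labels

print(make_labels("AACAGCTGTT", is_peak=True))  # → [1, 0, 1]
```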
Running gradient x input on the trained ENCODE-initialized and randomly-initialized models yields: https://github.com/kundajelab/deeplearning/blob/annashch-branch/gecco/gc.1pos.1neg/gradient_inputs.ipynb The main takeaway is the much stronger motif patterns in the ENCODE-initialized model. There is also substantial overlap between columns 6 and 7 above and the grad x input tracks. For example:
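As a toy illustration of why grad x input highlights motifs: for a purely linear scorer over one-hot DNA, the input gradient is the weight matrix itself, so the attribution at each position is just the weight of the observed base. (In the real models the gradient comes from backpropagation; the weights below are made up.)

```python
# Toy gradient x input for sequence models: with a linear scorer over
# one-hot-encoded DNA, the gradient w.r.t. the input equals the weight
# matrix, so grad * input picks out the weight of each observed base.
BASES = "ACGT"

def one_hot(seq):
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]

def grad_x_input(seq, weights):
    """weights: per-position, per-base scores (the 'gradient' of a linear
    model). Returns one attribution value per sequence position."""
    x = one_hot(seq)
    return [sum(w * v for w, v in zip(weights[i], x[i]))
            for i in range(len(seq))]

# Hypothetical weights favoring the subsequence "CA" at positions 1-2.
weights = [[0, 0, 0, 0], [0, 2.0, 0, 0], [3.0, 0, 0, 0], [0, 0, 0, 0]]
print(grad_x_input("ACAT", weights))  # → [0.0, 2.0, 3.0, 0.0]
```

Positions matching the (hypothetical) motif get large attributions, which is why strong, well-learned motifs stand out in the tracks.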