kundajelab / gecco-variants


Input optimizations: Personal variants, global allele frequencies #3

Open annashcherbina opened 6 years ago

annashcherbina commented 6 years ago

This is what we discussed at the meeting today. I am just documenting it on GitHub issues for future reference.


Next Steps:

Train on Oana's LCL QTL data

To resolve: Should allele frequencies be normalized for population? That is, should PC1 from the population-stratification PCA be included as a covariate in the deep learning model, or should the input counts be normalized a priori? The plan is to try both approaches. The model is learning both local and global effects, and it makes sense to correct for population frequencies for local effects but not for global effects.

akundaje commented 6 years ago

The issue isn't about allele frequencies being normalized across populations, but rather whether the histone/DNase data needs to be corrected for global variance in population structure (PEER factors). We can discuss this at length next time.
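For illustration, one simple way to correct a per-individual signal for a global factor is to regress the factor out and keep the residuals. The sketch below is an assumption, not the approach settled on in this thread: it uses a single hypothetical covariate (`pc1`) and ordinary least squares as a stand-in for PEER factors, and all names and values are made up.

```python
def residualize(signal, covariate):
    """Remove the linear effect of one covariate from a per-sample signal.

    Fits signal ~ intercept + slope * covariate by ordinary least squares
    and returns the residuals, which are centered at zero.
    """
    if len(signal) != len(covariate):
        raise ValueError("signal and covariate must have the same length")
    n = len(signal)
    mean_x = sum(covariate) / n
    mean_y = sum(signal) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, signal))
    # A constant covariate explains nothing, so the slope is set to zero
    slope = sxy / sxx if sxx > 0 else 0.0
    return [y - mean_y - slope * (x - mean_x) for x, y in zip(covariate, signal)]


# Hypothetical values: one peak's signal across five individuals and their PC1 scores
pc1 = [-1.0, -0.5, 0.0, 0.5, 1.0]
peak_signal = [3.1, 3.9, 5.2, 6.0, 6.8]
corrected = residualize(peak_signal, pc1)
```

The alternative raised above, passing the covariate to the model as an extra input, leaves the signal untouched and lets the model decide how much of the global effect to use.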

On Thu, May 3, 2018, 10:47 PM annashcherbina notifications@github.com wrote:

This is what we discussed at the meeting today. I am just documenting on github issues for future reference.

V576_H3K27ac (Task #19) V576_DNAse (Task #28)

| Metric | Baseline | Subject Variants |
| --- | --- | --- |
| recallAtFDR50 | 0.33 | 0.34 |
| recallAtFDR20 | 0.14 | 0.14 |
| auroc_vals | 0.81 | 0.82 |
| auprc_vals | 0.39 | 0.40 |
| unbalanced_accuracy_vals | 0.90 | 0.89 |
| balanced_accuracy_vals | 0.71 | 0.72 |
| positives_accuracy_vals | 0.50 | 0.52 |
| negatives_accuracy_vals | 0.92 | 0.92 |
| num_positive_vals in test set | 10705 | 10705 |
| num_negative_vals in test set | 151614 | 151614 |
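The thread does not define `recallAtFDR50` or `recallAtFDR20`. A common reading, assumed here, is the highest recall reachable at any score threshold whose false discovery rate (FP / (TP + FP)) stays at or below the cutoff. The function name and the toy data below are hypothetical.

```python
def recall_at_fdr(labels, scores, fdr_cutoff):
    """Largest recall over all score thresholds whose FDR is <= fdr_cutoff.

    labels: 0/1 ground truth; scores: predicted scores, higher means positive.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(ranked):
        # Consume every example tied at this score so a threshold never splits ties
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        i = j
        fdr = fp / (tp + fp)
        if fdr <= fdr_cutoff:
            best = max(best, tp / total_pos)
    return best


# Toy example: three positives and three negatives
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_recall_fdr50 = recall_at_fdr(toy_labels, toy_scores, 0.5)
toy_recall_fdr20 = recall_at_fdr(toy_labels, toy_scores, 0.2)
```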

[image: image] https://user-images.githubusercontent.com/5261545/39614321-dad14e80-4f23-11e8-848b-7d6c4e510e7e.png

[image: image] https://user-images.githubusercontent.com/5261545/39614333-e49f4994-4f23-11e8-8626-7ef8d3d955c4.png

Next Steps:

Train on Oana's LCL QTL data

  • Phased allele frequencies in an 8-channel input matrix
  • Non-phased allele frequencies in a 4-channel input matrix
  • Concatenate all 76 genomes to train a single-task model
  • Normalized counts as output (i.e., a regression model), trained with Poisson loss and with binomial loss
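A minimal sketch of how the two input encodings and the Poisson loss in the list above might look. The layout is an assumption, since the thread does not specify it: the phased matrix stacks one-hot encodings of the two haplotypes (4 + 4 channels), and the non-phased matrix stores the per-base allele fraction across both haplotypes. All function names are hypothetical.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """One row per position, one column per base in BASES (other symbols give all zeros)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def encode_phased(hap1, hap2):
    """8 channels per position: one-hot haplotype 1 followed by one-hot haplotype 2."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must have the same length")
    return [a + b for a, b in zip(one_hot(hap1), one_hot(hap2))]


def encode_unphased(hap1, hap2):
    """4 channels per position: fraction of the individual's two alleles that are each base."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must have the same length")
    return [
        [(x + y) / 2 for x, y in zip(a, b)]
        for a, b in zip(one_hot(hap1), one_hot(hap2))
    ]


def poisson_loss(observed, predicted_rate, eps=1e-8):
    """Mean Poisson negative log-likelihood, dropping the term that is constant in the rate."""
    terms = [r - k * math.log(r + eps) for k, r in zip(observed, predicted_rate)]
    return sum(terms) / len(terms)


# A heterozygous site (G on one haplotype, C on the other) at the second position
phased = encode_phased("AG", "AC")      # shape: 2 positions x 8 channels
unphased = encode_unphased("AG", "AC")  # shape: 2 positions x 4 channels
```

In the non-phased encoding the heterozygous position becomes 0.5 in both the C and G channels, so the information about which haplotype carries which allele is dropped.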

To resolve: Should allele frequencies be normalized for population? That is, should PC1 from the population-stratification PCA be included as a covariate in the deep learning model, or should the input counts be normalized a priori? The plan is to try both approaches. The model is learning both local and global effects, and it makes sense to correct for population frequencies for local effects but not for global effects.
