kundajelab / mpra

Deep learning MPRAs

SuRE-seq: Initial Deep Models for MPRA Regression and Classification #2

Closed rmovva closed 7 years ago

rmovva commented 7 years ago

26 June 2017

My first deep learning models finished training today! For the first pass, I trained a regression model using the 2.2M fragments that were positive across both SuRE replicates (Spearman corr = 0.34). I didn't curate a subsample of these 2.2M fragments, because (1) I ran IDR on this set, and as noted here (https://github.com/kundajelab/mpra/issues/1#issuecomment-310828047), there were no clear "signal" and "noise" clusters within the group (that is, the fragments that were most reproducible according to IDR did not have a better Spearman correlation than the rest), and (2) using fewer than 2M fragments would probably hurt performance given the diversity of the sequences (which are basically sampled i.i.d. from the genome).

The 2.2M fragments were split into 50K validation (chr18), 100K test (chr8), and 2M training (chrs 1-22 except chr18 and chr8). I've left out chrX and chrY for now, for no particular reason; I could probably use them. Since most of the sequences were shorter than 2000bp, which is the maximum size I used (a small number of fragments were >2kb, but I left them out), I centered the sequences and zero-padded the sides. The model has a Basset-like architecture (model summary is below) and is multitasked on average signal, replicate 1 signal, and replicate 2 signal.
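
For concreteness, here's a minimal sketch of the preprocessing described above (one-hot encode a variable-length fragment and center it in a zero-padded 2000bp window); the function and variable names are illustrative, not the actual pipeline code:

```python
import numpy as np

BASE_TO_IDX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot_center_pad(seq, target_len=2000):
    """One-hot encode a fragment and center it in a zero-padded window."""
    assert len(seq) <= target_len  # fragments >2kb are dropped upstream
    onehot = np.zeros((target_len, 4), dtype=np.float32)
    start = (target_len - len(seq)) // 2  # left offset so the fragment sits in the middle
    for i, base in enumerate(seq.upper()):
        if base in BASE_TO_IDX:           # ambiguous bases (e.g. N) stay all-zero
            onehot[start + i, BASE_TO_IDX[base]] = 1.0
    return onehot

x = one_hot_center_pad("ACGTTTGACTCAACGT")  # -> (2000, 4) array with the 16bp fragment centered
```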

I trained with an epoch size of 200K, batch size 500, and early stopping when performance hasn't improved for 5 epochs. Over a few runs, the model generally takes 6-7 epochs to optimize and then plateaus. I used Spearman correlation between predictions and signals as the primary metric. Validation performance, averaged across the three tasks, was Spearman corr = ~0.23 (training performance was 0.314). For the individual tasks, validation scores were 0.256, 0.126, and 0.301 for average signal, rep1 signal, and rep2 signal respectively.
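
For reference, the evaluation is just the per-task Spearman correlation averaged over the three outputs; a minimal sketch using scipy (array names are placeholders):

```python
import numpy as np
from scipy.stats import spearmanr

def per_task_spearman(y_true, y_pred, task_names=("avg", "rep1", "rep2")):
    """Spearman correlation between predictions and labels, per task and averaged."""
    scores = {name: spearmanr(y_true[:, t], y_pred[:, t]).correlation
              for t, name in enumerate(task_names)}
    scores["mean"] = float(np.mean(list(scores.values())))
    return scores

# e.g. per_task_spearman(y_val, model.predict(x_val))
# -> something like {'avg': 0.256, 'rep1': 0.126, 'rep2': 0.301, 'mean': 0.228}
```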

@akundaje Here are some questions/concerns I have based on these initial results:

  1. Can a better architecture improve performance nontrivially? Is a hyperparameter search worth doing?
  2. Among the three tasks, replicate 1 signal is by far the hardest to predict, with prediction Spearman correlation maxing out at approx 0.12 to 0.13. Replicate 2 signal is ~0.30. Would it make sense to just try and optimize performance on replicate 2, and remove replicate 1 from further analysis? Replicate 1 also has 96.5% zero signal vs. 85% zero signal in replicate 2, so it seems like replicate 1 went poorly for some reason.
  3. Given that I'm already getting 0.23 correlation with average and 0.30 correlation with replicate 2, and corr(rep1, rep2) = 0.34, how much more room is there for the model to improve? If 0.34 is the biological upper bound, I imagine there are some things the model cannot learn purely from sequence features.
  4. I think we should be more concerned about the quality of this data, especially for interpretation. I was talking to @jisraeli earlier today about interpreting models, and he showed me one of his simulations where DeepLIFT couldn't identify the true biological motif feature until the model had achieved AUROC 0.994 (there were still confounding factors at AUROC = 0.98). Given that we're going to see, at best, Spearman = 0.3-0.4 with this model, is it even worth spending time trying to interpret these models? Should I move to another dataset?
  5. Sequence strandedness: some of the fragments that were assayed came from the negative strand - I think this means they were in the opposite orientation when inserted into the plasmid. I'm using these sequences in the training set, but I think most of the negative-strand fragments had lower signal, since most promoter-driving regulatory grammars are on the positive strand, AFAIK. Also, I think negative-strand fragments may be less reproducible, and treating them as a separate category also prevents me from doing reverse-complement augmentation. So any insight on how to handle strandedness would be helpful.

Current model architecture: [image: screen shot 2017-06-27 at 3 13 59 pm] https://user-images.githubusercontent.com/17256372/27612438-56d814f0-5b4b-11e7-8c1d-21e8ae8272d8.png

akundaje commented 7 years ago

DeepLIFT's failure to pick up a motif is more often than not an issue with the reference rather than a matter of requiring extremely high accuracy. But it's also true that deep learning models in general seem to learn the minimal set of features needed to get the prediction right, so they won't always reveal redundant but biologically important features. I would still try some basic interpretation: e.g., run DeepLIFT, gradient*input, and ISM on top-scoring sequences, and interpret the filters as you are currently doing.

akundaje commented 7 years ago

Can you also plot scatter plots of predicted vs. true values for the training set and validation set?

-Anshul.

akundaje commented 7 years ago

I'm not sure I understand the negative strand issue. When they say negative strand, I'm guessing they mean the reverse and not the reverse complement. The reverse complement would give the same effect as the positive strand sequence; the reverse would not. So it shouldn't really affect using reverse complements in any way: you either use the + strand sequence and its rev. comp., or, if they say they used the - strand sequence, I believe that would mean you use the reverse of the + strand sequence and its rev. comp.
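
To make the distinction concrete, here's a toy illustration of reverse vs. reverse complement on a one-hot encoded sequence (assuming A/C/G/T column order; this isn't code from the repo, just an aid for the discussion):

```python
import numpy as np

def reverse_complement(onehot):
    """Reverse complement of a (length, 4) one-hot array with columns A, C, G, T."""
    # flipping the rows reverses 5'->3' order; flipping the columns swaps A<->T and C<->G
    return onehot[::-1, ::-1]

def reverse_only(onehot):
    """Plain reverse: base order is flipped but each base identity is kept."""
    return onehot[::-1, :]
```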

rmovva commented 7 years ago

After talking to Anshul and Peyton, these are the next steps to move forward:

  1. DeepLIFT / Grad*Input / ISM on a few top scoring sequences. See which elements pop out - TATA box, other consensus promoter grammars, TF binding sites, etc. Compare interpretation signal vs. genome browser tracks.
  2. Understand conv filters better - look at the 3-filter model. At some point, write the Basset PWM generation code, and integrate Peyton’s motif comparison code with momma_dragonn.
  3. Look at performance at GENCODE TSS regions: should be better, and hopefully will be correlated to RAMPAGE gene expression. Also look at the assay’s replicate correlation for these higher-quality fragments.
  4. Train model on only replicate 2 signal, since that's performing better than the average prediction task and the replicate 1 prediction task.
  5. Try overfitting to the training set, to confirm that there are learnable features in this dataset - use a big Basset-like model with no batchnorm, no dropout, and no regularization.
  6. Train a random forest on nucleotide composition features - global GC/AT content plus windowed compositional features (a minimal baseline sketch follows this list).
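
For item 6, a rough sketch of the kind of GC-content baseline I have in mind (scikit-learn random forest on global plus windowed GC fractions; the data variables, window count, and hyperparameters are all placeholders):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor

def gc_features(seq, n_windows=10):
    """Global GC fraction plus GC fraction in n_windows equal-width windows."""
    is_gc = np.array([b in "GC" for b in seq.upper()], dtype=float)
    feats = [is_gc.mean()]
    feats += [chunk.mean() if len(chunk) else 0.0 for chunk in np.array_split(is_gc, n_windows)]
    return feats

# toy stand-ins for the real SuRE fragments and (e.g. replicate 2) signals
seqs_train, y_train = ["ACGT" * 250, "GGCGC" * 180, "ATATAT" * 120], np.array([0.5, 2.3, 0.0])
seqs_val, y_val = ["GCGCGC" * 100, "ATTTTA" * 150], np.array([1.5, 0.2])

rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(np.array([gc_features(s) for s in seqs_train]), y_train)
print(spearmanr(y_val, rf.predict(np.array([gc_features(s) for s in seqs_val]))))
```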

Longer-term goals - i.e., what impactful contributions could an MPRA model add to the field?

  1. Are there any tricks that give large performance boosts? Transfer learning could help here. E.g., pretrain on RNAseq gene expression / RAMPAGE promoter strength, accessibility, other MPRA datasets, etc.
  2. Interpretation - lots of MPRA assays have been done, but there's still lots of space to improve on decoding regulatory grammars. Ideally, we'll propose new hypotheses, and actually apply accurate models to test these hypotheses. One cool application would be to train a decision tree on DeepLIFT / other induced features; the learned rules could be pretty interesting (a toy sketch follows this list).
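
As a toy sketch of the decision-tree idea in item 2 (everything here is hypothetical: in practice the feature matrix would come from aggregating DeepLIFT scores over motif hits or other induced features, which isn't shown):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# hypothetical induced features: one row per fragment, one column per feature
# (e.g. summed DeepLIFT score over SP1 hits, over AP1 hits, GC content, ...)
feature_names = ["deeplift_SP1", "deeplift_AP1", "gc_content"]
rng = np.random.default_rng(0)
X = rng.random((1000, len(feature_names)))                       # stand-in features
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=1000)  # stand-in activity

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=feature_names))  # human-readable rules
```
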
rmovva commented 7 years ago

Over the last week I've focused mostly on interpretation of models trained on the SuRE-seq dataset. Here are some broad updates and next steps for the project:

  1. Model performance. My regression model predictions were able to recapitulate replicate correlations when evaluated on a validation set. That is, spearmanr(rep1, rep2) = spearmanr(y_pred_validation, rep2) = 0.34. Notably, the model's prediction Spearman correlation was much better for replicate 2 than replicate 1 (0.34 vs. 0.14), so I stopped multitasking (on average, rep1, rep2; even predictions for average were less accurate than predictions for rep2) and trained only on replicate 2 signal.
  2. Filter interpretation. I used the PWM cross-correlation and Basset sequence-activation methods for conv filter interpretation (a sketch of the activation-based PWM procedure follows this list). Pretty much all the filters appeared to search for low-order sequence composition features (GC counting, etc.; some examples are shown below). I put the PWMs generated with the Basset method into TOMTOM, and there weren't any strong hits. However, since I had 100 filters, it's probably worth trying MODISCO to cluster some of them and then see if there are any matches to TF motifs.
  3. DeepLIFT sequence interpretation. Most of the fragments that were both (1) high-scoring and (2) accurately scored by the model were in retrotransposon elements (LTR12C). I compared DeepLIFT tracks for these elements to various accessibility/TF/chromHMM tracks on the WashU browser, and generally speaking there were few clear patterns or causal elements that popped out. That said, I should look more closely at per-base-resolution DeepLIFT scores (also grad*input, which I haven't checked yet) and see if there are relevant TF motifs, etc. I'm also going to look specifically at high-scoring and accurately predicted TSS/Enh chromHMM states, because those will probably be more interesting than these repeat elements.
  4. RAMPAGE regression. Data quality is better for fragments near GENCODE TSSs. Replicate Spearman correlation goes up to ~0.50, and model predictions are near that level as well (actually higher, 0.54). However, I tried predicting RAMPAGE TSS expression (ENCODE QC'd, from Thomas Gingeras) by taking the ~100bp TSS sequences plus ~100bp downstream and ~700bp upstream (because ~900bp is the average SuRE fragment length), zero-padding the sides, and putting them through the model, and those predictions only had Spearman 0.16 with RAMPAGE scores (n = 411; I only included TSSs on chr8 and chr18, which were held out from training). Even when I predicted genome-wide (so most of the TSSs would have been overlapped by some of the training SuRE fragments, except those on chrs 8 and 18), the Spearman was only around 0.27 (n = 8584). The assay, however, reports r^2 = 0.54 with GRO-cap and r^2 = 0.49 with CAGE (two analogues of RAMPAGE). So this is a little concerning. It would've also been nice to do some interpretation on these TSSs.
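
For reference (re: item 2), here's a minimal sketch of the Basset-style activation method as I'm using the term: collect the subsequences whose filter activation exceeds some fraction of that filter's maximum activation and build a position frequency matrix from them. Names and the threshold are placeholders, not the actual repo code:

```python
import numpy as np

def filter_pfm_from_activations(onehot_seqs, filter_acts, filter_width, frac=0.5):
    """Basset-style PFM for one conv filter.

    onehot_seqs: (n_seqs, seq_len, 4) one-hot sequences
    filter_acts: (n_seqs, seq_len - filter_width + 1) activations of this filter
    """
    threshold = frac * filter_acts.max()
    counts = np.zeros((filter_width, 4))
    for i, j in zip(*np.where(filter_acts > threshold)):
        counts += onehot_seqs[i, j:j + filter_width, :]  # add each strongly activating subsequence
    # normalize to per-position base probabilities (epsilon guards all-padding positions)
    return counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-8)
```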

Next steps: I'm planning to switch over, for the most part, to the Sharpr dataset after today. I still plan to do a few more analyses with SuRE, and I'll probably try some transfer learning / validation with the data later, but honestly the data quality issues prevent most of the interesting interpretation work. Hopefully the Sharpr dataset is more fruitful.

I think it would be interesting, however, to show that most of the variance in the SuRE dataset can be explained by patterns in GC content. The focus of their paper is analyzing which DNA sequences can initiate transcription, but they don't discuss the sequence elements creating those differences. It's hard for a deep model to find these elements without being very accurate (at least that's what I think right now, but I do plan to try a few more things with DeepLIFT). If a decision tree model trained on GC content alone can explain most of the biologically "true" (i.e., consistent across replicates) variance in autonomous promoter activity, that would be a very meaningful conclusion. So I'll see if I can get good performance on this dataset with something like XGBoost as opposed to a ConvNet.

Fig. 1: A selection of filters learned by the model (not cherry picked, most look something like this), visualized as PWMs by searching for sequences that strongly activate the filters. Top: TOMTOM "matches" - almost none of these are significant (pretty much all are E > 1), they are just there because it was easier to export the images that way. Bottom: Conv filter PWMs.

Some of these are clearly just sequence counters, and it makes sense that the model would learn at least some filters like this. However, I would also expect some to be actual TF motifs - some of these filters seem to have the "potential" to be motifs, but they just aren't; e.g., conv1_filter63, where many of the bases are close to probability 1.

Obvious sequence counters:

[image: logo 2] https://user-images.githubusercontent.com/17256372/28092578-c4ba09b2-6648-11e7-8bb5-9732c6151388.png

[image: logo 5] https://user-images.githubusercontent.com/17256372/28092597-d72bb226-6648-11e7-8adf-b40c964bde65.png

More nuanced filters, but none of them are high-confidence matches to known TF motifs:

[image: logo 3] https://user-images.githubusercontent.com/17256372/28092581-cd00f892-6648-11e7-9dc0-8fa743b7a4f1.png

[image: logo 1] https://user-images.githubusercontent.com/17256372/28092574-be2702da-6648-11e7-8bdd-fcdfdb48e90a.png

[image: logo] https://user-images.githubusercontent.com/17256372/28092602-e2fd6216-6648-11e7-81b8-496e650f0ba3.png

[image: logo 4] https://user-images.githubusercontent.com/17256372/28092608-ea63579a-6648-11e7-8998-592263d35738.png

akundaje commented 7 years ago

Very nice work. I'll discuss more with you once I'm back this week.

akundaje commented 7 years ago

Btw the TGAcTCA motif is an AP1 motif

Anshul

akundaje commented 7 years ago

The other novel motifs also look interesting. I think there is something there. More later.

Anshul

rmovva commented 7 years ago

Haha, sorry for not being clear about this - but for each one of the images, the PWM on top is the actual motif from the JASPAR database. The bottom ones are the model's filter PWMs. So e.g. the TGAcTCA motif is not from the model, it's a JASPAR motif. And none of these "matches" have very significant E-values, except the SPI1 GC-counting filter. Considering that, I'm assuming the filters are less interesting?

Regardless, I'm looking forward to discussing things soon; I'll be sure to do some additional interpretation (and hopefully will have some preliminary SHARPR results as well) by then.

rmovva commented 7 years ago

19 July 2017

Roadmap of remaining explorations with the SuRE-seq data:

  • GC model: I still would like to train a strong baseline model (e.g. XGB) on GC-related features to see if that can perform comparably to CNNs. I've stalled on this for way too long, but I'm definitely going to finish this by today.
  • Understanding nucleosome positioning: This doesn't have to be done now, but an idea for the future is to look into the actual sequence patterns that define nucleosome occupancy, as that's currently not well characterized (check the Eran Segal papers for the field's current understanding). Anshul mentioned doing a Fourier transform on the deepLIFT signal (see the toy sketch after this list); if that revealed any cool patterns / frequencies relating to promoter activity, that'd be super cool. My gut is that the deepLIFT tracks aren't accurate enough for that to be very viable yet, but it's something to think about.
  • Further deepLIFT analysis: In general, I didn't do a very thorough analysis of deepLIFT scores vs. biological signal for the SuRE data. Especially after seeing that predictions for the Pol2 state are much better than overall -- around 0.50 Spearman -- I wonder if anything will turn up. E.g. averaging deepLIFT scores for bases in SP1 or AP1 (promoter TFs) motifs and comparing them to scores for other bases, or doing differential motif discovery for high deepLIFT-scoring sequences vs. low-scoring ones. Basically, any systematic interpretation analysis that could show that the model learned something interesting. Other ideas would be great here.
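
For the nucleosome-positioning idea, a toy sketch of the Fourier-transform analysis (deeplift_scores is a placeholder per-base importance track; the point is just to look at the power spectrum for periodicities like the ~10.5bp helical repeat or ~200bp nucleosome spacing):

```python
import numpy as np

deeplift_scores = np.random.randn(2000)            # placeholder per-base importance track
scores = deeplift_scores - deeplift_scores.mean()  # remove the DC component
power = np.abs(np.fft.rfft(scores)) ** 2
freqs = np.fft.rfftfreq(len(scores), d=1.0)        # cycles per bp
periods = 1.0 / freqs[1:]                          # periods in bp (skip zero frequency)
top = periods[np.argsort(power[1:])[::-1][:5]]
print("strongest periodicities (bp):", np.round(top, 1))
```
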
Also, it's worth noting that the SuRE-seq data is probably confounded by the effects that are discussed in the new bioRxiv paper from Stark lab: http://www.biorxiv.org/content/biorxiv/early/2017/07/17/164590.full.pdf. TLDR is that the plasmid construct (pSMART) that's transfected into cells for their MPRA has a standard sequence called the origin-of-replication (ORI). The ORI, however, has strong core promoter elements itself, so most transcripts are actually initiated by it, not the candidate promoter sequences that are tested by SuRE-seq. It's likely, then, that the candidate fragments are acting as enhancers for the ORI, rather than autonomous promoters.

As I write this, I'm also realizing another confounder with the SuRE-seq data. They didn't rigorously quantify the representation of the various fragments in their input library; they glossed over this issue in the methods section, saying that it was too hard for a library of this complexity (150M different fragments). So the normalization for this assay (typically output RNA divided by input DNA; here it's output RNA divided by an inaccurate estimate of input DNA) is pretty sketchy. I didn't realize it before, but this is probably a significant issue: if fragments show up more in the input library (which is shared between replicates), they are naturally going to show up more in the output library, meaning that without robust normalization, a nontrivial fraction of the replicate-replicate correlation will be driven by input DNA count - a biologically irrelevant feature. This issue further decreases the effective replicate correlation.
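
A toy simulation of this concern (all numbers made up): if both replicates' raw output counts scale with a shared input library, the fragments' correlation across replicates can be substantial even with no biological signal, and only dividing by the true input counts removes it - which is exactly the normalization SuRE-seq can't do precisely.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 100_000
input_dna = rng.poisson(2, n) + 1  # shared, uneven input library representation
# identical "activity" for every fragment: output RNA depends only on input abundance
rna_rep1 = rng.poisson(input_dna)
rna_rep2 = rng.poisson(input_dna)

raw = spearmanr(rna_rep1, rna_rep2).correlation                             # inflated by shared input
normed = spearmanr(rna_rep1 / input_dna, rna_rep2 / input_dna).correlation  # ~0 with true normalization
print(raw, normed)
```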

akundaje commented 7 years ago

Definitely try this with grad*input as well.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kundajelab/mpra/issues/2#issuecomment-316498512, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI7EfwxowpaJES_RKMMJvHx-Dv6mBlyks5sPl9MgaJpZM4OHRkv .