kundajelab / mpra

Deep learning MPRAs

Deep Learning on the Sharpr Dataset #4

Closed rmovva closed 6 years ago

rmovva commented 7 years ago

SuRE-seq, the previous dataset I was working with, is fundamentally a promoter assay, and therefore it was hard to find many biologically relevant features other than GC content (which affects nucleosome occupancy, and therefore accessibility of the promoter to Pol2 etc.) and some semblance of a TATA box. That said, there are still a few more things I'd like to try with that data, which I outlined here: https://github.com/kundajelab/mpra/issues/2#issuecomment-316498512.

I've started working with the Sharpr dataset, which is an enhancer MPRA performed in HepG2 and K562 cells. An enhancer is placed upstream of two promoter types - minimal (minP) or strong (SV40P) - and downstream of the promoter is a barcode sequence uniquely mapping to the enhancer. The authors selected DNase peak regions in HepG2, K562, HUVEC, and H1-hESCs (DNase peaks from the latter two cell types are included to create diversity in the types of regions tested, since most of the HepG2 DNase peaks will be activating enhancers in HepG2 cells). 295bp regions are selected centered at the peak summits, and these 295bp regions are tiled with 145bp MPRA constructs, starting at 1bp-145bp, 6bp-150bp, ..., 151bp-295bp (31 constructs per region). Each of these 145bp constructs is used in the MPRA - there are 15720 regions total, so 15720*31 ≈ 487K tested fragments per (cell type, promoter) pair. Each experiment is done in replicate, so there are 8 experiments in total (K562_minP_rep1, K562_minP_rep2, K562_SV40P_rep1, HepG2_minP_rep1, etc.).
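
For concreteness, a small sketch (my own illustration, assuming 1-based inclusive coordinates) of how the 31 tile coordinates for one 295bp region are enumerated:

```python
# Tiling scheme described above: a 295bp region is covered by 145bp fragments
# whose start positions step by 5bp (1-145, 6-150, ..., 151-295).
REGION_LEN, TILE_LEN, STEP = 295, 145, 5

tiles = [(start, start + TILE_LEN - 1)
         for start in range(1, REGION_LEN - TILE_LEN + 2, STEP)]
assert len(tiles) == 31 and tiles[0] == (1, 145) and tiles[-1] == (151, 295)
```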

To start, I did some quality control to see if this data was much better than SuRE. Here are some figures showing replicate-replicate correlation for the four different experiments (2 cell types, 2 promoters).

For all of these plots, more red corresponds to more input plasmids for the fragment, and more blue corresponds to fewer. Generally, I'd expect red points to be more reproducible between replicates because there's more data corresponding to those fragments (this isn't obviously true from the plots, however).

First, correlation of RNA read counts between replicates (sorry for messy x-labels). The numbers in the top left of the plots are the Spearman correlations.

rnacountsreplicates

Looks pretty good, but a lot of that correlation is driven by the amount of plasmid DNA corresponding to each fragment -- i.e., in general, more input plasmids means more RNA output. So let's look at these correlations again, normalizing for the amount of input DNA for each fragment*.

normsignalsreplicates

The correlations are significantly worse here, as expected. It varies by experiment, but the biologically reproducible signal is about 0.40 on average. From the colors of the points here, it's unclear how the amount of plasmid DNA affects the correlation between replicates; there's just too much going on. What happens to these correlations if we look at only the fragments with >20 plasmid DNA reads (i.e., fragments that are nontrivially represented in the input library)?

normsignalswith20plasmids

Interestingly, the correlations go down for both cell types with the minimal promoter, but the correlations increase slightly for the SV40 promoter. Not sure why this is / if this is statistically significant, so let's look at the full curve of replicate Spearman correlation vs. the threshold of input plasmid DNA. I would expect these graphs to have a positive correlation; that is, reproducibility should generally increase with higher thresholds of plasmid counts. In this plot, point size is proportional to the number of fragments considered in the correlation at each threshold.

replicatecorrvsplasmidthresholds

Ok, good: these graphs have the positive correlation I expected. Not sure why the top two graphs have an initial dip, though.

Anyway, it's still unclear how we can use this information to potentially curate the training set. Let's instead plot the number of fragments corresponding to each point on the previous plot, so we can see if there's a plasmid threshold that allows for a reasonable number of training fragments with a significant boost to data quality (as measured by replicate Spearman correlation).

replicatecorrvsnumfragments

Looks like no. The relationship is basically linear, so any improvement in correlation will require a linearly proportional loss in the amount of training data. So there's no obvious curation to be done with this dataset. An approach to weight training examples based on replicate variance might help here? @akundaje @pgreenside do you think this would help at all?

I've started training without any curation / weighting / augmentation, and I'll post an update about how that's going later today as well.

*The exact transformation is log2(fold change), and to avoid divide by zero errors, a pseudo read count of 1 is added to both RNA reads and DNA reads. Also, the number of RNA and DNA reads are divided by the total number of reads for the relevant experiment, to get a fraction read count. So the actual quantity is log2((RNA + 1) / sum(RNA reads)) - log2((DNA+1) / sum(DNA reads)). [Note that this difference is equivalent to doing a log of the quotient of the values inside each of the logs, I just wrote it that way for readability.]
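
A minimal sketch of this transformation (the function name and the numpy usage are mine, not from the actual pipeline):

```python
import numpy as np

def log2_fold_change(rna_counts, dna_counts):
    """log2 of the pseudocounted, depth-normalized RNA/DNA ratio for each fragment."""
    rna_counts = np.asarray(rna_counts, dtype=float)
    dna_counts = np.asarray(dna_counts, dtype=float)
    rna = (rna_counts + 1) / rna_counts.sum()   # pseudocount of 1, then fraction of library
    dna = (dna_counts + 1) / dna_counts.sum()
    return np.log2(rna) - np.log2(dna)          # equivalent to np.log2(rna / dna)
```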

akundaje commented 7 years ago

You should talk to Joe about this. He spent some time looking into normalization.

Anshul

akundaje commented 7 years ago

Also, what normalization do they use in the Sharpr paper?

akundaje commented 7 years ago

Also, dividing by the total number of reads is not necessarily the best type of normalization. This has the same issue as RPKMs in RNA-seq being incomparable across conditions without additional normalization.

akundaje commented 7 years ago

I'll try to meet with you tomorrow to discuss.

rmovva commented 7 years ago

Yeah I'm currently doing the same thing Joe did (I talked to him and he sent me some of his work), which is computing the log fold change (RNA and DNA reads + pseudocount normalized by the total) and z-score normalizing each of the labels. Sharpr did this same normalization (except the z-norm).

What additional normalization needs to be done for RNAseq? I'm not totally familiar with the issues there. Would be great to discuss.

akundaje commented 7 years ago

Yeah, let's discuss in person. Using the total reads as the normalization factor is a bad idea in RNA-seq. We need to check if that is the case here as well. There are other easy ways to estimate better normalization factors.

rmovva commented 7 years ago

Initial performance on the Sharpr dataset is not looking great. I'm training on all chromosomes except for chr8 and chr18 (held out for validation and test respectively), so there are ~450K training examples. The labels are z-score normalized (this appears to be working a little better than predicting the log-fold change directly), and I'm doing direct multitasking on 12 different tasks: 4 biological conditions (2 promoters, 2 cell types), with 3 outputs each (replicate 1 + replicate 2 + average).

The model predictions are far less accurate than the replicate correlations. Namely, corr(rep1, rep2) = 0.47, 0.39, 0.37, 0.35 vs. corr(avg, y_pred) = 0.29, 0.10, 0.16, 0.16 (for the four conditions). I'm really not sure what's causing such a big disparity. With SuRE-seq, an out-of-the-box model with Basset architecture had correlation 0.30, and the replicate correlation was 0.34. So for some reason this dataset is much more difficult to learn. The Sharpr model is also overfitting quite a bit; at one point the average training predictions had correlation 0.38 (averaged across all tasks) with the true values compared to 0.12 for the validation set. Some of this accuracy can probably be explained by the tiled experimental setup. That is, many of the 145bp training examples share large parts of their sequence with other examples (since the 145bp fragments tile 295bp regions at 5bp intervals), making predictions easier for fragments that have already been partially seen. It's worth noting that this observation probably inflated the performance for Joe's model; they didn't use hold-out chromosomes for the val/test sets, making those predictions much easier than they really are.
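
As a point of reference, a rough sketch of the chromosome hold-out split described above (the column name and pandas usage are my assumptions):

```python
import pandas as pd

def chromosome_split(fragments, val_chrom="chr8", test_chrom="chr18"):
    """Hold out whole chromosomes so overlapping 145bp tiles never straddle the train/val/test split."""
    test = fragments[fragments["chrom"] == test_chrom]
    val = fragments[fragments["chrom"] == val_chrom]
    train = fragments[~fragments["chrom"].isin([val_chrom, test_chrom])]
    return train, val, test
```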

Here are a few trivial thoughts I have to improve performance:

  • I'm currently running a hyperparameter search and that could help a little, but I imagine not significantly. Might add ~0.02 to the Spearman.
  • Reverse complement augmentation is probably also worth trying; I imagine this will also add ~0.02 (as it did for the SuRE dataset).
  • Use separable FC layers instead of standard ones.

Here are some "bigger" ideas I have that may help more significantly (or not at all, as most of this is uncharted territory for me):

  • Stop multitasking the different cell types and/or promoters. It definitely feels like different cell types could activate in response to different TFs / other cell-type-specific signals, and only leaving the last layer open to tuning cell-type specificities seems pretty limiting. It's worth giving singletasking (each celltype+promoter pair) a try here, or using some type of soft multitasking, etc.
  • Transfer learning from accessibility models. @pgreenside has sent me the weights for one of her universal accessibility models, though I haven't yet thought about how I'll adapt the weights from that architecture to use in my model architecture. I guess I can start by loading the conv filter weights from that model and training the dense layers on the MPRA data.
  • I think multitasking with ChIP-seq signal for a bunch of different important TFs with ENCODE data in HepG2 and K562 could be pretty helpful as well. That way there'd be a stronger force for the model to learn biologically interpretable filters (motifs). Or along those lines, I could try initializing some of the conv filters to known motif PWMs. Has filter initialization helped anyone in the past for binding / accessibility models? @jisraeli maybe you've tried this before?
  • Add other feature tracks? Even though MPRAs are experimentally only dependent on sequence / gene expression, DNase or ATAC might add some cell-type-specific information to boost predictions.

akundaje commented 7 years ago

Let's meet at noon today

Anshul

rmovva commented 7 years ago

Here are some ideas discussed with Anshul today:

  1. Normalization: Normalizing read counts by the total number of reads in the experiment is not very robust to outliers; usually a few datapoints will dominate. Instead, use something like quantile normalization or DESeq's library size factors (a rough sketch of both is included after this list). I'll post some quantile-normalized replicate-replicate scatter plots soon.
  2. Strong baseline: Usually the deep learning models won't be able to significantly outperform things like XGBoost, so train a quick baseline model using e.g. windowed motif hit features or a support vector regression with string kernel.
  3. Improving DL performance: A few different ideas.
    • The different tasks might require some very different features, especially HepG2 vs. K562 cell types. So try training w/o multitasking.
    • Freeze convolutions to known motif PWMs (though some transformation needs to be applied to go from PWM --> conv filter, since they are on different scales)
    • Transfer learn from accessibility models by freezing conv filter weights and finetuning in subsequent layers
    • Multitask MPRA output with ChIP / DNase
    • Weight examples based on replicate concordance
  4. CNN-RNN architecture: An RNN could help model the sequential nature of the tiled fragments. Basically, each example would be one of the 16K regions of 295bp, and there would be a bidirectional recurrent layer after 1-2 convolutional layers. The RNN would predict the MPRA output for each 145bp, and we could multitask with single bp resolution DNase values etc. Though deepLIFT would be difficult, we could still do ISM for interpretation here.
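
As a reference for point 1 above, a rough sketch (my own, not Joe's code or the DESeq implementation) of the two alternatives mentioned: rank-based quantile normalization across replicates, and DESeq-style median-of-ratios size factors:

```python
import numpy as np
from scipy.stats import rankdata

def quantile_normalize(mat):
    """Force each column (e.g. each replicate's log-fold changes) onto the same distribution."""
    ranks = np.apply_along_axis(rankdata, 0, mat)      # average ranks handle ties
    reference = np.sort(mat, axis=0).mean(axis=1)      # mean of the sorted columns
    grid = np.arange(1, mat.shape[0] + 1)
    return np.column_stack([np.interp(r, grid, reference) for r in ranks.T])

def deseq_size_factors(counts):
    """DESeq-style median-of-ratios size factors; counts is (fragments x samples)."""
    counts = counts[(counts > 0).all(axis=1)].astype(float)     # drop fragments with any zero count
    log_geo_mean = np.log(counts).mean(axis=1, keepdims=True)   # per-fragment geometric mean (log space)
    return np.exp(np.median(np.log(counts) - log_geo_mean, axis=0))
```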

rmovva commented 7 years ago

These are the quantile normalized replicate-replicate plots. I wasn't 100% sure at which stage to quantile normalize (i.e. on raw counts or log-fold change?), but I did the following:

These replicate Spearman correlations are a little better on average. For the minP conditions the correlation goes down ~0.01, K562_SV40P goes up ~0.01, and HepG2_SV40P goes up 0.07.

@akundaje Other than Spearman correlation, what should I look at to determine if this normalization was better than dividing by total read count?

rmovva commented 7 years ago

I compared several different methods for normalization, and none of them seemed to significantly improve data quality. This suggests that the correlations are just biologically weak, and there's not much we can do to fix that. The goal, then, is to design models that approach that correlation.

I've decided to just go with quantile normalization of the DNA to RNA log-fold changes, as this seems to make the most logical sense. Here are plots of (1) the distributions of quantile-normed log-fold changes (note that the distributions are basically the same, as expected after quantile normalization) and (2) scatter plots of the quantile-normed log-fold changes for each replicate (as in the other plots, points are colored by input DNA read count - red is higher).

quantilenormedlfcdistributions

quantilenormedlfcjointplotreplicatecorrelations

Will be posting some results soon.

rmovva commented 7 years ago

Some updates on my work, as well as ideas/thoughts for the future:

  1. HOMER results on Sharpr: To confirm that there are learnable motifs in the Sharpr dataset, I ran HOMER using the top 20K activating sequences vs. 50K background sequences with the lowest magnitude of activity (I also performed similar analyses using the 20K most repressive sequences, and each of these experiments was done in both K562 and HepG2 for both promoter types). Overall, things look pretty good and match prior knowledge. The top motifs for K562 are things like ETS, AP1, and NRF, whereas the known liver TF HNF4G pops up at the top of the HepG2 results (with a reasonable fraction of sequences containing it). So this step definitely confirms that learnable motifs are present in the dataset.

An example HOMER results txt file is attached (for the K562_minP activating sequences): knownResults.txt

  2. Deep learning models: I experimented with many different architectures (randomly searched space with number of conv layers ranging from 1 to 5, filter lengths ranging from 4bp to 20bp, n_filters ranging from 25 to 300, number of FC layers from 1 to 3, separable FC layers, etc.). What ultimately performed best was a pretty simple model: 3 conv layers with 120 filters each, with filter lengths of 5bp, and no dense layers other than the final regression layer (a minimal Keras sketch in this spirit is included just below). The models are predicting 12 tasks: 2 cell types (K562, HepG2), 2 promoter types (minimal promoter, strong promoter 'SV40P'), and 3 "experiments" (replicate 1, replicate 2, average). The labels are z-normalized log-fold changes without any between-sample normalization, as this seemed to give optimal Spearman correlation (same Spearman as with quantile normalization). Average Spearman correlation across all tasks is 0.191, with substantial differences between tasks. Spearmans were as follows:

Averaging across the four tasks, the model captures about ~50% of the biological variance in the data.
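
For reference, a minimal Keras sketch in the spirit of that architecture (three conv layers of 120 length-5 filters, then a single linear layer over all 12 tasks); the flatten step, optimizer, and loss are my assumptions rather than the exact configuration used:

```python
from keras.models import Sequential
from keras.layers import Conv1D, Flatten, Dense

def build_sharpr_model(seq_len=145, n_tasks=12):
    """3 conv layers of 120 length-5 filters, then one linear regression layer for all 12 tasks."""
    model = Sequential()
    model.add(Conv1D(120, 5, activation="relu", input_shape=(seq_len, 4)))
    model.add(Conv1D(120, 5, activation="relu"))
    model.add(Conv1D(120, 5, activation="relu"))
    model.add(Flatten())
    model.add(Dense(n_tasks))   # final regression layer, one output per task
    model.compile(optimizer="adam", loss="mse")
    return model
```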

I'm especially surprised at how much worse the K562 strong promoter predictions are compared to K562 minimal promoter (this result does match Joe Paggi's paper; their reported MSE was 0.92 for this condition vs. ~0.7 for the other three conditions*). I imagine this has something to do with the findings of systematic error in the recent Stark lab MPRA paper. Specifically, in light of the first point in that paper, the models are likely having trouble predicting which transcribed fragments initiate from the ORI vs. from the SV40P promoter. Even in cases where the initiation point is biologically consistent across replicates, the model does not have the necessary information to learn that, so it essentially sees that variation across different fragments as noise. Whereas for minP, it is likely that all transcripts just initiated at the ORI, since minP is quite weak in comparison. The initiation of transcripts at the ORI instead of SV40P could have unknown effects on read quantification.

*Note, their results seem much better but that's probably mostly because they didn't use a properly chosen validation set.

Btw @akundaje have you gotten a chance to email Alex Stark about releasing/sharing the processed data for the corrected HeLa MPRA?

  3. Optimizing deep models: Because the models are only achieving ~50% of the biological replicate Spearman correlation, I spent quite some time experimenting with different optimization techniques. The goal here was to try and get better predictions before moving into interpretation, so that interpretation would be more accurate. Here are the things I've tried and ideas I have yet to try. Suggestions are appreciated:

    • Reverse complement training set augmentation --> ~0.02 Spearman improvement.
    • Z-score normalized, raw log-fold changes as labels instead of total read normalized log fold changes as labels --> ~0.01 Spearman improvement.
    • Multitasking for each individual celltype/promoter condition instead of on all 12 labels (e.g. 3 tasks, K562_minP_rep1, K562_minP_rep2, K562_minP_avg) --> No Spearman improvement (went down ~0.02).
    • Multitasking for each individual celltype (e.g. 6 tasks, 3 for K562_minP and 3 for K562_SV40P) --> No Spearman improvement (went down ~0.01)
    • Pretraining MPRA model on Peyton's hematopoiesis accessibility data. To do this, I had to reduce the 1000bp inputs to 145bp inputs (extracted five 145bp intervals from each of the 1000bp sequences; all five 145bp sequences have the same labels as their parent 1000bp region), which hurt the performance of the model quite a bit. Also, for accessibility models, larger conv filters are probably better, whereas I was using 5bp conv filters to match the shape of my model's weights. So after pretraining, the model was getting ~0.45 AUPRC as opposed to Peyton's ~0.60 AUPRC. Anyway, I used these pretrained weights (excluding the final dense layer) to initialize my model. This led to no Spearman improvement (went down ~0.02). For this idea, Peyton mentioned that it might make sense to just freeze the conv filters from the accessibility model to use them as motif extractors; in that case I would change my first layer to use 17bp conv filters. I can try this and see if it helps, but I would still imagine it wouldn't work amazingly b/c of the 145bp input sequence constraint.
    • Weighting each example by 1 / (replicate variance). Have not tried this yet; should be a priority.
    • Add some more tasks to multitask with. Probably high-confidence binary labels for ChIP-seq peaks would make the most sense here (145bp seems just too short for accessibility, but I could add those labels as well). The only issue is how to simultaneously do classification and regression, which would also require two separate loss functions; there's probably a way to do this with the Keras functional model API (see the sketch after this list), but I'm not sure it's worth the time investment. @akundaje thoughts?
    • Sequence-to-sequence hybrid CNN-RNN architecture. I mentioned this in one of my previous updates, here's a snippet: Basically, each example would be one of the 16K regions of 295bp, and there would be a bidirectional recurrent layer after 1-2 convolutional layers. The RNN would predict the MPRA output for each 145bp, and we could multitask with single bp resolution DNase values etc. Again, this would require a pretty large time investment and I'm not sure if it makes sense to work on now. Could revisit this down the road.
  4. First-Pass Interpretation: Peyton and I discussed some sanity checks to confirm that the models are learning relevant features. Here are some of those ideas:

  5. Looking Forward - Systematic Interpretation Analyses: I have several ideas for genome-wide analyses that we can perform using the per-base level deepLIFT scores, if the sanity checks mentioned in (4) work out. Basically every analysis that the Sharpr paper did with conservation, TF motifs, etc., we should be able to reproduce genome-wide. Adding on to that, we can look at things like variant scoring / QTL discovery. Another cool thing would be to test hypotheses on computationally designed enhancer sequences if we can first show that the per-base scores are functionally relevant.
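
For the ChIP-seq multitasking bullet in item 3, a rough sketch of mixing regression and classification heads with the Keras functional API (the number of ChIP tasks and the loss weights are placeholders, not values from this project):

```python
from keras.models import Model
from keras.layers import Input, Conv1D, Flatten, Dense

def build_joint_model(seq_len=145, n_mpra_tasks=12, n_chip_tasks=10):
    """Shared conv trunk with two heads: MPRA regression (MSE) and ChIP peak classification (cross-entropy)."""
    inp = Input(shape=(seq_len, 4))
    x = Conv1D(120, 5, activation="relu")(inp)
    x = Conv1D(120, 5, activation="relu")(x)
    x = Flatten()(x)
    mpra_out = Dense(n_mpra_tasks, name="mpra")(x)
    chip_out = Dense(n_chip_tasks, activation="sigmoid", name="chip")(x)
    model = Model(inputs=inp, outputs=[mpra_out, chip_out])
    model.compile(optimizer="adam",
                  loss={"mpra": "mse", "chip": "binary_crossentropy"},
                  loss_weights={"mpra": 1.0, "chip": 0.25})
    return model
```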

akundaje commented 7 years ago

Great update. Will contact Stark. Btw, SPI1 is not the expected primary motif in K562 enhancers; TAL1 and GATA motifs are better candidates. The HOMER results on K562 are a bit strange in that they don't highlight these motifs. Do you have the full tables?

Also, for multitasking with TF ChIP-seq, you can just convert those tasks to regression problems as well, using the peak scores.

akundaje commented 7 years ago

Btw, are you still using quantile normalization on the RNA counts only? That doesn't make much sense. You should either do it on the log RNA/DNA, or do the RNA and the DNA separately (but do both) and then obtain the log fold change.

akundaje commented 7 years ago

You can try quantile normalizing the log counts

rmovva commented 7 years ago

  1. Yeah, I might try looking in GATA / TAL peaks as well. Regarding HOMER, GATA was definitely enriched (p = 1e-34, in 2.66% of foreground sequences vs. 1.23% of background), though TAL was not (in ~10% of both the foreground and the background). I imagine if I had chosen a more stringent background (e.g. 10K sequences with the lowest absolute activity instead of 50K, or the most repressive sequences), these motifs would show up higher. For GATA in particular, the Sharpr paper found that it was more active in K562 than HepG2, but there were other more active TFs in both cell types. TAL1 was not mentioned in the paper. In general, the HOMER results mostly agreed with the paper. I attached the full results for K562 minP in my previous post. I can put all the results (for HepG2 + repressive conditions as well) on a Google spreadsheet later today.

  2. For ChIP-seq multitasking, could I use average bigwig scores for the 145bp regions, since some of the fragments won't overlap peaks? I could use the nearest peak score, but wouldn't averaging the per-base signal be better?

  3. For normalization, I'm currently just computing the log-fold change (log RNA/DNA) for each replicate independently, and then z-score normalizing (there is no explicit between-sample normalization step). This setup results in the exact same Spearman correlation as quantile normalizing the log-fold changes across replicates; after all, both methods are monotonic transforms that preserve rank order. The z-normalization step also puts both replicates on roughly the same distribution, like quantile normalization. For those reasons I don't see a major reason to prefer one method over the other. I ended up using log-fold changes without quantile normalization, since this allowed me to pool the replicates instead of averaging the LFCs. There's probably a way to do something similar with quantile normalization here (e.g. by quantile normalizing the log counts as you mentioned), but it doesn't seem necessary/helpful.

rmovva commented 7 years ago

I've spent most of the last couple weeks working on posters/presentations as the SIMR program ended last Thursday, but another priority has been interpretation of the Sharpr models. Overall, things are looking quite promising - I am finding many activating TF motifs in the accurately-predicted fragments. In the coming days, I plan to do some more systematic analyses to further validate the models' importance scores and uncover regulatory grammars that the model discovered.

On that note, here are some notes of ideas I had:

  1. Run MODISCO with phenograph clustering method (I tried running MODISCO with k-means on the tSNE embedding of the seqlets - it did not look good. Aside: I'm thinking that maybe I should go back and look at the SuRE-seq models for more interpretation, since I mostly just ran through the MODISCO tutorial which uses tSNE - which I'm now learning is not a particularly effective method.)
  2. Look at overlap of model’s top deepLIFT nucleotides with CENTIPEDE annotations.
  3. For each TF motif, convolve its PWM with the deepLIFT tracks for each of the Sharpr sequences, and take the top PWM*deepLIFT convolution for each sequence -- so we know which motifs were responsible for each prediction (a small sketch of this scan is included after this list). With this information we can look for the top TFs by cell type etc. This should give a similar plot as in the Sharpr-MPRA paper. Also, we can build a correlation matrix of TFs that should hopefully correlate with known cooperative TF-TF interactions.
  4. Heatmap with joint in-silico mutagenesis - this idea has been discussed in labmeetings a few times (e.g. Artur's HiC model, Daniel's GGR work)
  5. Look at correlation of deepLIFT scores with e.g. eQTLs / tfQTLs, but not (as significantly) dsQTLs / hQTLs, since the latter two are aspects of chromatin that aren't really present in MPRAs. (what about methylation QTLs? is methylation a factor for MPRA sequence constructs?)
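
For item 3 above, a rough numpy sketch of scanning one PWM against a per-base deepLIFT track (the shapes and the max-over-positions summary are my assumptions):

```python
import numpy as np

def pwm_deeplift_score(pwm, deeplift_track):
    """Slide a PWM (motif_len x 4) along a deepLIFT track (seq_len x 4); return the best window score and its start."""
    motif_len = pwm.shape[0]
    n_windows = deeplift_track.shape[0] - motif_len + 1
    scores = np.array([np.sum(pwm * deeplift_track[i:i + motif_len]) for i in range(n_windows)])
    return scores.max(), int(scores.argmax())
```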

akundaje commented 7 years ago

All the QTL analyses are going to only sorta work (mostly agreement of direction of effect rather than strong correlation of effect sizes) because all of those effects are in-vivo and very noisy. You may also want to compare to allele-specific effects for chromatin accessibility and histone marks at SNPs in peaks.

rmovva commented 7 years ago

Another area I wanted feedback on was how to improve the model's performance, particularly in predicting the extreme values in the dataset.

The first scatter plot here shows the two biological replicates for the K562 minP task (all the data is z-transformed). While the data is certainly not clean, the two replicates at least show a wide dynamic range. The second scatter plot, however, shows predictions (y-axis) vs. experimental values (x-axis; averaged across the two replicates). The Spearman correlation is significantly lower than the correlation of the biological replicates, and more importantly, the model does a very poor job of recapitulating the full activity range of the MPRA sequences. All of the predictions lie between approximately 0 and 2; the model is unable to accurately predict any of the repressive sequences or any of the strongest activator sequences.

[image: K562 minP replicate 1 vs. replicate 2 scatter plot]

[image: model predictions vs. averaged experimental values for K562 minP]

The most obvious diagnosis is that the majority of the data lies between 0 and 2, so the model can achieve good performance by optimizing its predictions for those sequences alone. One possible fix is to upweight the extreme examples, though I wonder whether that would decrease the Spearman correlation.

Another possible explanation is that the repressive sequences in particular are fairly inconsistent across replicates. To address this, I tried assigning each example a weight proportional to sigmoid(1 / (rep1 - rep2)). However, this improved neither the Spearman correlation nor, visibly, the model's dynamic range. I haven't explored this thoroughly, though.
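
A minimal sketch of both weighting ideas (upweighting extreme values, downweighting replicate-inconsistent examples), assuming z-scored labels, per-example replicate measurements, and a Keras-style model; the constants, the use of |rep1 - rep2|, and the way the two weights are combined are illustrative choices rather than the scheme actually tried above:

```python
# Sketch: per-example sample weights that (a) grow with |label| to emphasize extreme
# activity values and (b) shrink as the two replicates disagree. Scaling constant and
# multiplicative combination are illustrative assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def example_weights(y_mean, rep1, rep2, extreme_scale=1.0, eps=1e-6):
    extreme_w = 1.0 + extreme_scale * np.abs(y_mean)             # (a) emphasize the tails
    consistency_w = sigmoid(1.0 / (np.abs(rep1 - rep2) + eps))   # (b) trust reproducible examples
    return extreme_w * consistency_w

# weights = example_weights(y_train, rep1_train, rep2_train)
# model.fit(X_train, y_train, sample_weight=weights)   # Keras fit() accepts sample_weight
```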

I might also try multitasking with e.g. REST ChIP-seq to force the model to learn repressive motifs.

It would be great to hear other potential explanations for (and solutions to) the model's failure to accurately predict repressive and strongly activating sequences.

I've also been trying some XGBoost models on this dataset, using k-mer counts for k = 1 through 6 as the features (4^1 + 4^2 + ... + 4^6 = 5460 features in total). This performs nearly as well as the deep learning model, with a Spearman correlation of approximately 0.30, and it does slightly better at capturing the dynamic range of values. XGBoost's performance suggests that decision trees could work quite well for MPRAs, so I may want to try Anshul's idea of training a small number of trees on e.g. deepLIFT score features, to see if the trees can learn some interpretable grammars.
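
As a rough sketch of that baseline, here is one way to build the 5460-dimensional k-mer count matrix and fit an XGBoost regressor; the hyperparameters are illustrative, not the settings actually used:

```python
# Sketch: k-mer count features (k = 1..6, 5460 columns) fed to an XGBoost regressor.
# Assumes sequences are uppercase ACGT strings; k-mers containing other letters are skipped.
from itertools import product
import numpy as np
import xgboost as xgb

def kmer_features(seqs, kmax=6):
    """Count occurrences of every k-mer (k = 1..kmax) in each sequence."""
    kmers = [''.join(p) for k in range(1, kmax + 1) for p in product('ACGT', repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}            # 5460 columns for kmax = 6
    X = np.zeros((len(seqs), len(kmers)), dtype=np.float32)
    for row, seq in enumerate(seqs):
        for k in range(1, kmax + 1):
            for start in range(len(seq) - k + 1):
                col = index.get(seq[start:start + k])
                if col is not None:                              # skip k-mers with e.g. N
                    X[row, col] += 1
    return X

# X_train = kmer_features(train_seqs); X_test = kmer_features(test_seqs)
# model = xgb.XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
# model.fit(X_train, y_train)
```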

akundaje commented 7 years ago

The ranking loss that Johnny has been using is a good fit for this problem. There, you sample pairs of points and learn to rank them correctly. The nice thing is that you can restrict sampling to pairs whose relative ranks are reliably well separated, rather than considering all possible pairs, and you can also weight the loss for misranking a pair according to how reliable its relative ranks are. Talk to him about these loss functions.

-A
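
To make the idea concrete, here is a minimal sketch of such a pairwise ranking setup; the margin, the pair-sampling threshold, and the reliability weighting are illustrative assumptions, not Johnny's actual implementation:

```python
# Sketch: pairwise margin ranking loss with reliability-weighted pairs, written against
# TensorFlow/Keras. Pairs (hi, lo) are sampled so that hi's label exceeds lo's by at
# least `min_gap`, and each pair is weighted by the replicate consistency of its members.
import numpy as np
import tensorflow as tf

def pairwise_ranking_loss(score_hi, score_lo, pair_weight, margin=0.5):
    """Hinge penalty unless the 'high' example outscores the 'low' one by >= margin."""
    return tf.reduce_mean(pair_weight * tf.nn.relu(margin - (score_hi - score_lo)))

def sample_reliable_pairs(y, rep_diff, n_pairs, min_gap=1.0, seed=0):
    """Sample (hi, lo) index pairs whose labels differ by at least `min_gap`."""
    rng = np.random.RandomState(seed)
    cand = rng.randint(0, len(y), size=(n_pairs * 20, 2))       # oversample, then filter
    first_higher = y[cand[:, 0]] >= y[cand[:, 1]]
    hi = np.where(first_higher, cand[:, 0], cand[:, 1])
    lo = np.where(first_higher, cand[:, 1], cand[:, 0])
    keep = (y[hi] - y[lo]) >= min_gap
    hi, lo = hi[keep][:n_pairs], lo[keep][:n_pairs]
    weights = 1.0 / (1.0 + np.abs(rep_diff[hi]) + np.abs(rep_diff[lo]))
    return hi, lo, weights.astype(np.float32)
```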


akundaje commented 7 years ago

Another thing to try is projecting the quantiles of the observed labels onto a Gaussian, i.e. quantile normalizing the labels against a standard normal distribution. That kind of label distribution is usually easier for a squared-error loss to learn.
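
A minimal sketch of that transform (a rank-based inverse normal transform; the (rank - 0.5)/n offset convention and the scipy-based implementation are assumptions):

```python
# Sketch: map label ranks onto a standard normal distribution, preserving the ordering
# of the labels while making their marginal distribution Gaussian.
import numpy as np
from scipy.stats import norm, rankdata

def quantile_normalize_to_gaussian(y):
    ranks = rankdata(y, method='average')        # ranks 1..n, ties averaged
    return norm.ppf((ranks - 0.5) / len(y))      # standard-normal quantiles

# y_train_gauss = quantile_normalize_to_gaussian(y_train)
# model.fit(X_train, y_train_gauss)
```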


rmovva commented 7 years ago

Yeah, I was talking to Johnny about his ranking loss techniques for this purpose today. He suggested an interesting setup that I might want to try in the future, but it seems nontrivial to implement, so I don't want to spend too much time on it right now.

Your second suggestion sounds good -- I think you're now the third person who's recommended it to me :) I'll try it and report back.

rmovva commented 7 years ago

So I tried projecting the labels onto the standard normal, and it didn't seem to increase the dynamic range of the prediction distribution. It didn't really improve the Spearman correlation either.

I might try upweighting the extreme datapoints so the model prioritizes them more. But it does seem like there's a fundamental problem with using a squared-error loss here: the model can predict near the mean and still do pretty well.

[image: predictions vs. experimental values after quantile-normalizing the labels]