When will the version supporting indels come out? Why input 17-bp sequence when you train the model with 11bp sequence?

yangyxt commented 4 years ago

Instead of inputting a 17-bp sequence flanking an SNV, can we build another format to suit for Indels?

Can we input a table of two columns while the left column storing WT sequences and the right column storing MUT sequences?

I thought you are just counting the 6-mer haplotypes and implement the count data into the linear regression to make the prediction?

vincentiusmartin commented 4 years ago

QBiC doesn’t support indel since it computes linear combination of the OLS coefficient estimates for all 6-mers overlapping SNP of interest. This leads the need to use 11bp sequece to get all 6-mers. However, we also use PBM E-score to give the binding status change prediction (e.g. the SNP change the TF from bound > unbound) which requires 8-mers model overlapping the SNP—this leads to 17bp sequence input. Unfortunately, right now, we don’t have a way to support indels.

Right now, we don’t have the feature for two columns WT and MUT. But we support text file that contains 17-mer DNA sequences with the "mutated from" nucleotide in the middle and the "mutated to" nucleotide on the right, separated by a space or tab character. See example here: qbic.genome.duke.edu/download/QBiC-sequence-format-example-ELK1_17mers.txt . However, this shouldn’t be hard to implement, I will mark this as a TODO.

Yes, we are counting the 6-mers and using the count with the OLS coeffcient to make the prediction. All 6-mers overlap the SNP of interest which is the reason we can only calculate TF binding changes from SNPs.

More responses below from our lab PI: When will the version supporting indels come out?

We do not have a date for this. While adapting our current OLS approach to indels seems simple conceptually, there are statisical and practical issues that one needs to consider. E.g. when computing the difference between sums of beta coefficients, in the case of indels we have different numbers of 6mers to consider for the two variants; there are ways to deal with this issue, but we haven't explored and tested them enough.

Why input 17-bp sequence when you train the model with 11bp sequence?

As Vincentius replied, we use the 17-bp sequences to make predictions about changes in binding status, while the 6-mer OLS models (for which 11mers suffice) are used to assess the quantititave change in TF binding. We note that while the OLS models are great at predicting quantitative changes, they are sometimes also sensitive enough to predict changes in non-specific binding, which may not be of interest to users. From the OLS models alone, it is not possible to compute universal cutoffs/criteria for specific vs. non-specific binding. But this can be done using universal PBM data, and in particular PBM 8-mer E-scores. We know, from negative control experiments, that an E-score cutoff of 0.4 is a good criteria for "specific" TF binding, as we rarely see E-scores of 0.4 or above in PBM experiments for non-DNA binding proteins. Similarly, a sequence with all 8-mer E-score < 0.35 can be reasonably considered "non-specific", again according to independent PBM experiments where we compare such sequences to random DNA or to known non-specific sites. We are preparaing a manuscript on binding site calling that will have more details on these criteria.

Instead of inputting a 17-bp sequence flanking an SNV, can we build another format to suit for Indels?

The problem with indels is not really about the format. As mentioned above, we haven't thoroughly tested the OLS approach for indels yet.

Can we input a table of two columns while the left column storing WT sequences and the right column storing MUT sequences?

In theory, yes, this should be possible. Vincentius added it to the TODO list.

I thought you are just counting the 6-mer haplotypes and implement the count data into the linear regression to make the prediction?

I'm not sure what you mean by "implement the count data into the linear regression". In the linear regression, we are indeed using the counts of 6-mers. We learn their coeffcients, and then use those coefficients (and their variances) to make predictions.

yangyxt commented 4 years ago

Thanks for the info! But I'm a bit confused now.

The model is trained on 6-mers to get coefficients and when we input WT and MUT sequences, it summarizes the difference in counts of all 6-mers in an 11-bp window and stores them into a vector c, which leads to delta S.

Then, using the distribution of normalized intensities across all the observations(BTW, how you defined a single observation? several neighborhood probes correspond to one fluorescent signal spot, one signal spot can be treated as one observation because the output signal intensity is certain?) to calculate the p-value and z-score.

The E-score, according to the article you published in 2006(Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities), is rank-based. I did not get the whole details of how it is calculated but I get how you rank the 8-mers by looking at Figure2.a in your 2006 publication. Using the count of k-mers with E-scores >0.45 to determine whether an L-bp sequence can bind with a given TF.

Therefore, the 6-mer generated delta S is just used to calculate the z-score and p-value to show the confidence of the change while the abs value of delta S is not used for classification. Instead, with the training PBM data, you know each 8-mer's E-score given a TF. With the input of variants, you can fetch 17bp sequence with the SNV at the middle from the hg19/hg38 ref genome. And you know the changes of the counts of each 8-mer with E-score. Based on that, you can classify the 17bp sequence bount/unbound to a TF.

If what I stated above is correct to you, why use two different systems to calculate p-value and make classifications, respectively. Why not train data with 8-mer in a 17bp window ?

Really appreciate if you can respond to this. Thanks!

vincentiusmartin commented 4 years ago

The model is trained on 6-mers to get coefficients and when we input WT and MUT sequences, it summarizes the difference in counts of all 6-mers in an 11-bp window and stores them into a vector c, which leads to delta S. Correct.

Then, using the distribution of normalized intensities across all the observations(BTW, how you defined a single observation? several neighborhood probes correspond to one fluorescent signal spot, one signal spot can be treated as one observation because the output signal intensity is certain?) to calculate the p-value and z-score. Indeed, the normalized intensities over all probes are used to compute the regression coefficients and their variances, which are then used to calculate p-values and z-scores for linear combinations of coefficients. The way these universal PBM arrays work is that each 60-mer correspons to one spot on the array. That spot contains millions of identical DNA molecules with that particular 60-mer sequence. When we scan the array, we get a small number of pixels for that spot, and we take the median pixel intensity as the measurement for that spot. So one spot is treated as one observation, for one particular 60-mer sequence. (I do not understand what you mean by "neighborhood probes".) The signal represents the level of bound TF, and it actually correlates very well with independent Kd measurements.

The E-score, according to the article you published in 2006(Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities), is rank-based. I did not get the whole details of how it is calculated but I get how you rank the 8-mers by looking at Figure2.a in your 2006 publication. Using the count of k-mers with E-scores >0.45 to determine whether an L-bp sequence can bind with a given TF. The E-score is indeed rank-based. This makes it robust and comparable between universal PBMs, even between different proteins. (The signal intensities are not directly comparable between experiments.) A simple way to think about the E-score is to consider it as the area under the ROC curve, minus 0.5 (this is not entirely correct, but it's a good approximation). Different E-score cutoffs may be used for different purposes.

Therefore, the 6-mer generated delta S is just used to calculate the z-score and p-value to show the confidence of the change, while the abs value of delta S is not used for classification. correct. DeltaS cannot be used for classification. The sign of deltaS can be used to illustrate whether the binding increases or decreases, but not to make calls about "specific" or "non-specific" sites.

Instead, with the training PBM data, you know each 8-mer's E-score given a TF. With the input of variants, you can fetch 17bp sequence with the SNV at the middle from the hg19/hg38 ref genome. And you know the changes of the counts of each 8-mer with E-score. Based on that, you can classify the 17bp sequence bount/unbound to a TF. Quick clarification: for the 8-mer E-scores we do not use the counts of 8-mers. We use the actual E-scores.

If what I stated above is correct to you, why use two different systems to calculate p-value and make classifications, respectively. Why not train data with 8-mer in a 17bp window ? At a high level, making calls of specific versus non-specific sites (i.e. classification) and predicting quantitatively the affinity of putative binding sites are two distinct problems that oftentimes require different approaches. As mention above, the E-scores are comparable between experiments, and for this reason we can use control PBM experiments with non-DNA binding proteins to determine good cutoffs for "specific" and "non-specific" binding. On the other hand, the E-scores are not very good at predicting TF binding quantitively, so for this task we use regression (here, OLS). The reason we are not using 8-mers in OLS, but we are using 6-mers instead, is due to overfitting of OLS models with 8-mers (please see our RECOMB paper referenced form the QBiC website).

vincentiusmartin / QBiC-Pred

When will the version supporting indels come out? Why input 17-bp sequence when you train the model with 11bp sequence? #9