Open smwhite7 opened 2 years ago
I want to know what measurements are output, what data type they are, and whether the outputs are a summary of all of the test data or whether we also get the individual values for each region entered through the test data. The tutorial discusses computing “importance scores”; it would be good to know exactly how those are calculated.
We are interested in measuring binding strength at per-base resolution for the DNA fragments given in the test data, so those outputs should come from before the motif discovery step.
I don’t mean very specific details, but something like: function1() reads the data, function2() is for regularization.
Hi @zahoor_zafrulla, I want to clarify a few points:
To compute the metrics, the command looks like this:
metrics -A [path to profile training bigwig] -B [path to profile predictions bigwig] --peaks ENCSR000EGM/data/peaks.bed --chroms chr1 --output-dir ENCSR000EGM/metrics --apply-softmax-to-profileB --countsB [path to exponentiated counts predictions bigWig] --chrom-sizes reference/hg38.chrom.sizes
Questions:
model_split000_task0_plus.wig: logit_value, a base-pair-specific value
model_split000_task0_plus_exponentiated_counts.wig: counts, the total counts N of the entire profile, i.e. the area under the peak profile curve
Q: How do we transform the base-pair-specific logit_value into a base-pair-specific signal value?
# Reference: https://github.com/kundajelab/basepairmodels/blob/7a39caebcaf7c9758ae8dd097466ef9b39c5ac49/basepairmodels/cli/logits2profile.py#L139
import numpy as np
from scipy.special import logsumexp

# logits_vals: per-base logits; counts_vals: total counts N (see the two .wig files above)
# scale logits: first softmax, then multiply by counts
probVals = logits_vals - logsumexp(logits_vals) # logP_i
probVals = np.exp(probVals) # P_i
profile = np.multiply(counts_vals, probVals) # N*P_i
probVals = logits_vals - log(sumexp(logits_vals))
Since x = log(e^x), logits_vals = log(e^logits_vals), so:
logits_vals - log(sumexp(logits_vals)) = log(e^logits_vals) - log(sumexp(logits_vals)) = log(e^logits_vals / sumexp(logits_vals))
probVals = np.exp(probVals)
e^(log(e^logits_vals / sumexp(logits_vals))) = e^logits_vals / sumexp(logits_vals)
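To check the derivation numerically, here is a minimal sketch on made-up numbers. It is not the basepairmodels code: `logits_to_profile`, the local `logsumexp` helper (a standard-library stand-in for `scipy.special.logsumexp`), and the toy values are all hypothetical, and it assumes the softmax is taken over a single region with one scalar total count N.

```python
import math


def logsumexp(values):
    """Stable log(sum(exp(v))); stand-in for scipy.special.logsumexp."""
    m = max(values)  # subtract the max so exp() cannot overflow
    return m + math.log(sum(math.exp(v - m) for v in values))


def logits_to_profile(logits, total_counts):
    """Per-base signal N * P_i, where P = softmax(logits) (hypothetical helper)."""
    lse = logsumexp(logits)
    # exp(logit_i - logsumexp(logits)) is P_i, as in the derivation above
    return [total_counts * math.exp(x - lse) for x in logits]


# Toy 4 bp region; values are invented for illustration only.
toy_logits = [0.0, 1.0, 2.0, 1.0]
toy_counts = 100.0
toy_profile = logits_to_profile(toy_logits, toy_counts)

# Direct softmax, e^logit_i / sum_j e^logit_j, for comparison.
denom = sum(math.exp(x) for x in toy_logits)
naive_profile = [toy_counts * math.exp(x) / denom for x in toy_logits]

# Large logits: the log-space form still works, while the direct form
# would fail because math.exp(1000.0) overflows.
big_profile = logits_to_profile([1000.0, 1000.0], 10.0)
```

Under these assumptions the two forms agree, and the per-base values sum back to N, which matches the description of counts as the area under the profile. The log-space version is the one that stays usable when the logits are large.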
Or, we will perform the padding in the following way, because we want to keep the context of the SNPs/indels so that we can see their influence on the neighboring regions.
10/27/21