twang15 / K562-Analysis


BasePairModel Questions #17

Open smwhite7 opened 2 years ago

smwhite7 commented 2 years ago

10/27/21

  1. can we use variant-containing sequences as test data?
  2. profile vs. counts information? Answer: the profile output is a per-base probability and the counts output is the total read count; multiplying the two gives per-base counts. Ask Zahoor what the profile output is supposed to be and, if it is not in the documentation, open a pull request to add it.
  3. what threshold should be used for determining a count/profile change? Is it the same threshold used for SNPs?
twang15 commented 2 years ago

11/04/2021, Shannon

I want to know which measurements are output, what data type they are, and whether the outputs are a summary over all of the test data or whether we also get the individual values for each region entered through the test data. The tutorial discusses computing “importance scores”; it would be good to know exactly how those are calculated.

We are interested in measuring binding strength at per-base resolution for the DNA fragments given in the test data, so those outputs should come before the motif-discovery step.

11/04/2021, Jay

  1. the general structure of the model and the outputs of the basepairmodels CLI, i.e. what the functions do and how they work together, since I haven’t done any indel prediction yet.

I don’t mean very specific details, just something like “function1() reads the data, function2() handles regularization”.

10/17/2021, Jay

Hi @zahoor_zafrulla, I want to clarify a few points:

  1. Basepairmodels models (like BPNet models) for ChIP-seq data are trained on bigWig files and BED files, both of which are based on some standard reference genome (i.e. the bigWigs come from alignments to hg19 or hg38, and the BED coordinates are in hg19 or hg38). Am I understanding correctly?
  2. For my workflow, the sequence of events would be (for the basepairmodels tutorial on github):
    • [For sequences WITHOUT variants] Train a basepairmodels model on ENCODE3 CTCF bigWigs (based on BAM files containing hg38 alignments) --> prediction on hg19 sequences (which contain indels; each hg19 sequence corresponds with an indel-containing sequence)
    • [For sequences WITH variants] As with the run for sequences w/o variants, train the model on the CTCF data --> prediction on the indel-containing sequences
    • [Note] I’m not sure if testing hg19 sequences on a model trained on hg38 data is appropriate
    • [Question] Is the train-test-validation split ratio the same as the one for BPNet (60%/20%/20%)? Is this something I need to pay close attention to?
  3. My understanding is that the CLI for basepairmodels does not generate plots and that I need to write code to make them. Is there a framework I could follow (e.g. the Jupyter notebook code for basepairmodels, or the “De-novo sequence scanning” section of the colab notebook code for BPNet)? I know that the model needs to be imported, but in the Jupyter notebook it seems many imports are required.
  4. If I’m not mistaken, the actual ChIP-seq signal prediction for the hg19 sequences or indel-containing sequences happens in the IDE (not on the CLI)? (e.g. an enhancer sequence, defined as a string, is used in the BPNet colab notebook)
    • During our meeting we discussed how the model provides a 2x2114 matrix (logit units) and a 2x1 matrix (log of the sum of read counts of the profile). My understanding is that these matrices are generated for each hg19 or indel-containing sequence I use in the IDE (not the CLI)?
  5. To verify, the “Compute metrics” part of the basepairmodels github page refers to the test set prediction metrics (test set is from the ENCODE CTCF ChIP-seq data)?
  6. Are the “Importance scores” and “Discover motifs with TF-modisco” sections on the github page of immediate relevance to testing sequences on the IDE? I’m guessing that importance scores are incorporated into the imported model already but am not sure about TF-modisco.
twang15 commented 2 years ago

10/14/2021, Tao

  1. Hi Zahoor, in 1.3 there is "mv peaks.bed ENCSR000EGM/data"; is this the file downloaded in 1.2 from the ENCODE portal?
    • Answer: yes, that’s correct
  2. Another question: for the following figure, where can we download the three files (chroms.txt, hg38.genome.fa, hg38.genome.fai)? https://github.com/kundajelab/basepairmodels/blob/master/docs-build/tutorial/bpnet/images/directory-reference.png

10/27/2021, Tao

  1. Hi Zahoor, could you point us to the documentation/formula for calculating the predicted profiles and read counts?

11/10/2021, Tao

To compute the metrics, the command looks like this:

    metrics -A [path to profile training bigwig] \
            -B [path to profile predictions bigwig] \
            --peaks ENCSR000EGM/data/peaks.bed \
            --chroms chr1 \
            --output-dir ENCSR000EGM/metrics \
            --apply-softmax-to-profileB \
            --countsB [path to exponentiated counts predictions bigWig] \
            --chrom-sizes reference/hg38.chrom.sizes

Questions:

  1. In this metrics command, what are the profile training bigwig and the profile predictions bigwig?
  2. And what is the “exponentiated counts predictions bigWig”?
twang15 commented 2 years ago

11/12/2021, Tao, Shannon, Jay

model_split000_task0_plus.wig: logit values; these are base pair-specific values.

model_split000_task0_plus_exponentiated_counts.wig: counts, i.e. the total counts of the entire profile (N); this is the entire area under the peak profile curve.

Reference: https://github.com/kundajelab/basepairmodels/blob/7a39caebcaf7c9758ae8dd097466ef9b39c5ac49/basepairmodels/cli/logits2profile.py#L139

Q: how do we transform a base pair-specific logit value into a base pair-specific signal value?

    # Reference: https://github.com/kundajelab/basepairmodels/blob/7a39caebcaf7c9758ae8dd097466ef9b39c5ac49/basepairmodels/cli/logits2profile.py#L139
    # scale logits: first softmax, then multiply by counts
    import numpy as np
    from scipy.special import logsumexp

    probVals = logits_vals - logsumexp(logits_vals)  # log P_i (log-softmax)
    probVals = np.exp(probVals)                      # P_i
    profile = np.multiply(counts_vals, probVals)     # N * P_i (per-base signal)
  1. convert the logit values into probabilities
    • softmax: maps a vector of logit values to a vector of probabilities
    • input: [1, 2, 3, 4]
      • [e^1, e^2, e^3, e^4]
      • sum = e^1 + e^2 + e^3 + e^4
    • output: [e^1/sum, e^2/sum, e^3/sum, e^4/sum]
      • sum(output) = 1

probVals = logits_vals - log[sumexp(logits_vals)]

Since x = log(e^x), we have logits_vals = log(e^logits_vals), so

logits_vals - log[sumexp(logits_vals)] = log(e^logits_vals) - log[sumexp(logits_vals)] = log{ e^logits_vals / sumexp(logits_vals) }

probVals = np.exp(probVals)

e^( log{ e^logits_vals / sumexp(logits_vals) } ) = e^logits_vals / sumexp(logits_vals)

  2. calculate the signal value
    • N * output (see the numeric sketch below)
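
A minimal numeric sketch of the two steps above, with made-up values:

    import numpy as np
    from scipy.special import logsumexp

    logits = np.array([1.0, 2.0, 3.0, 4.0])    # toy per-base logits
    N = 100.0                                   # toy total read count (counts output)

    probs = np.exp(logits - logsumexp(logits))  # softmax; probs.sum() == 1
    per_base_signal = N * probs                 # per-base counts, sums back to N
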
twang15 commented 2 years ago

Questions, Tao, 11/16/2021

  1. Predict
    • what is the input for predict, and how do we specify it?
    • can we use a string sequence as input?
    • what are the outputs?
    • why is the input sequence length (2114) longer than the output profile length (1000)?
    • how is strandedness conveyed to the model, and how is it represented in the output?
twang15 commented 2 years ago

Meeting with Vivek, Shannon, Jay, and Tao, 11/18/2021

  1. control: the models are trained with controls, but for our exploration there is no control. Vivek confirms that it is fine to provide a control with all-zero signal values.
  2. The API call:
    • we thought the input was simply a sub-sequence, but a closer look at the bpnet code shows that the sub-sequence has to be preprocessed before it can be used as input to the predict() API (see the sketch after these steps):
      • step 1. identify the target motifs (w/ or w/o snp/indels)
      • step 2. construct the input sub-sequence by placing the motif sequence at the center of a 2114-bp reference genome sequence.
      • step 3. one hot encode the input sub-sequence
      • step 4. prepare the batch array by providing the one-hot encoding and the all-zero control
      • step 5. call model.predict() with inputs (one-hot encoded sub-sequence and all zero control as np array in a batch)
      • step 6. extract the profile and counts for both strands
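
A rough sketch of steps 3 to 6, assuming a Keras model that can be loaded with load_model and that takes the one-hot sequence plus a control track as inputs. The file name, input ordering, control shape, and output shapes here are assumptions, not the pipeline's actual implementation; check model.inputs / model.outputs on the real saved model.

    import numpy as np
    from tensorflow.keras.models import load_model

    # hypothetical model file; the real path and any custom objects depend
    # on how the basepairmodels model was saved
    model = load_model("model_split000.h5", compile=False)

    def one_hot(seq):
        """One-hot encode an ACGT string into a (len(seq), 4) float array."""
        idx = {"A": 0, "C": 1, "G": 2, "T": 3}
        arr = np.zeros((len(seq), 4), dtype=np.float32)
        for i, base in enumerate(seq.upper()):
            if base in idx:                      # Ns stay all-zero
                arr[i, idx[base]] = 1.0
        return arr

    # steps 2-3: a 2114-bp sub-sequence with the motif (or SNP/indel) at its center
    sub_seq = "ACGT" * 528 + "AC"                # placeholder 2114-bp sequence
    X = one_hot(sub_seq)[np.newaxis, ...]        # step 4: batch of 1, shape (1, 2114, 4)

    # step 4: all-zero control, as Vivek suggested; the shape is an assumption
    control = np.zeros((1, 2114, 2), dtype=np.float32)

    # step 5: the input list/ordering is an assumption; check model.inputs
    profile_logits, log_counts = model.predict([X, control])

    # step 6: per-strand profile logits and log total counts for the batch
    print(profile_logits.shape, log_counts.shape)

The logits can then be converted to per-base signal with the softmax-and-scale snippet from the earlier comment.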

Screen Shot 2021-11-12 at 2.42.58 PM.pdf

  3. padding: this is the reason why the input is 2114 bp. The padding could be either a GC sequence or a real motif peak-region sequence, with the target motif sequence at the center (see the sketch after these steps), i.e.,
    • take a sequence of peak region
    • shuffle the sequence to get a template
    • insert the motif sub-sequence into the template in the center
    • one-hot encode
    • predict
    • perform the above steps N (500+) times with different peak sequences, and average all the counts and signals
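
A minimal sketch of this shuffle-and-insert padding; the helper name and the plain per-base shuffle are illustrative only, not necessarily how we will implement it:

    import random

    def pad_with_shuffled_peak(peak_seq, motif_seq, input_len=2114):
        """Shuffle a peak sequence into a template and place the motif at its center."""
        template = list(peak_seq[:input_len].ljust(input_len, "N"))
        random.shuffle(template)                        # scramble the peak context
        start = (input_len - len(motif_seq)) // 2       # center the motif
        template[start:start + len(motif_seq)] = list(motif_seq)
        return "".join(template)

    # repeat with N (500+) different peak sequences, predict on each padded
    # sequence, and average the resulting counts/profiles as described above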

OR, we will perform the padding in the following way, because we want to keep the context of SNP/Indels so that we can see the influence of SNP/indels on the neighboring context regions.

  4. 2114 as the input sequence length and 1000 as the output sequence length: these are the magic numbers of the trained models, and we will stick to them in our exploration.
  5. strandness: regardless of whether --strand is specified, there will always be prediction results for both strands. We can pick one of them if that makes more sense to us.
  6. TODOs
    • Vivek will set up a separate folder to share the models on SCG and let us know.
    • Tao will code up the preprocessing/post-processing of the prediction.