vitkl opened this issue 1 year ago

Hi @jmschrei, does the BPNet code support arbitrary sequence lengths? https://github.com/jmschrei/bpnet-lite/blob/master/bpnetlite/bpnet.py
The BPNet model itself should work on any length. Pay attention to the trimming parameter, which specifies the difference between the input length and the output length (or, more precisely, half that difference, because it is the amount trimmed off either side). However, I may have gotten lazy in other parts of the codebase and assumed a 2114 bp input or a 1000 bp output. Let me know if that's the case anywhere and I'll fix it.
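For concreteness, the relationship described above can be written out as a tiny sketch (the helper name is just illustrative):

```python
def trimming_for(in_window, out_window):
    """Amount trimmed off each side so the output window fits inside the
    input window: trimming = (input length - output length) / 2."""
    assert in_window >= out_window and (in_window - out_window) % 2 == 0
    return (in_window - out_window) // 2

print(trimming_for(2114, 1000))  # 557 -> the usual 2114 bp in / 1000 bp out setup
print(trimming_for(1000, 1000))  # 0   -> equal input and output lengths
```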
Thanks for clarifying!
What are the requirements for the input sequence length? Can I change trimming to use 1000 input and 1000 output?
If you set trimming to 0 you should be able to use the same input and output length. However, you'll likely get worse predictions on the flanks because they can't see their full contexts.
I would like to use this as a bias model - how much context does the bias model need to see?
Usually, the bias model is given far less context so that it does not inadvertently learn complex rules. We usually use a model with four layers, so the residual layers aggregate 2**4 nucleotides after the first layer.
Thanks for clarifying! Trimming 16/2 nucleotides probably doesn't matter too much.
What are the bias model's first-layer filter size and FCLayers dimensionality?
Look through this example specification: https://github.com/jmschrei/bpnet-lite/blob/master/example_jsons/chrombpnet_pipeline_example.json
I am not sure where I should look for a full list of bias model parameters. I see this https://github.com/jmschrei/bpnet-lite/blob/master/example_jsons/chrombpnet_pipeline_example.json#L34-L41 but it only mentions the number of layers.
Also interesting to see that BPNet doesn't use any normalisation layers (e.g. LayerNorm/BatchNorm). I wonder if not using those normalisation layers is a prerequisite for learning TF concentration-dependent effects.
The bias model is just a BPNet model so you can use any of the parameters in https://github.com/jmschrei/bpnet-lite/blob/master/example_jsons/bpnet_fit_example.json
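For intuition only, here is a minimal PyTorch sketch of a BPNet-style bias model. This is not the bpnet-lite implementation, and the first-layer kernel width, filter count, and head sizes are placeholder assumptions; it is only meant to show which parameters the discussion above refers to (first-layer filter size, number of filters, number of dilated layers, trimming).

```python
import torch
import torch.nn as nn

class TinyBiasBPNet(nn.Module):
    """Illustrative BPNet-style bias model: a wide first-layer motif scanner
    followed by a short stack of dilated residual convolutions. All sizes
    here are placeholders, not bpnet-lite defaults."""

    def __init__(self, n_filters=64, first_kernel=21, n_layers=4, trimming=0):
        super().__init__()
        self.trimming = trimming
        self.iconv = nn.Conv1d(4, n_filters, first_kernel, padding=first_kernel // 2)
        self.rconvs = nn.ModuleList([
            nn.Conv1d(n_filters, n_filters, 3, padding=2 ** i, dilation=2 ** i)
            for i in range(1, n_layers + 1)
        ])
        self.profile_head = nn.Conv1d(n_filters, 1, 75, padding=37)
        self.count_head = nn.Linear(n_filters, 1)

    def forward(self, x):
        # x: (batch, 4, input_length) one-hot DNA
        h = torch.relu(self.iconv(x))
        for conv in self.rconvs:                        # dilated residual stack
            h = h + torch.relu(conv(h))
        t = self.trimming
        h = h[:, :, t:h.shape[-1] - t] if t > 0 else h  # crop to the output window
        profile = self.profile_head(h)                  # per-base profile logits
        counts = self.count_head(h.mean(dim=-1))        # total-count prediction
        return profile, counts
```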
I'm not sure how normalization would relate to learning TF concentration-dependent effects. Presumably, that's static in each experiment.
This makes sense. How do you motivate not using normalisation?
Presumably, the motivation is that adding it in didn't help, empirically, and keeping the model conceptually simpler can help with interpretation.
Makes sense!
Looking at the file with options: how large should the jitter be, and should the sequences be reverse-complemented during training, or should both strands be scanned and aggregated?
I'd recommend having it be as large as possible while still capturing the peak. Basset would have benefitted from a larger jitter because it had position dependence in the model through the use of dense layers. BPNet is much more resistant to this because it only uses convolution layers.
https://www.biorxiv.org/content/10.1101/103663v1 suggests the best approach is to train the model and then, at inference time, scan both. I don't think it matters too much though.
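For reference, the jittering discussed above is just a random shift of the training windows. A sketch, where the window sizes and helper name are illustrative and the regions are assumed to be loaded with max_jitter extra bases on each side:

```python
import torch

def jitter_crop(seq, signal, in_window, out_window, max_jitter):
    """Randomly shift the model's input/output windows within a region loaded
    with max_jitter extra bases on each side.

    seq:    (4, in_window + 2 * max_jitter) one-hot sequence
    signal: (strands, out_window + 2 * max_jitter) target signal
    """
    shift = torch.randint(0, 2 * max_jitter + 1, (1,)).item()
    x = seq[:, shift: shift + in_window]
    y = signal[:, shift: shift + out_window]
    return x, y
```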
Do you mean that BPNet should in principle work as well with a smaller shift?
I saw the paper - however, I have to use the summation strategy in the rest of the model - so the question is whether the bias model should be similarly trained.
Many of the choices for BPNet were done on a small number of ChIP-nexus data sets. I don't think we've rigorously tested each decision by, for example, looking at performance with different jitters. I think that jittering helps BPNet but don't have more than just that intuition.
The bias model and subsequently the full ChromBPNet model should be trained by randomly RCing input. If you want to force outputs to be identical across strands then, at inference time, you run a sequence and its RC through the entire model (bias and accessibility) and aggregate.
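As a sketch of what random RC augmentation looks like, assuming one-hot channels ordered A, C, G, T and a stranded signal track (single-track data would only flip the position axis):

```python
import torch

def random_rc(x, y, p=0.5):
    """Randomly reverse-complement one training example.

    x: (4, length) one-hot sequence with channels ordered A, C, G, T
    y: (strands, length) signal, e.g. +/- strand profiles

    With channels in A, C, G, T order, flipping the channel axis is the
    complement and flipping the length axis is the reverse.
    """
    if torch.rand(1).item() < p:
        x = torch.flip(x, dims=(0, 1))
        y = torch.flip(y, dims=(0, 1))  # swap strands and reverse positions
    return x, y
```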
I see. I observed that 3bp jittering helps cell2state compared to no jittering. I will test 100bp jittering.
I see what you mean. I would like to use BPNet as a bias model but the biological model is very different. I am using a parameter-sharing architecture in the cell2state CNN because the reference-based TF motifs will only be recognised in either RC or FW form, so the model has to look at both directions and aggregate. As far as I understand, random RC forces the model to learn both FW and RC forms of the TF motifs (or, in this case, bias motifs) - does this match your intuition? The main question is whether it's fine to randomly RC the input for the bias model but use both FW/RC for the biological model.
Randomly RCing encourages the model to learn motifs in both directions but there's no guarantee that it learns the same effect in both directions, even though biologically that's plausible. There have been RARE cases where it learns a motif in one direction and not in the other, but this is usually only when there are not a lot of binding sites.
I'm not sure I understand how you're training your biological model. Are you training it jointly with a frozen pre-trained bias model? I guess my feeling is that both should be trained the same way. If you end up doing parameter sharing with your model, you should train the bias model with parameter sharing. Otherwise, it might learn some weird effects. Unfortunately, parameter sharing is not implemented in bpnet-lite.
I see. The model failing to learn a motif in both directions makes sense.
Also makes sense to train both models with parameter sharing. The bpnet-lite code helps a lot with understanding the architecture. I will try implementing parameter sharing, but I am not sure I fully understand how to do that for layers 2+. My biological architecture uses just one CNN layer: it scans forward DNA with the FW filter, then with the RC filter (simply swap complementary nucleotides and flip the width axis), applies an Exp activation to each, and then adds the results of the FW and RC scans. Does the same operation happen in all subsequent layers?
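For reference, a minimal PyTorch sketch of the single-layer FW/RC parameter sharing described above; the filter count and kernel width are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCSharedConv(nn.Module):
    """Single motif-scanning layer with forward/RC parameter sharing: the RC
    filter is the forward filter with its nucleotide and width axes flipped,
    so both strands share weights. Sizes are placeholders."""

    def __init__(self, n_filters=64, kernel_size=21):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # x: (batch, 4, length) one-hot DNA with channels ordered A, C, G, T
        fw = self.conv(x)
        # RC filter: flipping the channel axis complements (A<->T, C<->G for
        # A, C, G, T ordering) and flipping the width axis reverses the motif.
        rc_weight = torch.flip(self.conv.weight, dims=(1, 2))
        rc = F.conv1d(x, rc_weight, bias=self.conv.bias,
                      padding=self.conv.kernel_size[0] // 2)
        # Exp activation on each strand's scan, then sum the two.
        return torch.exp(fw) + torch.exp(rc)
```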
If you want to do parameter sharing I'd recommend having a wrapper that takes in a normal model, runs your sequence through it, then runs the RC'd sequence through it, and then flips the output from the RC'd sequence and averages it. No need to modify the underlying model.
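A sketch of that wrapper, assuming the wrapped model returns a single tensor whose last axis is position; a real BPNet model returns (profile, counts), and a stranded profile would also need its strand axis swapped:

```python
import torch
import torch.nn as nn

class RCAveragingWrapper(nn.Module):
    """Run a sequence and its reverse complement through an unmodified model,
    un-reverse the RC output, and average the two predictions."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        # x: (batch, 4, length) one-hot DNA with channels ordered A, C, G, T
        y_fw = self.model(x)
        x_rc = torch.flip(x, dims=(1, 2))    # complement + reverse
        y_rc = self.model(x_rc)
        y_rc = torch.flip(y_rc, dims=(-1,))  # map RC positions back to forward coordinates
        return (y_fw + y_rc) / 2
```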
Thanks for this suggestion @jmschrei! This certainly simplifies the implementation.
Didn't Anshul's group previously show that this kind of conjoining during training is worse than both conjoining during evaluation and a model that correctly accounts for the symmetries in every layer?
Actually, this won't work, because for forward and RC sequences to give the same results the model needs to represent both the forward and RC versions of each TF motif: -A> and <A-. Which is fine if you train a CNN de novo, but not fine if you use TF motifs as priors on the first-layer CNN weights.