Changing sequence input length - Githubissues

kundajelab / chrombpnet

Bias factorized, base-resolution deep learning models of chromatin accessibility (chromBPNet)

https://github.com/kundajelab/chrombpnet/wiki

MIT License


Changing sequence input length #112

Closed snaqvi1990 closed 1 year ago

snaqvi1990 commented 1 year ago

Hello,

I am curious how much you guys have experimented with input length (down to, say, 500 bp rather than the default of 2114 bp). If I reduce the input length, do I need to re-train a new bias model with the same input length? (I am currently using the pre-trained bias model provided.)

Thanks, Sahin

panushri25 commented 1 year ago

Yes, if you want to reduce the input length, you would want to make this change for the bias model as well.

But why would you want to do that? With a 500 bp input sequence you will only be able to predict about 200-250 bp of the ATAC-seq peaks, and peaks are usually wider than this.

panushri25 commented 1 year ago

You can of course get around this by cropping your bias model output: if you have a bias model trained at 2114 bp input length and 1000 bp output length, you can just crop the 1000 bp output down to the length you need.
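A minimal sketch of that cropping step, assuming the bias model's profile output is available as a NumPy array with positions on the last axis and that the region of interest is centered in the window. The helper name `center_crop` and the stand-in array are illustrative, not part of chrombpnet.

```python
import numpy as np

def center_crop(profile, out_len):
    """Keep the central `out_len` positions along the last axis."""
    full_len = profile.shape[-1]
    if out_len > full_len:
        raise ValueError("cannot crop to a length longer than the input")
    start = (full_len - out_len) // 2
    return profile[..., start:start + out_len]

# Stand-in for a batch of bias-model profile outputs: 4 regions x 1000 bp.
bias_profile = np.arange(4 * 1000, dtype=float).reshape(4, 1000)

cropped = center_crop(bias_profile, 500)
print(cropped.shape)   # (4, 500)
print(cropped[0, 0])   # 250.0, i.e. original positions 250..749 are kept
```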

snaqvi1990 commented 1 year ago

Thanks, makes sense. I wanted to decrease the input length because we have some pre-called features that are over peaks defined at 500 bp, so I wanted to see the exact correspondence with ChromBPNet at that length. I did not know that the input length needs to be ~2x the desired length of the predicted ATAC profile. Why is this? Is it because the reverse complement of the input sequence is also included (i.e. 2114 bp means 1057 bp forward + reverse)?

panushri25 commented 1 year ago

I will give you an intuitive explanation (but you can also work this out based on the architecture).

It's because the model's receptive field is about 1000 bp (from the dilation layers etc.). So to make predictions at, e.g., the edge of a 1000 bp region, you need ~1000 bp of sequence surrounding that position, which means you need additional ~500 bp flanks on each side of the region. So you don't use a 1000 bp sequence to predict a 1000 bp profile; you use a 2114 bp sequence to predict a 1000 bp profile.

panushri25 commented 1 year ago

Based on the receptive field you can determine how much sequence you need, and the receptive field is a function of how deep your network is.
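The dependence on depth can be sketched with a little arithmetic. The layer shapes below are assumptions, not something stated in this thread: a first convolution with kernel 21, then `n_dil_layers` convolutions with kernel 3 and dilation 2**i, then a profile convolution with kernel 75, all without padding. Under those assumptions the numbers reproduce the default 2114 bp input for a 1000 bp output.

```python
def valid_conv_shrink(kernel_size, dilation=1):
    """Positions lost by one unpadded ('valid') 1D convolution."""
    return dilation * (kernel_size - 1)

def total_shrink(n_dil_layers, first_kernel=21, dil_kernel=3, profile_kernel=75):
    """Total positions lost from input to profile output (assumed layer shapes)."""
    shrink = valid_conv_shrink(first_kernel)
    for i in range(1, n_dil_layers + 1):
        shrink += valid_conv_shrink(dil_kernel, dilation=2 ** i)
    shrink += valid_conv_shrink(profile_kernel)
    return shrink

def required_input_len(out_len, n_dil_layers):
    """Shortest input that still yields `out_len` output positions."""
    return out_len + total_shrink(n_dil_layers)

print(total_shrink(8))              # 1114
print(required_input_len(1000, 8))  # 2114
```

Each extra dilation layer doubles the dilation, so the required flank grows roughly geometrically with depth.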

panushri25 commented 1 year ago

Closing this due to inactivity; please feel free to reopen it if you have any more questions.

snaqvi1990 commented 1 year ago

Sorry for the delay - I ended up trying this by re-training a bias model with input length 1200 and output length 500. It seemed to train just fine. However, when I now try to train the chrombpnet model with input length 1200 and output length 500 (all other parameters default), I get the below error. Any idea what's going on?

  File "/oak/stanford/groups/pritch/users/naqvi/scripts_new/conda_inst/envs/chrombpnet2/lib/python3.8/site-packages/chrombpnet/training/train.py", line 84, in main
    model, architecture_module=get_model(args, parameters)
  File "/oak/stanford/groups/pritch/users/naqvi/scripts_new/conda_inst/envs/chrombpnet2/lib/python3.8/site-packages/chrombpnet/training/train.py", line 24, in get_model
    model=architecture_module.getModelGivenModelOptionsAndWeightInits(args, parameters)
  File "/oak/stanford/groups/pritch/users/naqvi/scripts_new/conda_inst/envs/chrombpnet2/lib/python3.8/site-packages/chrombpnet/training/models/chrombpnet_with_bias_model.py", line 104, in getModelGivenModelOptionsAndWeightInits
    bpnet_model_wo_bias = bpnet_model(filters, n_dil_layers, sequence_len, out_pred_len)
  File "/oak/stanford/groups/pritch/users/naqvi/scripts_new/conda_inst/envs/chrombpnet2/lib/python3.8/site-packages/chrombpnet/training/models/chrombpnet_with_bias_model.py", line 70, in bpnet_model
    assert cropsize>=0

panushri25 commented 1 year ago

Yes, the model has 8 dilation layers, with a receptive field of about 1000 bp, so it's expecting the full 2114 bp sequence for chrombpnet.
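A rough reconstruction of why `assert cropsize>=0` fails for a 1200 bp input and 500 bp output. Only the 8 dilation layers and the assertion itself come from this thread; the kernel sizes (21 for the first convolution, 3 for the dilated ones, 75 for the profile convolution), the dilation schedule 2**i, and the form of the `cropsize` expression are assumptions made for illustration.

```python
def precrop_len(input_len, n_dil_layers=8, first_kernel=21, profile_kernel=75):
    """Profile length before cropping, assuming unpadded convolutions."""
    length = input_len - (first_kernel - 1)
    for i in range(1, n_dil_layers + 1):
        length -= 2 * (2 ** i)  # kernel 3 with dilation 2**i loses 2 * 2**i positions
    return length - (profile_kernel - 1)

def cropsize(input_len, out_len, n_dil_layers=8):
    """Positions trimmed from each side; negative means the input is too short."""
    return precrop_len(input_len, n_dil_layers) // 2 - out_len // 2

print(precrop_len(1200))     # 86: only 86 positions survive, fewer than the 500 requested
print(cropsize(1200, 500))   # -207, so the assertion fails
print(cropsize(2114, 1000))  # 0, the default lengths fit exactly
```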

panushri25 commented 1 year ago

Can you tell me why you want to do it with a 1200 bp input length? I can maybe help you figure out how to get there. I think you mentioned the following previously:

I wanted to decrease the input length because we have some pre-called features that are over peaks defined at 500 bp, so I wanted to see the exact correspondence with ChromBPNet at that length.

What do you mean by exact correspondence with ChromBPNet at that length? How are you planning to use these features or compare them with ChromBPNet?

snaqvi1990 commented 1 year ago

I am trying to fine-tune the ChromBPNet model (trained on ATAC as usual) to predict these features over 500 bp peaks. I have been able to fine-tune the 2114 bp input length ChromBPNet model for this task, but it does not perform well. So I would like to see if I can do better by using a ChromBPNet model that trains on a smaller input and output length (the exact input length is not particularly important, just something that makes sense for predicting a 500 bp output).
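As a rough guide to which input lengths "make sense" for a 500 bp output, the sketch below tabulates the shortest workable input for different numbers of dilation layers. It assumes unpadded convolutions with kernel 21 (first layer), kernel 3 and dilation 2**i (dilated layers), and kernel 75 (profile layer); these shapes are assumptions, and the thread does not establish whether the depth of the chrombpnet model can be changed from the training command.

```python
def min_input_len(out_len, n_dil_layers, first_kernel=21, profile_kernel=75):
    """Shortest input giving `out_len` outputs under the assumed layer shapes."""
    shrink = (first_kernel - 1) + (profile_kernel - 1)
    shrink += sum(2 * 2 ** i for i in range(1, n_dil_layers + 1))
    return out_len + shrink

for n in range(4, 9):
    print(n, min_input_len(500, n))
# 4 654
# 5 718
# 6 846
# 7 1102
# 8 1614
```

Under these assumptions a 1200 bp input can support a 500 bp output with 7 or fewer dilation layers, while 8 layers would need at least 1614 bp. A shallower network also has a smaller receptive field, which may affect how well it models longer-range sequence context.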