kundajelab / chrombpnet

Bias factorized, base-resolution deep learning models of chromatin accessibility (chromBPNet)
https://github.com/kundajelab/chrombpnet/wiki
MIT License

Regions larger than 1000 bp #147

Closed ArnovanHilten closed 11 months ago

ArnovanHilten commented 11 months ago

Hi @panushri25

I really like chrombpnet and it is working great! I have one question regarding the output and the input bed files.

If I understand the code correctly, the network predicts on a region of 1000 bp. These are also saved in the h5 output with pred_bw and contribs_bw. When I was preparing my input files, I was therefore surprised to find a filtered.peaks.bed with a 10th column that had a value higher than 1000.

Is it not necessary for the 10th column (summit) to be smaller than 1000 (I guess around 500)? Are regions whose start and end are more than 1000 bp apart automatically split? I could not find this in the code, but maybe I missed it. Are these peaks currently evaluated as a single region of 500 bp around the summit instead of the whole region?

Thank you for developing such a great tool!

Best,

Arno

PS.

The documentation for the models is a bit confusing to me:

chrombpnet.h5: This is the bias factorized chromBPNet model that trains on the observed accessibility. This model is the combination of bias_model_scaled.h5 and chrombpnet_nobias.h5

chrombpnet_nobias.h5: TF-Model i.e. model to predict bias corrected accessibility profile

Just to be sure: is chrombpnet_nobias the model without the bias model part, or is it the model that provides predictions without bias? If I understand correctly, chrombpnet.h5 is the full (final) model.

panushri25 commented 11 months ago

Thank you for the encouraging words!

When evaluating, the peaks are centered on the summit column. It is possible for a peak to have a summit at greater than 1000 bp or to have multiple summits.
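To illustrate the centering described above, here is a minimal sketch of how a 1000 bp window could be derived from one row of a narrowPeak-style file. It assumes the standard narrowPeak convention that the 10th column is the summit offset in bp from the start coordinate, and it takes the 1000 bp width from this thread. The function name `summit_window` and the example row are hypothetical and are not taken from the chrombpnet code base.

```python
def summit_window(chrom, start, summit_offset, width=1000):
    """Return (chrom, win_start, win_end) for a window centered on the summit.

    Hypothetical helper for illustration only; not part of chrombpnet.
    """
    center = start + summit_offset  # absolute summit position
    half = width // 2
    return chrom, center - half, center + half


# Made-up narrowPeak row: the peak spans 3000 bp and the summit lies
# 1800 bp after the start, so column 10 is larger than 1000.
row = "chr1\t10000\t13000\t.\t.\t.\t.\t.\t.\t1800".split("\t")
chrom, start, end, summit = row[0], int(row[1]), int(row[2]), int(row[9])

window = summit_window(chrom, start, summit)
print(window)  # ('chr1', 11300, 12300)
```

Under these assumptions, the start and end columns only matter through the start coordinate that anchors the summit offset; the width of the peak itself does not change the size of the window.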

panushri25 commented 11 months ago

Regarding your question above about whether chrombpnet_nobias is the model without the bias model part or the model that provides predictions without bias: I think both questions point to the same thing. This is the model without the bias part, and hence it provides predictions without bias. Hope that makes sense; if not, let me know and I will try to clarify.

panushri25 commented 11 months ago

Yes, chrombpnet.h5 is the full (final) model.

panushri25 commented 11 months ago

Hope this answers your question. Happy to answer any more questions you may have, please reply back here. I will close this as addressed for now.

ArnovanHilten commented 11 months ago

Thank you for your response, it is clear now. However, I think that it would be good to provide a warning or a message if the input region is larger than 1000 bp. From the description given, it is not super clear that only the region 500 bp around the summit column is evaluated (and that the first columns are not used).

In my case, I had to merge some peaks: it threw an error because the BigWig did not accept regions that were not ordered (I had overlapping regions). When I merged them with bedtools, I was surprised to see only 1000 bp predictions for these larger regions.

How do you handle this yourself? Do you only evaluate the 1000 bp for larger regions, and would this not introduce a bias?
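One possible workaround for merged regions wider than 1000 bp, sketched here as an assumption rather than as the maintainers' method, is to tile each merged region into consecutive, non-overlapping 1000 bp rows whose 10th column points at the middle of each tile. If only the window around the summit is evaluated, every tile would then receive its own prediction. The function name `tile_region` is hypothetical, and the placeholder `"."` values stand in for the narrowPeak columns that are not used here.

```python
def tile_region(chrom, start, end, width=1000):
    """Split [start, end) into consecutive windows of `width` bp.

    Each returned row mimics a 10-column narrowPeak line whose last
    column (summit offset) marks the midpoint of the tile.
    Hypothetical helper for illustration only; not part of chrombpnet.
    """
    rows = []
    for win_start in range(start, end, width):
        # Every tile is exactly `width` bp, so the last one may extend
        # past `end`; chromosome ends are not checked in this sketch.
        win_end = win_start + width
        rows.append(
            (chrom, win_start, win_end, ".", ".", ".", ".", ".", ".", width // 2)
        )
    return rows


# Made-up merged region of 2500 bp
tiles = tile_region("chr1", 10000, 12500)
for tile in tiles:
    print("\t".join(str(field) for field in tile))
```

Because the tiles are sorted and do not overlap, this layout should also avoid the ordering problem with overlapping regions described above, although that has not been verified against the actual pipeline.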