kundajelab / chrombpnet

Bias factorized, base-resolution deep learning models of chromatin accessibility (chromBPNet)
https://github.com/kundajelab/chrombpnet/wiki
MIT License
124 stars 34 forks source link

Discrepancy between coords self.coords between peak and nonpeak sequences #180

Open zrcjessica opened 9 months ago

zrcjessica commented 9 months ago

Hello,

I've been going over the code base and it looks to me like self.coords from ChromBPNetBatchGenerator() appears to represent the region start and the region summit for the peak and nonpeak sequences, respectively, after the random crop is applied to the input sequence. In the initialization of the ChromBPNetBatchGenerator objects, the first line of code calls the load_data() function from the data_utils.py script. This then calls get_coords() which appears to define the center of each region (both peak or nonpeak) as the start + summit coordinates from the BED file; i.e. the literal center of the region. However, the final line of code in the __init__() function for the generator calls crop_revcomp_data() which then applies the random_crop() function from augment.py to the peak sequences. This is where they define the new_coords after randomly cropping a sequence of length inputlen (2114bp) from the peak sequence region with jitter applied (3114bp) as:

new_coords = coords.copy()
new_coords[:,1] = new_coords[:,1].astype(int) - (seqs.shape[1]//2) + starts

It looks like a new start coordinate is being defined from what was originally a center coordinate. Even when the mode is set to test and max_jitter is subsequently set to 0, this same transformation is still being applied.

However, for nonpeak sequences, only the subsample_nonpeak_data() function is applied, which simply randomly subsamples from the original coords as returned by get_coords(), which reflect the center of the region.

Therefore, when batches of data returned by the generator, it seems to me like the batch_coords representing peak sequences will reflect the start coordinates while the coords representing nonpeak sequences will reflect the center coordinates of the region.

It's possible I've overlooked something in the code, but based on my own manipulations of the data this still appears to be the case. It might be helpful to add a disclaimer about this so that users realize the discrepancy and deal with it accordingly.

panushri25 commented 9 months ago

Hey @zrcjessica, no this is exactly right! I have noticed this recently and have a new version coming up this week that fixes this.

panushri25 commented 9 months ago

This will effect the coordinates released as a part of the .h5 predictions object in chrombpnet test and wont effect anything else.