Open zrcjessica opened 9 months ago
Hey @zrcjessica, no, this is exactly right! I noticed this recently and have a new version coming up this week that fixes it.
This will affect the coordinates released as part of the .h5 predictions object in chrombpnet test and won't affect anything else.
Hello,
I've been going over the code base and it looks to me like `self.coords` from `ChromBPNetBatchGenerator()` appears to represent the region start and the region summit for the peak and nonpeak sequences, respectively, after the random crop is applied to the input sequence.

In the initialization of the `ChromBPNetBatchGenerator` objects, the first line of code calls the `load_data()` function from the `data_utils.py` script. This then calls `get_coords()`, which appears to define the center of each region (whether peak or nonpeak) as the start + summit coordinates from the BED file; i.e. the literal center of the region. However, the final line of code in the `__init__()` function for the generator calls `crop_revcomp_data()`, which then applies the `random_crop()` function from `augment.py` to the peak sequences. This is where they define `new_coords`, after randomly cropping a sequence of length `inputlen` (2114bp) from the peak sequence region with jitter applied (3114bp). It looks like a new start coordinate is being defined from what was originally a center coordinate. Even when the mode is set to `test` and `max_jitter` is subsequently set to 0, this same transformation is still being applied.

However, for nonpeak sequences, only the `subsample_nonpeak_data()` function is applied, which simply randomly subsamples from the original coords as returned by `get_coords()`, which reflect the center of the region.

Therefore, when batches of data are returned by the generator, it seems to me like the `batch_coords` representing peak sequences will reflect the start coordinates, while the coords representing nonpeak sequences will reflect the center coordinates of the region.

It's possible I've overlooked something in the code, but based on my own manipulations of the data this still appears to be the case. It might be helpful to add a disclaimer about this so that users realize the discrepancy and deal with it accordingly.
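
For anyone who wants to work around this in the meantime, here is a minimal sketch of the arithmetic as I understand it. This is not ChromBPNet code. The `Coord` record and the `crop_start()` and `to_center()` helpers are hypothetical names made up for illustration, and the sketch assumes that peak coords hold the start of the cropped window while nonpeak coords hold the region center, as described above.

```python
from typing import NamedTuple

INPUTLEN = 2114  # input sequence length mentioned in this issue


class Coord(NamedTuple):
    """Hypothetical record; not the layout ChromBPNet actually uses."""
    chrom: str
    pos: int        # assumed: crop start for peaks, region center for nonpeaks
    is_peak: bool


def crop_start(center: int, window_len: int, offset: int) -> int:
    """Start of a crop taken `offset` bp into a window of `window_len` bp centered on `center`."""
    return center - window_len // 2 + offset


def to_center(coord: Coord, inputlen: int = INPUTLEN) -> Coord:
    """Return a copy of `coord` whose `pos` is the center of the cropped input window."""
    if coord.is_peak:
        # Assumed: peak positions are crop starts, so shift by half the input length.
        return coord._replace(pos=coord.pos + inputlen // 2)
    # Assumed: nonpeak positions are already region centers.
    return coord


# With max_jitter = 0 the window is exactly inputlen wide and the offset is 0,
# so a peak centered at 10_000 would still be reported by its start coordinate.
peak = Coord("chr1", crop_start(10_000, INPUTLEN, 0), True)
nonpeak = Coord("chr1", 20_000, False)
centers = [to_center(c).pos for c in (peak, nonpeak)]
```

Note that when jitter is applied, shifting by `inputlen // 2` gives the center of the cropped window, which is not necessarily the original summit.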