Closed by EricR86 6 years ago
Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).
After some investigation, it seems this behavior only occurs in gaps of missing data in a Genomedata archive that are longer than 'MIN_GAP_LEN' (100 000), defined in '_load_seq.py' in Genomedata. When the archive is closed, gaps shorter than that threshold are merged into the surrounding data; longer gaps are marked as "chunk" start/end boundaries.
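To illustrate the behavior described above, here is a minimal sketch of threshold-based gap merging. It is not Genomedata's actual code: the function name `chunks_from_runs`, the `(start, end)` tuple representation, and the exact comparison (`<` versus `<=`) are assumptions made for illustration.

```python
MIN_GAP_LEN = 100000  # value quoted in this thread; how it is compared is an assumption


def chunks_from_runs(runs, min_gap_len=MIN_GAP_LEN):
    """Merge (start, end) runs of present data separated by short gaps.

    Hypothetical helper: runs separated by a gap shorter than min_gap_len
    are joined into one chunk; longer gaps become chunk boundaries.
    """
    chunks = []
    for start, end in sorted(runs):
        if chunks and start - chunks[-1][1] < min_gap_len:
            # Short gap: extend the previous chunk to cover this run.
            chunks[-1] = (chunks[-1][0], max(chunks[-1][1], end))
        else:
            # First run, or a gap at least min_gap_len long: start a new chunk.
            chunks.append((start, end))
    return chunks


runs = [(0, 1000), (1500, 2000), (200000, 300000)]
print(chunks_from_runs(runs))
```

In this example the 500-base gap is absorbed into one chunk, while the 198 000-base gap stays as a boundary between two chunks.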
Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).
To get around this, Segway would only need to scan the chunk attributes of all the Genomedata archives, right?
Original report (BitBucket issue) by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).
Segway selects windows from the first Genomedata archive it encounters.
This is okay if the archives contain no missing data, if they are all missing data in the same regions, or if every region of missing data is excluded across all archives.
In the extreme case, this is a serious problem if the first archive contains significantly more missing data than the rest. It creates large gaps in window selection, potentially excluding regions that do have data in other archives from both training and annotation.
I believe the correct solution is to merge all windows from all archives for use as the working window set.
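A minimal sketch of what that merge could look like, assuming each archive contributes a list of `(start, end)` windows on the same chromosome with half-open coordinates. The function name `merge_windows` and the data layout are hypothetical and do not reflect Segway's actual internals.

```python
def merge_windows(windows_by_archive):
    """Return the union of windows from all archives as sorted, non-overlapping intervals.

    windows_by_archive: iterable of lists of (start, end) tuples, one list per archive.
    Overlapping or directly adjacent windows are combined into one.
    """
    all_windows = sorted(
        window for windows in windows_by_archive for window in windows
    )
    merged = []
    for start, end in all_windows:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous window: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# The first archive has a large gap that the second archive partly covers.
archive_a = [(0, 100), (500, 600)]
archive_b = [(50, 300), (600, 800)]
print(merge_windows([archive_a, archive_b]))
```

With the union as the working window set, a region is only excluded when every archive is missing data there, rather than whenever the first archive is.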