hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

Windows with multiple Genomedata archives are erroneously chosen with missing data #117

Closed by EricR86 6 years ago

EricR86 commented 6 years ago

Original report (BitBucket issue) by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


Segway selects windows from the first Genomedata archive it encounters.

This is fine if the archives contain no missing data, if they are missing data in the same regions, or if regions of missing data are excluded across all archives.

In the extreme case, this is a serious problem if the first archive contains significantly more missing data than the rest. It creates large gaps in window selection, potentially excluding regions of data in the other archives from being trained on or annotated.

I believe the correct solution is to merge all windows from all archives for use as the working window set.
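
A minimal sketch of what "merging all windows" could look like, assuming each archive's windows are available as half-open `(start, end)` intervals keyed by chromosome name. The names `merge_windows` and `working_windows` are hypothetical and are not part of Segway or Genomedata; this only illustrates taking the union of intervals across archives.

```python
from typing import Dict, Iterable, List, Tuple

Window = Tuple[int, int]  # half-open (start, end) interval on one chromosome


def merge_windows(windows: Iterable[Window]) -> List[Window]:
    """Return the union of the intervals as sorted, non-overlapping windows."""
    merged: List[Window] = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous window: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def working_windows(
    archives: Iterable[Dict[str, List[Window]]]
) -> Dict[str, List[Window]]:
    """Pool per-chromosome windows from every archive, then merge them."""
    pooled: Dict[str, List[Window]] = {}
    for archive in archives:
        for chrom, windows in archive.items():
            pooled.setdefault(chrom, []).extend(windows)
    return {chrom: merge_windows(w) for chrom, w in pooled.items()}


# Hypothetical data: the first archive has a large gap that the second covers.
first = {"chr1": [(0, 1000), (500000, 600000)]}
second = {"chr1": [(0, 250000), (240000, 600000)]}
result = working_windows([first, second])
```

With only the first archive, the region between 1000 and 500000 would be skipped; with the merged set, it is included because the second archive has data there.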

EricR86 commented 6 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


+1

EricR86 commented 6 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


After some investigation, it seems this behavior only occurs for gaps of missing data in a Genomedata archive that are longer than 'MIN_GAP_LEN' (100 000), defined in '_load_seq.py' in Genomedata. When the archive is closed, gaps shorter than that length are merged; longer gaps are marked as a "chunk" start/end boundary.
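
To illustrate the thresholding described above, here is a rough sketch, not Genomedata's actual code. It assumes runs of present data are given as half-open `(start, end)` intervals, and that a gap counts as "short" when it is strictly less than the threshold; the function name `chunk_boundaries` and the strictness of the comparison are both assumptions.

```python
from typing import List, Tuple

MIN_GAP_LEN = 100000  # value quoted in the comment above


def chunk_boundaries(
    present: List[Tuple[int, int]], min_gap_len: int = MIN_GAP_LEN
) -> List[Tuple[int, int]]:
    """Join runs of data separated by short gaps; split on long gaps."""
    chunks: List[Tuple[int, int]] = []
    for start, end in sorted(present):
        if chunks and start - chunks[-1][1] < min_gap_len:
            # Short gap (assumed strict comparison): absorb into the chunk.
            chunks[-1] = (chunks[-1][0], max(chunks[-1][1], end))
        else:
            # First run, or a long gap: start a new chunk.
            chunks.append((start, end))
    return chunks


# Hypothetical runs: a 500 bp gap, then a 195 000 bp gap.
runs = [(0, 1000), (1500, 5000), (200000, 300000)]
chunks = chunk_boundaries(runs)
```

Here the 500 bp gap is absorbed into one chunk, while the 195 000 bp gap produces a chunk boundary, which is the only case where window selection would differ between archives.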

EricR86 commented 6 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


To get around this, Segway would only need to scan the chunk attributes for all the Genomedata archives, right?

EricR86 commented 6 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


Yes, I believe this is probably the best way of fixing this issue.

EricR86 commented 6 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


Fixed in Pull Request 82. Changeset 234fb637f404b6766e45b9b28f6b872431ff1972.