hoffmangroup / genomedata

The Genomedata format for storing large-scale functional genomics data.
https://genomedata.hoffmanlab.org/
GNU General Public License v2.0

find_chunks in _load_seq.py does not end a chunk early at the end of a supercontig #43

Open EricR86 opened 6 years ago

EricR86 commented 6 years ago

Original report (archived issue) by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


Currently, Genomedata does not index runs of missing data longer than MIN_GAP_LEN.

However, if a supercontig ends in a run of NaNs, that run is indexed regardless of its length. In the extreme case, a supercontig could start with a single data point followed only by NaNs, and the chunk start and end would span the entire region, even if that region were far longer than MIN_GAP_LEN.

As a result, Genomedata reports large empty regions at the beginning and end of supercontigs when the chunk_starts and chunk_ends attributes are used.
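To check whether a given chunk is affected, something like the following could be used. This is only a rough sketch on a plain NumPy array: `edge_nan_lengths` is a hypothetical helper, not part of Genomedata, and it assumes the chunk coordinates are half-open offsets into a 1-D array of values for one track.

```python
import numpy as np


def edge_nan_lengths(data, chunk_start, chunk_end):
    """Return (leading, trailing) NaN run lengths inside one chunk.

    Hypothetical helper for illustration; `data` is a 1-D array and
    [chunk_start, chunk_end) is assumed to be a half-open interval.
    """
    chunk = np.asarray(data[chunk_start:chunk_end], dtype=float)
    # Offsets (within the chunk) of positions that actually hold data
    present = np.flatnonzero(~np.isnan(chunk))
    if present.size == 0:
        # The whole chunk is missing data
        return len(chunk), len(chunk)
    leading = int(present[0])
    trailing = int(len(chunk) - 1 - present[-1])
    return leading, trailing


# The extreme case from the report: one data point, then only NaNs
supercontig = np.array([1.0] + [np.nan] * 10)
leading, trailing = edge_nan_lengths(supercontig, 0, len(supercontig))
print(leading, trailing)
```

If either value exceeds MIN_GAP_LEN, the chunk covers an empty edge region of the kind described above.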

EricR86 commented 6 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


What do you mean, it "does not index" this data?

EricR86 commented 6 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


The chunk_starts and chunk_ends Genomedata/HDF5 attributes are not updated. The attributes are updated only when a gap longer than MIN_GAP_LEN is found. No "gaps" are detected at the beginning or end of a supercontig, since Genomedata looks for them only between existing data points.
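A simplified model of the behaviour described here, for illustration only. This is not the actual `find_chunks` code from `_load_seq.py`; the function name `find_chunks_sketch`, the value of `MIN_GAP_LEN`, and the half-open coordinates are all assumptions made for the example.

```python
import numpy as np

MIN_GAP_LEN = 3  # illustrative value only, not the real Genomedata constant


def find_chunks_sketch(data, min_gap_len=MIN_GAP_LEN):
    """Split a 1-D supercontig array into (starts, ends) chunk coordinates.

    Gaps are searched for only *between* present data points, so NaN runs
    at either edge of the supercontig are never seen as gaps.
    """
    present = np.flatnonzero(~np.isnan(data))
    if present.size == 0:
        return [], []
    # Number of NaN positions between each pair of consecutive data points
    gap_lens = np.diff(present) - 1
    split = np.flatnonzero(gap_lens > min_gap_len)
    # Interior gaps produce new boundaries; the outer boundaries stay at
    # the supercontig edges, which is the behaviour reported in this issue
    starts = [0] + [int(present[i + 1]) for i in split]
    ends = [int(present[i]) + 1 for i in split] + [len(data)]
    return starts, ends


nan = np.nan
# One data point followed by 10 NaNs: a single chunk spans everything
print(find_chunks_sketch(np.array([1.0] + [nan] * 10)))

# The interior gap of 5 NaNs splits the chunks; the edge NaN runs stay inside
print(find_chunks_sketch(np.array([nan, nan, 1.0, 2.0] + [nan] * 5 + [3.0] + [nan] * 4)))
```

In this model the trailing run of NaNs stays inside the last chunk however long it is, because there is no data point after it to bound a gap.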

EricR86 commented 6 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


After a discussion, the following solution was proposed: