find_chunks in _load_seq.py does not end a chunk early at the end of a supercontig

hoffmangroup / genomedata

The Genomedata format for storing large-scale functional genomics data.

https://genomedata.hoffmanlab.org/

GNU General Public License v2.0

find_chunks in _load_seq.py does not end a chunk early at the end of a supercontig #43

Open EricR86 opened 6 years ago

EricR86 commented 6 years ago

Original report (archived issue) by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

Currently, Genomedata does not index runs of missing data longer than MIN_GAP_LEN.

However, if a supercontig ends in a run of NaNs, that run is indexed regardless of its length. In the extreme case, a supercontig could begin with a single data point followed only by NaNs, and the chunk start and end would still span the entire region, even if the region were far longer than MIN_GAP_LEN.

As a result, Genomedata reports large empty regions at the beginning and end of supercontigs when the "chunk_starts/ends" attributes are used.
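
To make the behaviour concrete, here is a minimal sketch that reproduces the symptom. It is not the actual find_chunks code from _load_seq.py: the function name naive_chunks, the tiny MIN_GAP_LEN value, and the half-open [start, end) convention are all assumptions made for illustration. The sketch only splits at NaN runs that lie between two data points and pins the outer boundaries to the supercontig itself.

```python
import numpy as np

# Hypothetical, deliberately small value; not Genomedata's real setting.
MIN_GAP_LEN = 5


def naive_chunks(values, min_gap_len=MIN_GAP_LEN):
    """Split only at NaN runs that lie between two data points.

    The outer boundaries are pinned to the supercontig's own start and
    end, so leading and trailing NaN runs are never trimmed.
    Returns (starts, ends) as half-open [start, end) offsets.
    """
    present = np.flatnonzero(~np.isnan(values))
    if present.size == 0:
        return [], []
    # Number of missing positions between consecutive data points.
    gap_lens = np.diff(present) - 1
    split = np.flatnonzero(gap_lens > min_gap_len)
    starts = [0] + [int(present[i + 1]) for i in split]
    ends = [int(present[i]) + 1 for i in split] + [len(values)]
    return starts, ends


# Extreme case from the report: one data point followed by 1000 NaNs.
supercontig = np.full(1001, np.nan)
supercontig[0] = 1.0
print(naive_chunks(supercontig))  # ([0], [1001])
```

The single chunk covers all 1001 positions even though the trailing NaN run is far longer than the assumed MIN_GAP_LEN.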

EricR86 commented 6 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

Edited issue description

EricR86 commented 6 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

changed priority from "minor" to "major"

EricR86 commented 6 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

Edited issue description

EricR86 commented 6 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).

What do you mean, it "does not index" this data?

EricR86 commented 6 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

The chunk_starts and chunk_ends Genomedata/HDF5 attributes are not updated. They are only updated when gaps longer than MIN_GAP_LEN are found. No "gaps" are detected at the beginning or end of a supercontig, since Genomedata only looks for gaps between existing data points.

EricR86 commented 6 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

After a discussion, the following solution was proposed:

At telomeric regions, the chunk boundaries should start/end at the first/last occurrence of data.
Between supercontigs, chunks should have gaps if the distance between supercontigs is greater than MIN_GAP_LEN.
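
The first point of the proposal could look roughly like the sketch below. This is an illustration under assumptions, not a patch against _load_seq.py: trimmed_chunks, the small MIN_GAP_LEN value, and the half-open [start, end) convention are hypothetical. The only change from a "split between existing data points" approach is that the outer boundaries are taken from the first and last data point rather than from the supercontig edges. The second point, about gaps between supercontigs, is not covered here.

```python
import numpy as np

# Hypothetical, deliberately small value; not Genomedata's real setting.
MIN_GAP_LEN = 5


def trimmed_chunks(values, min_gap_len=MIN_GAP_LEN):
    """Return (starts, ends) with boundaries clamped to real data.

    Internal NaN runs longer than min_gap_len still split chunks; in
    addition, the first chunk starts at the first data point and the
    last chunk ends just past the last data point (half-open ends).
    """
    present = np.flatnonzero(~np.isnan(values))
    if present.size == 0:
        return [], []
    # Number of missing positions between consecutive data points.
    gap_lens = np.diff(present) - 1
    split = np.flatnonzero(gap_lens > min_gap_len)
    starts = [int(present[0])] + [int(present[i + 1]) for i in split]
    ends = [int(present[i]) + 1 for i in split] + [int(present[-1]) + 1]
    return starts, ends


# Three leading NaNs, two data points, then 1000 trailing NaNs.
supercontig = np.full(1005, np.nan)
supercontig[3:5] = [1.0, 2.0]
print(trimmed_chunks(supercontig))  # ([3], [5])
```

With this clamping, the leading and trailing NaN runs fall outside every chunk regardless of their length.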