hoffmangroup / genomedata

The Genomedata format for storing large-scale functional genomics data.
https://genomedata.hoffmanlab.org/
GNU General Public License v2.0

Unsorted input can result in > 100x slower performance #30

Open EricR86 opened 7 years ago

EricR86 commented 7 years ago

Original report (archived issue) by Coby Viner (Bitbucket: cviner2, GitHub: cviner).


Archive creation using unsorted tracks can result in vastly reduced performance.

In one representative case, generating a Genomedata archive for GRCh37/hg19 from a single 17 MiB BEDGraph file (containing data for 669,222 loci) took in excess of a month (the job was terminated prior to completion) when the file was not sorted, but completed within ~9 hours when the file was sorted.

Given this large disparity in performance, I suggest that files be automatically sorted on input. Alternatively, a warning could be emitted for small unsorted files, while large unsorted files are rejected with an error message.
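A minimal sketch of the pre-sort idea, in plain Python: sort BEDGraph records by chromosome label and then by numeric start coordinate before loading. The function name and file paths are illustrative, not part of the Genomedata API; for large files an external sort (e.g. `bedtools sort` or coreutils `sort -k1,1 -k2,2n`) would be preferable to loading everything into memory.

```python
def sort_bedgraph(in_path, out_path):
    """Sort a BEDGraph file by (chromosome, start) and write the result.

    Illustrative sketch only: loads the whole file into memory, so it
    is suitable for modestly sized inputs.
    """
    with open(in_path) as f:
        rows = [line.rstrip("\n").split("\t") for line in f if line.strip()]
    # Sort by chromosome name, then numeric start position (column 2).
    rows.sort(key=lambda r: (r[0], int(r[1])))
    with open(out_path, "w") as f:
        for r in rows:
            f.write("\t".join(r) + "\n")
```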

EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


How specifically did you sort your input? Do you know if there's a difference if you sort by chr only or region only? Presumably you sorted by both?

I'm elevating this issue because such a discrepancy in time should not exist for something as fundamental as the ordering of input data.

EricR86 commented 7 years ago

Original comment by Coby Viner (Bitbucket: cviner2, GitHub: cviner).


I did indeed sort by both (via bedtools sort). I did not conduct any further testing.

I do not think that this is particularly surprising, given the nature of the HDF5 archive construction. I believe unsorted data causes the same super-contigs to be repeatedly opened and closed, resulting in substantial overhead. This appeared to be corroborated by my log files for the unsorted data, which simply consisted of repeated instances of the usual per-region reading and writing operations (chr[1-22XY], [...], allocating memory for \d+ floats, reading \d+ floats... done, ..., writing \d+ floats... done).
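The overhead described above can be illustrated by counting chromosome-label switches in the input stream: sorted input switches at most once per chromosome, while unsorted input can switch on nearly every record, and each switch may force the loader to close one supercontig and open another. This helper is a hypothetical illustration, not part of Genomedata.

```python
def count_chrom_switches(chroms):
    """Count transitions between consecutive chromosome labels.

    Each transition corresponds to potentially closing one supercontig
    and opening another, which is where the repeated-open/close
    overhead comes from on unsorted input.
    """
    switches = 0
    prev = None
    for chrom in chroms:
        if chrom != prev:
            switches += 1
            prev = chrom
    return switches
```

For example, `["chr1", "chr1", "chr2"]` yields 2 switches, while the interleaved `["chr1", "chr2", "chr1", "chr2"]` yields 4, even though both cover the same two chromosomes.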

EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


Sorted datasets should be documented as a requirement for now; a longer-term fix would be simply to print a warning if the "chr" labels are out of order.
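The warning idea could be sketched as follows: flag the input as unsorted the first time a chromosome label reappears after a different label has intervened, meaning the records are not grouped by chromosome. The function name is illustrative and not part of the Genomedata codebase.

```python
import sys


def warn_if_unsorted(chrom_iter):
    """Return True (and print a warning) if chromosome labels are not
    grouped, i.e. some label reappears after a different one was seen.
    """
    seen = set()
    prev = None
    for chrom in chrom_iter:
        if chrom != prev and chrom in seen:
            print("warning: input not sorted by chromosome (%s reappears); "
                  "archive loading may be very slow" % chrom,
                  file=sys.stderr)
            return True
        seen.add(chrom)
        prev = chrom
    return False
```

This only checks grouping by chromosome, not ordering of start coordinates within a chromosome, which would be a straightforward extension.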