urbanairship / datacube

Multidimensional data storage with rollups for numerical data
http://urbanairship.com
Apache License 2.0

Consider autoflush=false for HBaseBackfillMerger #27

Open timrobertson100 opened 12 years ago

timrobertson100 commented 12 years ago

HBaseBackfillMerger does not appear to disable autoflush. I can't see anything in site-xml suggesting it can be disabled by default, so I wonder whether there is a design reason for this. We get a 10x increase in insert rate when autoflush is disabled, and on a 6B-record cube (Google tiles at 23 zooms for all species) the slower rate with autoflush enabled is crippling.

Before I propose a pull request, I thought I'd just ask.
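
To illustrate why buffering matters for insert rate, here is a minimal sketch that counts simulated round trips with and without autoflush. `BufferedWriterSketch`, `bufferLimit`, and `roundTripsFor` are hypothetical names, and this is a toy stand-in rather than the HBase client API. In the HBase client of that era the corresponding switch was, as far as I know, `HTable.setAutoFlush(false)`.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a table client; NOT the real HBase API. */
final class BufferedWriterSketch {
    private final boolean autoFlush;
    private final int bufferLimit; // hypothetical: rows held before a flush
    private final List<String> buffer = new ArrayList<>();
    private int roundTrips = 0;

    BufferedWriterSketch(boolean autoFlush, int bufferLimit) {
        this.autoFlush = autoFlush;
        this.bufferLimit = bufferLimit;
    }

    /** Queue one row; send immediately if autoflush is on or the buffer is full. */
    void put(String row) {
        buffer.add(row);
        if (autoFlush || buffer.size() >= bufferLimit) {
            flush();
        }
    }

    /** Send everything buffered in a single simulated round trip. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        roundTrips++;
        buffer.clear();
    }

    int roundTrips() {
        return roundTrips;
    }

    /** Write `rows` rows and report how many round trips were needed. */
    static int roundTripsFor(int rows, boolean autoFlush, int bufferLimit) {
        BufferedWriterSketch writer = new BufferedWriterSketch(autoFlush, bufferLimit);
        for (int i = 0; i < rows; i++) {
            writer.put("row-" + i);
        }
        writer.flush(); // push any remainder
        return writer.roundTrips();
    }

    public static void main(String[] args) {
        System.out.println("autoflush on:  " + roundTripsFor(1000, true, 100));
        System.out.println("autoflush off: " + roundTripsFor(1000, false, 100));
    }
}
```

With 1000 rows and a buffer of 100, the sketch makes 1000 round trips with autoflush on and 10 with it off. The real speedup depends on row size, buffer size, and cluster latency.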

drevell commented 12 years ago

@timrobertson100, I think the issue here is that we don't want to delay writes by the HBaseBackfillMergeMapper. The way cubes are merged is by reading all three input cubes (recalculated, snapshot, and live cubes), determining the new value for the live cube based on those three inputs, then writing the live cube. There is an obvious race condition here: if the live cube value is updated between the read and write, then the update will be lost. By flushing the write immediately, we limit the time window where this race condition causes problems.
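
The lost update can be sketched with a deterministic interleaving. Everything here is illustrative: `MergeRaceSketch` is a hypothetical name, and the merge rule `recalculated + (live - snapshot)` is an assumption made for the example, not necessarily what the mapper computes.

```java
/** Toy model of the read-compute-write race; not the mapper's actual code. */
final class MergeRaceSketch {

    /** Assumed merge rule: keep the increments that arrived since the snapshot. */
    static long merge(long recalculated, long snapshot, long live) {
        return recalculated + (live - snapshot);
    }

    /** A live update of `delta` lands between the merger's read and its write. */
    static long liveAfterRacyMerge(long recalculated, long snapshot,
                                   long liveAtRead, long delta) {
        long live = liveAtRead;
        long seen = live;                           // merger reads the live cell
        live += delta;                              // concurrent writer increments it
        live = merge(recalculated, snapshot, seen); // merger overwrites: delta is lost
        return live;
    }

    /** The same update lands after the merger's write, so nothing is lost. */
    static long liveAfterSerialMerge(long recalculated, long snapshot,
                                     long liveAtRead, long delta) {
        long live = merge(recalculated, snapshot, liveAtRead);
        live += delta;
        return live;
    }

    public static void main(String[] args) {
        System.out.println("racy:   " + liveAfterRacyMerge(120, 100, 105, 3));
        System.out.println("serial: " + liveAfterSerialMerge(120, 100, 105, 3));
    }
}
```

With these inputs the racy interleaving ends at 125 and the serial one at 128, so the increment of 3 is silently dropped. A write buffer lengthens the gap between the read and the moment the overwrite reaches the table, which gives more live updates a chance to fall into that gap.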

There could be other opportunities for optimizing the HBaseBackfillMergeMapper. Have you experimented with increasing the number of concurrent map tasks? We might even consider running multiple threads inside a single map task. This would help if the bottleneck is round-trip HBase latency.
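
As a rough sketch of the multi-threaded idea, the following runs simulated merges on a fixed thread pool so that several round trips are in flight at once. `ThreadedMergeSketch`, `simulatedMerge`, and the 5 ms sleep are hypothetical stand-ins for the real read-merge-write of one key; no HBase calls are made.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy model of overlapping per-key round trips inside one task. */
final class ThreadedMergeSketch {

    /** Stand-in for one read-merge-write cycle; the sleep mimics network latency. */
    static long simulatedMerge(long key) throws InterruptedException {
        Thread.sleep(5);
        return key * 2; // placeholder result so callers can check ordering
    }

    /** Merge all keys on `threads` workers, returning results in input order. */
    static List<Long> mergeAll(List<Long> keys, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (long key : keys) {
                futures.add(pool.submit(() -> simulatedMerge(key)));
            }
            List<Long> results = new ArrayList<>();
            for (Future<Long> future : futures) {
                results.add(future.get()); // blocks until that key is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while merging", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("merge failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Long> keys = new ArrayList<>();
        for (long i = 1; i <= 8; i++) {
            keys.add(i);
        }
        System.out.println(mergeAll(keys, 4));
    }
}
```

Each key's write is still issued immediately in this model, so the per-key race window stays as short as before; only the waiting on round trips overlaps.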