jghoman / haivvreo

Hive + Avro. Serde for working with Avro in Hive
Apache License 2.0

Haivvreo performance with Snappy vs. gzipped text #15

Closed (elibingham closed 12 years ago)

elibingham commented 12 years ago

Hi there,

I wanted to do a simple benchmark of Hive/Avro/Snappy vs. Hive using straight gzipped text. Surprisingly, I found that Haivvreo imposed about a 100% overhead in query time, though I'm not certain what the bottleneck is (possibly Avro overhead?). I was wondering if you could comment.

Specifically, using a small cluster of four nodes (with a 64 MB block size) and a dataset of 11 CSV files totalling about 1.9 million records and 339 MB uncompressed, I did the following:

1) Transformed the 11 CSVs into a simple flat Avro representation, stored as binary Avro files compressed using the Snappy codec. Compression ratio was a little better than 5:1.
2) Gzipped the 11 CSVs into 11 gzipped text files (getting about a 10:1 compression ratio). Note that the resulting gzipped CSVs were in the 2 to 5 MB range.
3) Created a Haivvreo-backed table A using the 11 avro.snappy files.
4) Created a gzipped-text-backed table B using the 11 gzipped CSVs.
5) Ran select count(*) from A; result returned in 36 seconds.
6) Ran select count(*) from B; result returned in 14 seconds.
7) Ran select a, count(*) as a_count from A group by a order by a_count desc limit 10; result returned in 56 seconds.
8) Ran select a, count(*) as a_count from B group by a order by a_count desc limit 10; result returned in 26 seconds.

This is surprising to me, since Snappy is supposed to offer a performance advantage in (de)compression. Is my benchmark flawed, or is Avro (or Haivvreo) a potential culprit here?
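For reference, the figures quoted above can be turned into per-file sizes and relative slowdowns. This is only a back-of-the-envelope sketch in Python: the helper names are made up for illustration, and the compression ratios are the approximate ones stated in steps 1 and 2, so the sizes are estimates rather than measurements.

```python
# Figures quoted in the post; helper names are illustrative only.
TOTAL_MB = 339      # uncompressed size of the dataset
NUM_FILES = 11
BLOCK_MB = 64       # HDFS block size on the cluster


def per_file_mb(total_mb, num_files, ratio):
    """Estimated average compressed file size for a given compression ratio."""
    return total_mb / num_files / ratio


def overhead_pct(slow_s, fast_s):
    """Extra time of the slower run, as a percentage of the faster run."""
    return (slow_s - fast_s) / fast_s * 100


snappy_mb = per_file_mb(TOTAL_MB, NUM_FILES, 5)    # roughly 5:1 (step 1)
gzip_mb = per_file_mb(TOTAL_MB, NUM_FILES, 10)     # roughly 10:1 (step 2)

print(f"avg Avro/Snappy file: {snappy_mb:.1f} MB")
print(f"avg gzipped CSV file: {gzip_mb:.1f} MB")
print(f"every file below one block: {max(snappy_mb, gzip_mb) < BLOCK_MB}")
print(f"count(*) overhead: {overhead_pct(36, 14):.0f}%")   # steps 5 and 6
print(f"group-by overhead: {overhead_pct(56, 26):.0f}%")   # steps 7 and 8
```

On these numbers every file is far smaller than a single 64 MB block in both formats, and the Avro-backed queries take more than twice as long as the gzipped-text ones.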

Thanks!

jghoman commented 12 years ago

I would bet on Avro being the problem: rebuilding object instances from the serialized data takes a while. It would be interesting to pull out the Hive/Haivvreo part of the test and just benchmark reading those files from HDFS as text files and via Avro. My expectation would be that you'd see close to the same amount of overhead. I don't expect Hive to add much once the tasks get to the task trackers.
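A minimal sketch of what such a standalone comparison could look like is below. It makes several assumptions that are not from this thread: the files have been copied to local disk, the third-party fastavro package is installed (with Snappy support) for the Avro side, and the function names are invented for illustration. It also exercises Python readers rather than the Java Avro library that Hive uses, so it only illustrates the shape of the test, not the absolute numbers.

```python
import csv
import gzip
import time


def count_gzip_csv(path):
    """Count rows in one gzipped CSV file."""
    with gzip.open(path, "rt", newline="") as fh:
        return sum(1 for _ in csv.reader(fh))


def count_avro(path):
    """Count records in one Avro container file.

    Assumption: the third-party fastavro package is available. It is
    imported here so the rest of the sketch runs without it.
    """
    import fastavro

    with open(path, "rb") as fh:
        return sum(1 for _ in fastavro.reader(fh))


def timed(reader, paths):
    """Return (total_records, elapsed_seconds) for `reader` over `paths`."""
    start = time.perf_counter()
    total = sum(reader(p) for p in paths)
    return total, time.perf_counter() - start
```

Calling `timed(count_gzip_csv, gzip_paths)` and `timed(count_avro, avro_paths)` on the same 11 files, where both path lists are placeholders, would give two record counts that should match and two timings to compare.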

mpouttuclarke commented 11 years ago

@elibingham: can you please give the complete source code and the Avro schema you were using?