jghoman / haivvreo

Hive + Avro. Serde for working with Avro in Hive
Apache License 2.0

Identify overhead in Haivvreo performance versus unencoded data #16

Open dkarvounis opened 12 years ago

dkarvounis commented 12 years ago

I find that queries on Avro data in Hive consistently take 3 to 4 times longer than on the same data in CSV format. As I understand it, Haivvreo/Avro should be faster.

I have ensured that:
- The number of mappers/reducers is the same in both cases.
- The performance difference persists whether the Avro data is compressed (Deflate) or uncompressed.
- The performance difference persists whether the Avro data consists of many small files or one large file.

Do benchmarks exist comparing queries using Haivvreo on Avro data versus Hive queries on unencoded data? If Haivvreo takes longer, the cause of the overhead should be identified. Should Avro be faster?

jghoman commented 12 years ago

3x-4x is not great, but not terribly surprising. The benchmarks we did were informal and along the lines of 'is this fast enough for our purposes', and it was. That being said, there's lots of room for improvement but very little time to get it done... I'm already quite indebted in terms of promises to get things done on the code.

dkarvounis commented 12 years ago

I've identified one speedup particular to my situation. My tables consist of data from multiple dumps, where the schema of each incoming Avro file has a string identifying that dump in the "doc" field. The variation in this doc field caused the reader and writer schemata to evaluate to unequal most of the time in AvroDeserializer.deserialize(), and so reencoding was forced even when schemata were otherwise equivalent. I tried a more relaxed schema equality method (ignoring doc), and this sped up queries ~30% on data with otherwise equivalent schemata. I understand that one's expectations of .equals() come down to one's particular situation and you'd probably want to use the .equals() provided by the API by default, but I thought I'd make this scenario known.
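The relaxed comparison described above could be sketched roughly as follows. This is only an illustration, not the code used in Haivvreo: it assumes each schema has already been parsed from its JSON form into nested `Map`/`List` values, and the names `RelaxedSchemaEquality`, `stripDoc`, `equalsIgnoringDoc` and `sampleSchema` are hypothetical. It deliberately avoids the Avro API, so wiring it into `AvroDeserializer.deserialize()` is left open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper; not part of Haivvreo or the Avro library.
final class RelaxedSchemaEquality {

    // Returns a deep copy of a parsed schema with every "doc" attribute removed.
    // Only map keys named "doc" are dropped; a field whose *name* is "doc" is kept,
    // because that string is a value under the "name" key.
    static Object stripDoc(Object node) {
        if (node instanceof Map<?, ?>) {
            Map<String, Object> copy = new LinkedHashMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                if (!"doc".equals(e.getKey())) {
                    copy.put(String.valueOf(e.getKey()), stripDoc(e.getValue()));
                }
            }
            return copy;
        }
        if (node instanceof List<?>) {
            List<Object> copy = new ArrayList<>();
            for (Object item : (List<?>) node) {
                copy.add(stripDoc(item));
            }
            return copy;
        }
        return node; // strings, numbers, booleans, null
    }

    // Two schemas are "equal" here if they match once all doc attributes are ignored.
    static boolean equalsIgnoringDoc(Object a, Object b) {
        return Objects.equals(stripDoc(a), stripDoc(b));
    }

    // Builds a small record schema for demonstration (assumed shape, one field).
    static Map<String, Object> sampleSchema(String doc, String fieldType) {
        Map<String, Object> field = new LinkedHashMap<>();
        field.put("name", "id");
        field.put("type", fieldType);

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("type", "record");
        record.put("name", "Event");
        record.put("doc", doc); // per-dump identifier, as in the scenario above
        record.put("fields", Arrays.asList((Object) field));
        return record;
    }

    public static void main(String[] args) {
        Map<String, Object> a = sampleSchema("dump-A", "long");
        Map<String, Object> b = sampleSchema("dump-B", "long");
        Map<String, Object> c = sampleSchema("dump-A", "string");

        System.out.println(a.equals(b));              // strict comparison: false
        System.out.println(equalsIgnoringDoc(a, b));  // differs only in doc: true
        System.out.println(equalsIgnoringDoc(a, c));  // field type differs: false
    }
}
```

The trade-off is the one noted above: a strict `.equals()` remains the safer default, and a comparison like this only makes sense where doc strings are known to carry no meaning for decoding.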

jghoman commented 12 years ago

That's a good optimization. With Hive 8 I added a hook to allow external serdes to populate the comments field, so we'll actually be able to see the Avro comments from Hive. I need to update Haivvreo to use this new API.