jghoman / haivvreo

Hive + Avro. Serde for working with Avro in Hive
Apache License 2.0

consolidated commit of column selective changes #25

Open svemuri opened 12 years ago

svemuri commented 12 years ago

This is a consolidated commit of the column-selective deserialization changes. In case you have looked at the code changes before, here is a brief description of the differences from that version:

  1. A unit test which exercises the optimization (TestColSelectiveSerde.java)
  2. A Hive-level functional test for query variants, including JOINs (colselectivetest.q). I have tested it on Hive8.
  3. Created a .gitignore file so that git ignores directories such as target and dist instead of treating them as candidates for files to be checked in.

In terms of the actual code changes:

  1. A table property (haivvreo.colselective) can be used to turn off the optimization by setting it to FALSE; it defaults to TRUE. The property can be set through the TBLPROPERTIES feature of Hive, which I have tested as well.
  2. An existing property, "hive.io.file.readcolumn.ids", is used to drive the optimization, i.e., to identify the set of columns requested.
  3. The code that walks the columns in AvroDeserializer requires the columns to be sorted and free of duplicates, so the generateColArray method in AvroSerDe.java ensures this condition holds even if the source property supplied by Hive does not conform to it.
  4. Timers track deserialization time and the number of records optimized, which helps verify whether the optimized code path was taken. These messages are printed using LOG.info(). I have kept them because I found them useful during testing and when running performance experiments.
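
For readers who have not opened the diff, the two property-handling steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the class name ColSelectiveSketch, the method signatures, and the handling of empty input are assumptions. Only the two property names and the method name generateColArray come from the description, and the sketch assumes the column ids arrive as a comma-separated string.

```java
import java.util.Arrays;
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical sketch; the real logic lives in AvroSerDe.java and may differ.
final class ColSelectiveSketch {
    // Property names as given in the description above.
    static final String COL_SELECTIVE_PROP = "haivvreo.colselective";
    static final String READ_COLUMN_IDS_PROP = "hive.io.file.readcolumn.ids";

    // The optimization stays on unless the table property is explicitly FALSE.
    // Case-insensitive matching is an assumption of this sketch.
    static boolean isColSelectiveEnabled(Properties tableProps) {
        String value = tableProps.getProperty(COL_SELECTIVE_PROP, "TRUE");
        return !"FALSE".equalsIgnoreCase(value.trim());
    }

    // Turns a comma-separated id list such as "3,1,3,0" into a sorted,
    // duplicate-free array. Null or empty input yields an empty array
    // (assumption: the caller would then read every column).
    static int[] generateColArray(String readColumnIds) {
        TreeSet<Integer> ids = new TreeSet<>(); // sorts and de-duplicates
        if (readColumnIds != null) {
            for (String token : readColumnIds.split(",")) {
                String trimmed = token.trim();
                if (!trimmed.isEmpty()) {
                    ids.add(Integer.parseInt(trimmed));
                }
            }
        }
        int[] result = new int[ids.size()];
        int i = 0;
        for (int id : ids) {
            result[i++] = id;
        }
        return result;
    }

    public static void main(String[] args) {
        Properties tableProps = new Properties();
        tableProps.setProperty(READ_COLUMN_IDS_PROP, "3,1,3,0");

        if (isColSelectiveEnabled(tableProps)) {
            int[] cols = generateColArray(tableProps.getProperty(READ_COLUMN_IDS_PROP));
            System.out.println(Arrays.toString(cols)); // prints [0, 1, 3]
        }
    }
}
```

A TreeSet is used here only because it gives sorting and de-duplication in one step; sorting an array and skipping repeated values would satisfy the same requirement.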