Kite SDK
http://kitesdk.org/docs/current/
Apache License 2.0
KITE-1083: Add single jars to DistCache for Hive. #419

Closed rdblue closed 8 years ago

rdblue commented 8 years ago

This changes the way jobs submitted by the CLI are configured. Previously, the entire Hive lib directory was added to the distributed cache, which caused long job and task startup times and exposed jar conflicts. This commit updates the setup so that only the individual jars containing the classes needed to interact with the Hive MetaStore are added. If the job is local or does not interact with Hive, no Hive dependencies are added to the distributed cache at all.
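
For illustration, the per-class approach can be sketched with the standard library alone. This is not the code in this PR: `HiveJarLocator`, `jarsFor`, and the `jobIsLocal`/`usesHive` flags are hypothetical names, and the sketch only resolves which jar (or class directory) each required class was loaded from. A real job would then register those locations with its distributed cache.

```java
import java.net.URL;
import java.security.CodeSource;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Kite: finds the jars that hold specific classes.
final class HiveJarLocator {

  // Returns the jar or directory a class was loaded from,
  // or null when the JVM reports no code source (e.g. core JDK classes).
  static String locationOf(Class<?> cls) {
    CodeSource source = cls.getProtectionDomain().getCodeSource();
    if (source == null) {
      return null;
    }
    URL location = source.getLocation();
    return location == null ? null : location.toExternalForm();
  }

  // Collects one entry per distinct jar, and nothing at all when the job
  // is local or does not touch Hive.
  static Set<String> jarsFor(boolean jobIsLocal, boolean usesHive, Class<?>... required) {
    if (jobIsLocal || !usesHive) {
      return Collections.emptySet();
    }
    Set<String> jars = new LinkedHashSet<>();
    for (Class<?> cls : required) {
      String location = locationOf(cls);
      if (location != null) {
        jars.add(location); // the set drops duplicates when classes share a jar
      }
    }
    return jars;
  }

  public static void main(String[] args) {
    // In a real job the arguments would be the Hive MetaStore client classes.
    for (String jar : jarsFor(false, true, HiveJarLocator.class, String.class)) {
      System.out.println(jar);
    }
  }
}
```

Listing classes rather than jar file names keeps the lookup independent of jar versions and install paths, but the list itself still has to be maintained by hand.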

rdblue commented 8 years ago

Notes:

tomwhite commented 8 years ago

This seems very brittle. If a dependency changes then it won't be obvious that TransformTask needs to be updated.

rdblue commented 8 years ago

I agree that this is brittle, but we need to connect to the metastore in setup and commit tasks. Adding the Hive lib directory was worse: on clusters that use parcels, it pulled in the parcel's entire jars directory. That caused version conflicts and made jobs take forever because so many jars were added to the distributed cache.

We can try to come up with a better solution, but for now I know this works and is better than the old method. We should be able to catch dependency changes with integration testing.
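
As a rough sketch of what such a check might look like (an assumption, not something in this PR), a test could take the list of class names the job depends on and report any that no longer resolve on the classpath. `RequiredClassCheck` and `missing` are hypothetical names, and the class names passed in would be whatever the job actually requires.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical check, not part of Kite: reports required classes that cannot be loaded.
final class RequiredClassCheck {

  // Returns the subset of classNames that cannot be resolved on the classpath.
  static List<String> missing(String... classNames) {
    List<String> missing = new ArrayList<>();
    ClassLoader loader = RequiredClassCheck.class.getClassLoader();
    for (String name : classNames) {
      try {
        // initialize=false: only resolve the class, do not run static initializers
        Class.forName(name, false, loader);
      } catch (ClassNotFoundException | LinkageError e) {
        missing.add(name);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // Placeholder names; a real test would list the classes the job needs.
    System.out.println(missing("java.lang.String", "example.DoesNotExist"));
  }
}
```

A check like this only catches classes that were renamed or removed; it would not notice a new transitive dependency that is missing from the list.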

tomwhite commented 8 years ago

I haven't got a better suggestion, so I agree that it's OK to go in. Could you add a comment to the POM dependencies that points to this class, so anyone changing them knows to look there too?

rdblue commented 8 years ago

Good idea, I will.