kite-sdk / kite-examples

Kite SDK Examples
Apache License 2.0

CDK-421: Use CLI in demo #4

Closed rdblue closed 4 years ago

tomwhite commented 10 years ago

Looks good overall. Did you run through the whole example? :)

rdblue commented 10 years ago

Yes, I ran through the whole example, even the Oozie part! I think we should separate that part out into its own example that demonstrates Ben's recent work.

tomwhite commented 10 years ago

I tried running through the demo example, but hit a problem when trying to create the dataset:

dataset create sessions --schema hdfs://localhost:8020/user/$USER/schemas/session.avsc --directory /tmp/data
ls: /Users/tom/sw/hcatalog-0.5.0-cdh4.4.0/bin/../share/hcatalog/hcatalog-core-[0-9]*.jar: No such file or directory
WARNING: Cannot configure the Hive classpath!
Try setting HIVE_HOME to fix this warning
ls: /Users/tom/sw/hcatalog-0.5.0-cdh4.4.0/bin/../share/hcatalog/hcatalog-core-[0-9]*.jar: No such file or directory
Picked up _JAVA_OPTIONS: -Djava.awt.headless=true
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/thrift/TException
     at org.kitesdk.data.hcatalog.HCatalogMetadataProvider.getHcat(HCatalogMetadataProvider.java:50)
     at org.kitesdk.data.hcatalog.HCatalogMetadataProvider.exists(HCatalogMetadataProvider.java:86)
     at org.kitesdk.data.hcatalog.HCatalogExternalMetadataProvider.create(HCatalogExternalMetadataProvider.java:75)
     at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.create(FileSystemDatasetRepository.java:126)
     at org.kitesdk.cli.commands.CreateDatasetCommand.run(CreateDatasetCommand.java:75)
     at org.kitesdk.cli.Main.run(Main.java:131)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
     at org.kitesdk.cli.Main.main(Main.java:183)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
     at java.lang.reflect.Method.invoke(Method.java:597)
     at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.ClassNotFoundException: org.apache.thrift.TException
     at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
     at java.security.AccessController.doPrivileged(Native Method)
     at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
     at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
     at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
     ... 13 more  

I installed Hadoop, Hive, and HCatalog locally (from the 4.4.0 tarballs at http://archive-primary.cloudera.com/cdh4/) and set HADOOP_HOME, HIVE_HOME, and HCAT_HOME. I still got this error, which suggests that the HCatalog install from a tarball isn't working.
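
The NoClassDefFoundError in the trace says that org.apache.thrift.TException could not be loaded, so one quick way to narrow this down is to check whether that class is visible on a given classpath at all. The following is only a minimal sketch for that check; the class name ClasspathCheck and its isLoadable helper are hypothetical and are not part of Kite or Hadoop.

```java
// Hypothetical diagnostic helper (not part of Kite): reports whether a class
// can be found by the class loader that loaded this program.
final class ClasspathCheck {

    /** Returns true if the named class is visible to this class's loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Covers both a missing class and a class whose own dependencies are missing.
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the stack trace above.
        String name = args.length > 0 ? args[0] : "org.apache.thrift.TException";
        String verdict = isLoadable(name) ? "found" : "NOT found";
        System.out.println(name + ": " + verdict + " on the classpath");
    }
}
```

Running it with the same classpath the CLI ends up with should show whether the Thrift JAR is missing or whether the problem lies elsewhere.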

All of the examples can be run from the user's machine or on the VM, so we need to think about how to support that here. To run from the user's machine, the Hadoop components now need to be installed (e.g. from tarballs on a Mac). That wasn't the case before, since the Kite Maven Plugin provides the relevant libraries. As it stands, setting up the CLI to run from the user's machine is more work than using the existing Maven plugin, so we should think about how to improve that before merging this. Could we create a version of the CLI that bundles the relevant JARs like the Maven plugin does?

rdblue commented 10 years ago

> Could we create a version of the CLI that bundles the relevant JARs like the Maven plugin does?

I think what we need is to bundle the dependencies for Hadoop profiles so that users can download the client bundle for their cluster or VM. Including these in the CLI jar increases its size by 10x. This also means coming up with a scheme to find and use the bundled dependencies. If we want to wait on updating the example until this is done, I'm fine with that. We should also remove the HCatalog dependency entirely, like in your branch, for 0.15.0.
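
To make "a scheme to find and use the bundled dependencies" concrete, here is one possible shape for it: collect every JAR in a bundle directory and put them behind a URLClassLoader. This is a sketch under assumptions, not how the Kite CLI works. The BundledDeps class, its method names, and the default "lib" directory are all hypothetical.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch (not Kite code): load dependencies from a bundle directory.
final class BundledDeps {

    /** Lists the *.jar files directly inside libDir as URLs, sorted by name. */
    static List<URL> findJars(Path libDir) throws IOException {
        List<URL> jars = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : stream) {
                jars.add(jar.toUri().toURL());
            }
        }
        // Directory iteration order is unspecified, so sort for a stable classpath.
        jars.sort(Comparator.comparing(URL::toString));
        return jars;
    }

    /** Builds a class loader over the bundled JARs, delegating to parent first. */
    static URLClassLoader loaderFor(Path libDir, ClassLoader parent) throws IOException {
        return new URLClassLoader(findJars(libDir).toArray(new URL[0]), parent);
    }

    public static void main(String[] args) throws IOException {
        // "lib" is an assumed default location for the downloaded client bundle.
        Path libDir = Paths.get(args.length > 0 ? args[0] : "lib");
        try (URLClassLoader loader = loaderFor(libDir, BundledDeps.class.getClassLoader())) {
            Class<?> c = Class.forName("org.apache.thrift.TException", false, loader);
            System.out.println("Loaded " + c.getName() + " from bundle in " + libDir);
        } catch (ClassNotFoundException e) {
            System.out.println("Bundle in " + libDir + " does not provide " + e.getMessage());
        }
    }
}
```

A launcher script could do the equivalent by appending the bundle's JARs to the classpath before starting the CLI; the trade-off against putting everything in one JAR is download size versus the extra lookup step.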

tomwhite commented 10 years ago

> Including these in the CLI jar increases its size by 10x.

I'm not sure this is a problem.

> This also means coming up with a scheme to find and use the bundled dependencies.

We could include them in the JAR.

> If we want to wait on updating the example until this is done, I'm fine with that.

Yes, I think that is the best way forward.

> We should also remove the HCatalog dependency entirely, like in your branch, for 0.15.0.

Agreed. This is tracked in https://issues.cloudera.org/browse/CDK-423