kite-sdk / kite

Kite SDK
http://kitesdk.org/docs/current/
Apache License 2.0

Running the kite-sdk commands in mapreduce mode #426

Open malathit opened 8 years ago

malathit commented 8 years ago

Hi,

I had a look at the Kite dataset code and found that Kite uses Apache Crunch internally to run its MapReduce pipelines.

In my case, I invoke the Kite CLI from Oozie to import JSON data. I noticed that, by default, the Apache Crunch pipeline runs MapReduce in LocalRunner mode. How can I run it in distributed MapReduce mode instead?

Regards, Malathi

rdblue commented 8 years ago

Kite will use MR on the cluster if both the source and the destination datasets are distributed. So local to HDFS uses the local runner, while HDFS to Hive uses MR.
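
The rule above can be restated as a small shell sketch. This is only an illustration of the rule as described in this comment, not Kite's actual implementation. The helper names `is_distributed` and `choose_runner` are hypothetical, the set of schemes treated as "distributed" is an assumption, and `dataset:hive:default/events` is a made-up dataset URI used purely as an example.

```shell
#!/bin/sh
# Illustration only: restates the rule from the comment above, not Kite's code.

# Hypothetical helper: succeeds when the location names a cluster store.
# Assumption: "distributed" means the URI starts with an hdfs: or hive:
# scheme, optionally behind a dataset: prefix.
is_distributed() {
  case "$1" in
    hdfs:*|hive:*|dataset:hdfs:*|dataset:hive:*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: prints the runner implied by the rule.
# MR is chosen only when BOTH ends are distributed.
choose_runner() {
  if is_distributed "$1" && is_distributed "$2"; then
    echo "mapreduce"
  else
    echo "local"
  fi
}

# A bare file name has no scheme, so this pair falls back to the local runner.
choose_runner file.csv dataset:hive:default/events

# Both ends are on the cluster, so this pair selects MR.
choose_runner hdfs:/user/me/file.csv dataset:hive:default/events
```

Under these assumptions, a single location without a cluster scheme is enough to keep the whole job on the local runner.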

malathit commented 8 years ago

Hi,

Thanks for the reply. In my case, the data is in HDFS and is being written to a Hive dataset that was created by Hive. The program still runs with the local runner, though. Any idea whether I have missed something obvious?

rdblue commented 8 years ago

What command are you running? If you don't specify hdfs:/... then Kite assumes you mean a local path. So if you run hdfs dfs -put file.csv and then run kite-dataset csv-import file.csv ..., Kite will find and use the local copy instead of the one you just put in HDFS. You have to use the full URI, like this: kite-dataset csv-import hdfs:/user/me/file.csv ...
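
When the import is launched from a wrapper script (for example an Oozie shell action), one way to avoid silently picking up a local file is to check the source path before calling the CLI. The sketch below is a minimal example of that idea; `require_hdfs_uri` is a hypothetical helper, not part of Kite, and the import command itself is left as a comment with its remaining arguments elided, as in the comment above.

```shell
#!/bin/sh
# Hypothetical guard: accept only paths that start with the hdfs: scheme.
# A path without a scheme would be treated as local, per the comment above.
require_hdfs_uri() {
  case "$1" in
    hdfs:*) return 0 ;;
    *)
      echo "refusing '$1': no hdfs: scheme, so it would be read as a local path" >&2
      return 1
      ;;
  esac
}

src="hdfs:/user/me/file.csv"
if require_hdfs_uri "$src"; then
  echo "ok: $src"
  # The import would go here, with the full URI as its source:
  # kite-dataset csv-import "$src" ...
fi
```

Failing fast in the wrapper makes the mistake visible in the job log instead of showing up later as an unexpected local run.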