Hive Dataset as external table with HDFS Dataset #461

Open kchen0x opened 7 years ago

kchen0x commented 7 years ago

I create a dataset on HDFS with a schema and a partition strategy:

kite-dataset create dataset:hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw --schema sensorRecord.avsc --partition-by partition.json

and use Gobblin to continuously ingest data from Kafka into HDFS. The partition strategy looks like:

[
  {"type": "identity", "source": "src", "name": "source"},
  {"type": "year",     "source": "timestamp"},
  {"type": "month",    "source": "timestamp"},
  {"type": "day",      "source": "timestamp"},
  {"type": "hour",     "source": "timestamp"}
]

This part works well.
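For reference, with this strategy Kite writes the data under Hive-style name=value partition directories, so a file ends up under a path roughly like this (the source value here is just an example):

/user/pnda/PNDA_datasets/datasets/kafka/depa_raw/source=depa/year=2017/month=5/day=18/hour=9/<file>.avro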

Then I try to use Hive to query this data, so I create a new Hive dataset as an external table by assigning the --location parameter:

kite-dataset create depa_raw --location hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw

Then I can find the table default/depa_raw and its data in Hive.

But one thing is wrong. As data keeps coming from Kafka into HDFS, new partition directories appear in HDFS, but no partitions are created automatically in the Hive table, which means I can't see the newly arrived data in Hive.

So what can I do to solve this problem? (I just want to see newly arriving data in Hive.)

mkwhitacre commented 7 years ago

It is not a great solution but you can repair[1] the table with:

MSCK REPAIR TABLE

[1] - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
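For the table in this thread that would be something like the following (assuming it lives in the default database):

MSCK REPAIR TABLE depa_raw;

Hive scans the table's location, finds partition directories the metastore does not know about yet, and adds them. It only picks up what exists at the time it runs, so it has to be re-run (or scheduled) as new partitions keep arriving from Kafka.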

kchen0x commented 7 years ago

@mkwhitacre Thank you very much, it really solved my problem.

I have one more question: I use Gobblin to set up a MapReduce job that consumes data from Kafka and writes it to a Kite dataset. But when I try to write directly to dataset:hive:depa_raw with

Datasets.load(datasetURI)

the MapReduce job always fails without a specific exception. It only works correctly when I set datasetURI="dataset:hdfs://<ip>:<port>/path/to/depa_raw".

That is why I created a new Hive dataset:

kite-dataset create depa_raw --location hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw

So, what could be the reason for this problem?
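For reference, here is roughly what the two cases look like with the Kite Java API; using GenericRecord as the entity type is an assumption on my side:

import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.Datasets;

// Loading by HDFS URI works inside the MapReduce job:
Dataset<GenericRecord> viaHdfs = Datasets.load(
    "dataset:hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw",
    GenericRecord.class);

// Loading the same data through the Hive metastore always fails:
Dataset<GenericRecord> viaHive = Datasets.load(
    "dataset:hive:depa_raw",
    GenericRecord.class);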

mkwhitacre commented 7 years ago

Does it also fail when you do: "dataset:hdfs://nameservice/path/to/depa_raw"?

Without a specific exception it is harder to diagnose, but I'm guessing the problem is that your config is missing the Hive Metastore settings, or that the required jars are not on your classpath.
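One thing worth checking, as a sketch only: make sure the Hadoop Configuration that Kite sees inside the job actually contains the Hive metastore settings. The hive-site.xml path below and the use of DefaultConfiguration are assumptions to adapt to your cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.kitesdk.data.spi.DefaultConfiguration;

Configuration conf = new Configuration();
// hive-site.xml carries hive.metastore.uris; without it a dataset:hive: URI
// has no way to find the metastore, while a dataset:hdfs: URI still works.
conf.addResource(new Path("/etc/hive/conf/hive-site.xml")); // assumed location
DefaultConfiguration.set(conf); // make this the configuration Kite uses for URI lookups

Also check that the kite-data-hive jar (and its Hive dependencies) is on the job's classpath, since that module provides the hive: URI scheme.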

kchen0x commented 7 years ago

No matter what format I use, as long as it is a dataset:hdfs URI it works, but dataset:hive does not. So imagine I have two datasets here.

All the configuration and the class I've used are here:

https://github.com/quentin-chen/gobblin-pnda