tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0

Apache ORC Support in TensorFlow IO #1372

Open oliverhu opened 3 years ago

oliverhu commented 3 years ago

(Creating this issue for visibility so people interested can join the discussion... )

Overview

Load Apache ORC-formatted data natively into TensorFlow from any file system that TensorFlow supports, e.g. HDFS or local disk.
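As a rough illustration of the user-facing side, here is a minimal sketch built on the `tfio.IODataset.from_orc(..., capacity=...)` call quoted later in this thread. The helper names `path_scheme` and `load_orc` are hypothetical and not part of TensorFlow IO, and the defaults simply mirror the values in that snippet. Whether a given scheme such as `hdfs://` actually works depends on the installed TensorFlow IO version and file system plugins.

```python
from urllib.parse import urlparse


def path_scheme(path):
    """Return the URI scheme of a dataset path, or "file" for plain local paths.

    Hypothetical helper, only used to make the target file system explicit.
    """
    return urlparse(path).scheme or "file"


def load_orc(path, capacity=15, batch_size=1):
    """Build a batched dataset from an ORC file.

    Hypothetical wrapper around the call quoted in this thread; the defaults
    are copied from that snippet and are not recommendations.
    """
    # Imported lazily so the path helper above works without tensorflow-io.
    import tensorflow_io as tfio

    return tfio.IODataset.from_orc(path, capacity=capacity).batch(batch_size)
```

With tensorflow-io installed, `load_orc("iris.orc")` would correspond to the local-disk case. `path_scheme("hdfs://xxx/yy/iris.orc")` returns `"hdfs"`, which can be useful to log before reading from a remote file system.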

Motivation

We have traditionally used Avro to store our datasets, but a row-based format is becoming inefficient for big data analytics processing. Historically, we selected ORC as our columnar storage format. (Not planning to argue Parquet vs ORC here ;))

Design Discussions

Milestones

kvignesh1420 commented 3 years ago

@oliverhu any update on this?

oliverhu commented 3 years ago

no update recently @kvignesh1420

kvignesh1420 commented 3 years ago

@oliverhu can we document the current feature in the form of a tutorial?

oliverhu commented 3 years ago

sure, will add that !

kvignesh1420 commented 3 years ago

Reference FYKI: https://github.com/tensorflow/io/tree/master/docs/tutorials

372046933 commented 2 years ago

Is HDFS supported now? Loading from an HDFS path results in a core dump:

dataset = tfio.IODataset.from_orc("hdfs://xxx/yy/iris.orc", capacity=15).batch(1)

HDFS support (with Kerberos) was added by https://github.com/tensorflow/io/pull/1674