tensorflow/io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO

Standardize columnized dataset? #315

yongtang opened this issue 5 years ago

yongtang commented 5 years ago

With the upcoming DatasetV2, a lot of the APIs are getting simplified. That also opens up additional possibilities beyond just passing the dataset to tf.keras.

One area of interest is that we already have support for many columnized datasets, e.g., Arrow, Avro, Parquet, JSON, HDF5, etc. Those datasets could potentially be standardized behind the same API so that we can treat them homogeneously. For example, ArrowDataset already exposes a columns() property method. We could apply the same to Avro, Parquet, JSON, HDF5, etc. Thoughts?

Since those columnized datasets hold largely numeric values, I think we could also have a common base class for them and support additional operations. For example, dataset_1 + dataset_2 => dataset_3 (add), where dataset_3 could be passed to tf.keras. The implementation could start with zip + map in Python (not even needed in C++); see the sketch below. Maybe this could be one use case that will help users?
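A minimal sketch of the zip + map idea, assuming element-wise addition of two numeric datasets (the `add_datasets` helper is illustrative, not an existing API):

```python
import tensorflow as tf

def add_datasets(dataset_1, dataset_2):
    # Pair up corresponding elements from both datasets, then add them.
    return tf.data.Dataset.zip((dataset_1, dataset_2)).map(lambda x, y: x + y)

dataset_1 = tf.data.Dataset.from_tensor_slices([1.0, 2.0, 3.0])
dataset_2 = tf.data.Dataset.from_tensor_slices([10.0, 20.0, 30.0])
dataset_3 = add_datasets(dataset_1, dataset_2)  # yields 11.0, 22.0, 33.0
```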

/cc @terrytangyuan @BryanCutler

BryanCutler commented 5 years ago

Sounds like a good idea to me. I think it would be pretty useful to support operations and to make composite columnar datasets, like merging two different datasets into one.

terrytangyuan commented 5 years ago

Yes, agreed that this will be helpful to our users. "The implementation could start with zip + map in Python (not even needed in C++)" - I am worried about the performance and the effort spent implementing each additional operation, though. Is there a way to reuse what's available in pandas/pyspark, etc.?

yongtang commented 5 years ago

To make datasets more useful we also need to support write (save), so that a dataset can be saved in a given format (e.g., Parquet, Arrow, HDF5, CSV, JSON, etc.).

I have created an experimental PR #326 to add TextDataset's save support. (It will be expanded to other dataset formats later.)

yongtang commented 5 years ago

@BryanCutler @terrytangyuan With the recent PRs, I am writing down some notes about data formats:

There are many data formats available and it is hard to fit them all into one type. From a data science point of view, pandas could be a good reference, though I noticed that even pandas does not handle some edge cases very well. Here is the list I have come up with; I am also seeking suggestions on certain cases.

1) Standalone n-dimensional array:

This is the type commonly seen as `numpy`'s `array`; it is also referred to as a `tensor` in TensorFlow and a `dataset` in HDF5.

Note that pandas considers HDF5 a columnar/tabular format. But that is largely an "overuse" and may cause issues, as the datasets within one HDF5 file could be totally unrelated.

While this format normally fits into memory, there are cases where we really want to take just a slice of it from the imported file.
For example, a `WAV` file is an audio format that can be considered a 2-D `[n_samples, n_channels]` array, and users may need to cut only a small chunk from it; a sketch follows the list below.

**Operation that could be useful**:
- n-dimensional `slicing and indexing`
- `map`, `shuffle`, `filter`, `reduce`, `foldl`, `foldr` against a specific `axis`.
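To illustrate the slicing case, here is a minimal sketch that reads only a chunk of a WAV file using the standard library (assuming 16-bit PCM; in tensorflow/io this would live behind the dataset API rather than in user code):

```python
import wave
import numpy as np

def wav_slice(path, start, stop):
    # Seek to the requested sample range and decode only that chunk,
    # instead of reading the whole file into memory first.
    with wave.open(path, "rb") as f:
        n_channels = f.getnchannels()
        f.setpos(start)                      # skip to the first requested sample
        frames = f.readframes(stop - start)  # read only the slice
        data = np.frombuffer(frames, dtype=np.int16)  # assumes 16-bit PCM
        return data.reshape(-1, n_channels)  # [n_samples, n_channels]

chunk = wav_slice("recording.wav", 16000, 32000)  # one second at 16 kHz
```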

2) Time-series 1-dimensional array (optionally with an index array):

This is the type pandas defines as `Series`. It normally is a tuple of `(timestamp, value)`. In pandas, a `Series` could also be a tuple of `(index, value)`. Pandas handles `Series` very well, though pandas trying to fit a `Series` as a column of a `DataFrame` is indeed an overuse, as there are some fundamental issues that are hard to reconcile. See the discussion about the columnar/tabular format later.

One good example of time-series is Prometheus observation data of `(timestamp, value)`. In reality, the timestamp could be dropped for machine learning, as Prometheus automatically extrapolates the timestamps.

**Operation that could be useful**:
- 1-dimensional `slicing and indexing`
- `sliding window`
- `tuple` into columnar/tabular
- `map`, `shuffle`, `filter`, `reduce`, `foldl`, `foldr` (`axis` always = 0)
- **range selection based on timestamp** (see the sketch below)
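A minimal sketch of range selection on a `(timestamp, value)` dataset using plain `tf.data` (the data and bounds are illustrative):

```python
import tensorflow as tf

timestamps = tf.constant([100, 200, 300, 400], dtype=tf.int64)
values = tf.constant([1.0, 2.0, 3.0, 4.0])
series = tf.data.Dataset.from_tensor_slices((timestamps, values))

# Select the half-open time range [150, 350).
selected = series.filter(lambda ts, v: (ts >= 150) & (ts < 350))
# yields (200, 2.0) and (300, 3.0)
```

Note that a `filter` scans the whole stream; a format with a real time index could instead seek directly to the range.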

3) Standalone 1-dimensional array:

I am pointing this format out because the 1-dimensional array is the building block of `DataFrame` in pandas and R. However, this format is not to be confused with `time-series` or the `n-dimensional array`, as the operations are really different.

Related issues will be discussed in columnar/tabular section.

**Operation that could be useful**:
- 1-dimensional `slicing and indexing`
- `sliding window` (see the sketch below)
- `tuple` into columnar/tabular
- `map`, `shuffle`, `filter`, `reduce`, `foldl`, `foldr` (`axis` always = 0)
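As an illustration of the `sliding window` operation, a minimal sketch with plain `tf.data` (window size and shift are illustrative):

```python
import tensorflow as tf

data = tf.data.Dataset.from_tensor_slices([0.0, 1.0, 2.0, 3.0, 4.0])

# Sliding windows of 3 elements, moving one element at a time.
windows = data.window(size=3, shift=1, drop_remainder=True)
windows = windows.flat_map(lambda w: w.batch(3))  # each window -> one tensor
# yields [0, 1, 2], [1, 2, 3], [2, 3, 4]
```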

4) Columnar/Tabular multi-column 1-dimensional array:

This is the `DataFrame` in pandas and R.

Note that columnar/tabular formats are really just tuples of standalone 1-dimensional arrays, where each array/column may have a different data type. They are not really n-dimensional arrays, and operations that apply to n-dimensional arrays do not necessarily fit columnar/tabular data.

One big issue to reconcile is the "degeneration" of columnar/tabular data into a standalone 1-dimensional array. The APIs could be confusing if this is not handled well.

For example, for a 1-dimensional array we could add `dtype` and `shape` properties. With a multi-column 1-dimensional array, we could have `dtype(column)` and `shape(column)` take `column` as the key. But what about a tabular format with just one column? Should we have `dtype(None)`, or plain `dtype`?

Overall I think we could just take a tuple approach, similar to how the tuple `(1,)` is different from `(1)` in Python (quite subtle). I haven't made the implementation completely conform to Python's tuple semantics yet.

Also, in pandas, accessing each column through `[column]` is quite confusing to many users, as the axis order is quite unconventional.

Note that in PR #437 I use `(column)` to access a column, and `[]` for indexing and slicing within a column. I think this approach might be better; a sketch follows the list below.

**Operation that could be useful**:
- all the 1-dimensional operations above (applied per column)
- merge into a multi-dimensional array (if the data types match).
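A minimal sketch of that access convention (the `ColumnarTensor` and `Column` classes are illustrative, not the actual PR #437 implementation):

```python
import numpy as np

class Column:
    """A standalone 1-D column: `[]` indexes and slices within the column."""
    def __init__(self, values):
        self._values = values

    @property
    def dtype(self):
        return self._values.dtype

    def __getitem__(self, key):
        return self._values[key]

class ColumnarTensor:
    """A tuple of named 1-D columns: `(column)` selects one column."""
    def __init__(self, columns):
        self._columns = columns  # dict of name -> 1-D numpy array

    def __call__(self, column):
        # `(column)` degenerates the table into one standalone 1-D column.
        return Column(self._columns[column])

table = ColumnarTensor({"x": np.array([1, 2, 3]), "y": np.array([4.0, 5.0, 6.0])})
print(table("x")[1:3])   # -> [2 3]
print(table("y").dtype)  # -> float64
```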

5) Key-value pairs of two 1-dimensional arrays:

This is similar to time-series in appearance, but they are fundamentally different: a key-value pair of two 1-dimensional arrays has to be indexable by key, while a time-series does not necessarily need to be indexable by timestamp; it only needs to be selectable through a range of timestamps.

One good example is LMDB data. At the time LMDB was added into tf.contrib, I tried to fit LMDB into an iterable of tf.data. This was used in a similar way by the SUN dataset in tensorflow-datasets.

However, LMDB is really a key-value store, and treating a key-value store as an iterable of key/value pairs is not the right usage, because 1) a user typically wants either one value associated with a key, or multiple values associated with selected keys, and 2) the values could be huge. So the correct usage should be:

**Operation that could be useful**:
- **iteration of keys** (possibly with some filtering mechanism)
- **selection of value based on key** (indexing through key)

Note that with the rise of NoSQL, a lot of key-value data stores fit this scenario. Even S3 and GCS are key-value stores in nature; the content of a value could be a file, which is why we have the S3/GCS file systems, but that is a separate discussion.
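As an illustration of the two operations using the standalone `lmdb` Python package (the path, key prefix, and key are illustrative):

```python
import lmdb

env = lmdb.open("/path/to/lmdb", readonly=True)
with env.begin() as txn:
    # Iteration of keys, with an optional filtering mechanism.
    keys = [key for key, _ in txn.cursor() if key.startswith(b"img_")]

    # Selection of a value based on a key (indexing through the key).
    value = txn.get(b"img_00042")
```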

6) Pure streaming/iterable:

Kafka/PubSub/etc. are the standard examples; a large PCAP file could also be a good one. Basically the only viable operation is to iterate through the stream. There might be cases where people want to convert it into a 1-dimensional array, if the total number of elements fits into memory.

**Operation that could be useful**:
- iteration of elements
- conversion to a 1-dimensional array (if it fits into memory)
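A minimal sketch of both operations with plain `tf.data`, using a generator as a stand-in for a streaming source such as Kafka:

```python
import tensorflow as tf

def stream():  # stand-in for a Kafka/PubSub consumer
    for i in range(5):
        yield float(i)

ds = tf.data.Dataset.from_generator(
    stream, output_signature=tf.TensorSpec([], tf.float32))

# Iteration of elements: the only operation a pure stream guarantees.
for element in ds:
    print(element.numpy())

# Conversion to a 1-dimensional array, only safe if everything fits in memory.
array = tf.stack(list(ds))  # shape [5]
```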

7) Images:

Even though images are the format almost everyone works with, the formats are far more complicated than they appear:
a) **JPEG/BMP/PNG/WEBP** are plain 2-dimensional arrays with a single **1-frame**
b) **GIF** could be an **n-frame** sequence of 2-dimensional arrays with equal shapes.
c) **TIFF** is really a collection of sub-images with **different shapes**, so it is **n-image** of **1-frame**, not **n-frame**.
d) **DICOM** comes with deep metadata that you really cannot throw away.
e) **OpenEXR** consists of multiple parts with deep metadata.

I haven't touched images for quite some time because of the real complexity (compared with naive formats such as JPEG/BMP). I plan to revisit this soon.
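The 1-frame vs. n-frame distinction shows up directly in the decoded shapes; for example, with the existing TensorFlow decoders:

```python
import tensorflow as tf

jpeg = tf.io.decode_jpeg(tf.io.read_file("photo.jpg"))
print(jpeg.shape)  # [height, width, channels]      -- a single 1-frame image

gif = tf.io.decode_gif(tf.io.read_file("clip.gif"))
print(gif.shape)   # [num_frames, height, width, 3] -- an n-frame image
```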

8) Audio:

I haven't touched a lot of audio formats, though it looks like they are mostly `[n_samples, n_channels]` arrays with metadata/annotations. I need more time to understand them.

9) Video:

This is another area of complexity. We only have basic FFmpeg support, and there is a long way to go.

Some things to consider:
a) Some video files are indexable while others are streaming/iterable
b) Video files could come with different shapes for each frame (so a file could be `n elements of n-frame`, or `1 element of n-frame`).
c) Video files could have rich annotations/metadata.
d) Video files may come with audio (possibly with different language support).
e) Video files could also come with subtitles (text).

On top of the above notes, one item I am actually looking into is combining IOTensor with tf.function, so that operations could be "attached" to an IOTensor and lazily run only on the selected slice/index. I think this approach will show the real value of an IO-backed Tensor (versus reading the whole file into a Tensor first).
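A minimal sketch of that lazy-evaluation idea (the `LazyIOTensor` class, its `read_fn` parameter, and the `map` method are all illustrative, not the actual IOTensor design):

```python
import tensorflow as tf

class LazyIOTensor:
    """Records attached operations; reads and computes only the requested slice."""

    def __init__(self, read_fn, ops=()):
        self._read_fn = read_fn  # reads just the range [start, stop) from the file
        self._ops = ops          # operations attached so far, not yet executed

    def map(self, fn):
        # Attaching an op is cheap: nothing is read or computed here.
        return LazyIOTensor(self._read_fn, self._ops + (fn,))

    def __getitem__(self, s):
        data = self._read_fn(s.start, s.stop)  # IO happens only for the slice
        for fn in self._ops:
            data = tf.function(fn)(data)       # run attached ops on the slice only
        return data

# Stand-in reader: pretend the "file" holds the values 0, 1, 2, ...
tensor = LazyIOTensor(lambda start, stop: tf.range(start, stop, dtype=tf.float32))
print(tensor.map(lambda x: x * 2.0)[10:13])  # reads 3 values -> [20. 22. 24.]
```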