huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

[to_tf_dataset] Use Feather for better compatibility with TensorFlow ? #4542

Open lhoestq opened 2 years ago

lhoestq commented 2 years ago

To get good performance in TensorFlow, it is important to provide lists of data files in supported formats. For example, sharded TFRecord datasets are extremely performant. This is because tf.data can better leverage parallelism in this case, and load one file at a time into memory.

It seems that using tensorflow_io we could have something similar for to_tf_dataset if we provide sharded Feather files: https://www.tensorflow.org/io/api_docs/python/tfio/arrow/ArrowFeatherDataset

Feather is a format almost equivalent to the Arrow IPC Stream format we're using in datasets: Feather V2 is equivalent to Arrow IPC File format, which is an extension of the stream format (it has an extra footer). Therefore we could store datasets as Feather instead of Arrow IPC Stream format without breaking the whole library.
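As a sanity check on that equivalence, here is a quick sketch (not from the thread) showing that a Feather V2 file written with pyarrow opens with the generic Arrow IPC File reader:

import pyarrow as pa
import pyarrow.ipc as ipc
from pyarrow.feather import write_feather

table = pa.table({"x": [1, 2, 3]})
# Feather V2 is the default; compression is disabled to keep the example minimal
write_feather(table, "/tmp/example.feather", compression="uncompressed")

# The same file opens with the generic Arrow IPC File reader
with pa.memory_map("/tmp/example.feather") as source:
    assert ipc.open_file(source).read_all().equals(table)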

Here are a few points to explore:

We would also need to implement sharding when loading a dataset (this will be done anyway for #546)

cc @Rocketknight1 @gante feel free to comment in case I missed anything!

I'll share some files and scripts, so that we can benchmark performance of Feather files with tf.data

Rocketknight1 commented 2 years ago

This has so much potential to be great! Also I think you tagged some poor random dude on the internet whose name is also Joao, lol, edited that for you!

Rocketknight1 commented 2 years ago

cc @sayakpaul here too, since he was interested in our new approaches to converting datasets!

sayakpaul commented 2 years ago

Noted and I will look into the thread in detail tomorrow once I log back in.

sayakpaul commented 2 years ago

@lhoestq I have used TFRecords with tf.data for both vision and text, and I can say that they are quite performant. I haven't worked with Feather as extensively as I have with TFRecords. If you haven't started the benchmarking script yet, I can prepare a Colab notebook that loads Feather files, converts them into a tf.data pipeline, and does some basic preprocessing.

But in my limited understanding, Feather might be better suited for tabular data such as CSV files. Not yet sure if it's good for modalities like images.

lhoestq commented 2 years ago

Not yet sure if it's good for modalities like images.

We store images pretty much the same way as tensorflow_datasets (i.e. storing the encoded image bytes, or a path to the local image so that the image can be decoded on the fly), so as long as we use something similar to TFDS for image decoding it should be ok.

sayakpaul commented 2 years ago

So for image datasets, we could potentially store the paths in the feather format and decode and read them on the fly? But that introduces I/O overhead, since the images have to be read from disk every time.

With caching it could be somewhat mitigated but it's not a good solution for bigger image datasets.

lhoestq commented 2 years ago

So for image datasets, we could potentially store the paths in the feather format and decode and read them on the fly?

hopefully yes :)

I double-checked the TFDS source code and they always save the bytes actually, not the path. Anyway we'll see if we run into issues or not (as a first step we can require the bytes to be in the feather file)

sayakpaul commented 2 years ago

Yes. For images, TFDS actually prepares TFRecords first for encoding and then reuses them for every subsequent call.

sayakpaul commented 2 years ago

@lhoestq @Rocketknight1 I worked on this PoC that

I haven't benchmarked those different options yet. There's also a gotcha that I have noted in the PoC. I hope it gets us started but I'm sorry if this is redundant.

lhoestq commented 2 years ago

Cool, thanks! If I understand correctly, in your PoC you store the flattened array of pixels in the feather file. This will take a lot of disk space.

Maybe we could just save the encoded bytes and let users apply a map to decode/transform them into the format they need for training? Users can use tf.image to do so, for example.
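A minimal sketch (not from the thread) of that decode-in-a-map idea, assuming a dataset that yields (image_bytes, label) pairs; the target size and scaling are illustrative:

import tensorflow as tf

def decode(image_bytes, label):
    # Decode the raw encoded bytes on the fly
    image = tf.io.decode_image(image_bytes, channels=3, expand_animations=False)
    image = tf.image.resize(image, (224, 224)) / 255.0  # example transform
    return image, label

# dataset = dataset.map(decode, num_parallel_calls=tf.data.AUTOTUNE)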

sayakpaul commented 2 years ago

@lhoestq this is what I tried:

import math

import pyarrow as pa
import tqdm
from pyarrow.feather import write_feather

# image_paths, all_integer_labels, batch_size and chunk_size are assumed
# to be defined earlier in the notebook.

def read_image(path):
    with open(path, "rb") as f:
        return f.read()

total_images_written = 0

for step in tqdm.tnrange(int(math.ceil(len(image_paths) / batch_size))):
    batch_image_paths = image_paths[step * batch_size : (step + 1) * batch_size]
    batch_image_labels = all_integer_labels[step * batch_size : (step + 1) * batch_size]

    # Store the raw encoded bytes of each image, not decoded pixel arrays
    data = [read_image(path) for path in batch_image_paths]
    table = pa.Table.from_arrays([data, batch_image_labels], ["data", "labels"])
    write_feather(table, f"/tmp/flowers_feather_{step}.feather", chunksize=chunk_size)
    total_images_written += len(batch_image_paths)
    print(f"Total images written: {total_images_written}.")

    del data

I got the feather files done (no resizing required as you can see):

ls -lh /tmp/*.feather

-rw-r--r--  1 sayakpaul  wheel    64M Jun 24 09:28 /tmp/flowers_feather_0.feather
-rw-r--r--  1 sayakpaul  wheel    59M Jun 24 09:28 /tmp/flowers_feather_1.feather
-rw-r--r--  1 sayakpaul  wheel    51M Jun 24 09:28 /tmp/flowers_feather_2.feather
-rw-r--r--  1 sayakpaul  wheel    45M Jun 24 09:28 /tmp/flowers_feather_3.feather

Now there seems to be a problem with tfio.arrow:

import tensorflow as tf
import tensorflow_io.arrow as arrow_io

dataset = arrow_io.ArrowFeatherDataset(
    ["/tmp/flowers_feather_0.feather"],
    columns=(0, 1),  # column indices of ("data", "labels")
    output_types=(tf.string, tf.int64),
    output_shapes=([], []),
    batch_mode="auto",
)

print(dataset.element_spec) 

Prints:

(TensorSpec(shape=(None,), dtype=tf.string, name=None),
 TensorSpec(shape=(None,), dtype=tf.int64, name=None))

But when I do sample = next(iter(dataset)), it fails with:

InternalError                             Traceback (most recent call last)
Input In [30], in <cell line: 1>()
----> 1 sample = next(iter(dataset))

File ~/.local/bin/.virtualenvs/jax/lib/python3.8/site-packages/tensorflow/python/data/ops/iterator_ops.py:766, in OwnedIterator.__next__(self)
    764 def __next__(self):
    765   try:
--> 766     return self._next_internal()
    767   except errors.OutOfRangeError:
    768     raise StopIteration

File ~/.local/bin/.virtualenvs/jax/lib/python3.8/site-packages/tensorflow/python/data/ops/iterator_ops.py:749, in OwnedIterator._next_internal(self)
    746 # TODO(b/77291417): This runs in sync mode as iterators use an error status
    747 # to communicate that there is no more data to iterate over.
    748 with context.execution_mode(context.SYNC):
--> 749   ret = gen_dataset_ops.iterator_get_next(
    750       self._iterator_resource,
    751       output_types=self._flat_output_types,
    752       output_shapes=self._flat_output_shapes)
    754   try:
    755     # Fast path for the case `self._structure` is not a nested structure.
    756     return self._element_spec._from_compatible_tensor_list(ret)  # pylint: disable=protected-access

File ~/.local/bin/.virtualenvs/jax/lib/python3.8/site-packages/tensorflow/python/ops/gen_dataset_ops.py:3017, in iterator_get_next(iterator, output_types, output_shapes, name)
   3015   return _result
   3016 except _core._NotOkStatusException as e:
-> 3017   _ops.raise_from_not_ok_status(e, name)
   3018 except _core._FallbackException:
   3019   pass

File ~/.local/bin/.virtualenvs/jax/lib/python3.8/site-packages/tensorflow/python/framework/ops.py:7164, in raise_from_not_ok_status(e, name)
   7162 def raise_from_not_ok_status(e, name):
   7163   e.message += (" name: " + name if name is not None else "")
-> 7164   raise core._status_to_exception(e) from None

InternalError: Invalid: INVALID_ARGUMENT: arrow data type 0x7ff9899d8038 is not supported: Type error: Arrow data type is not supported [Op:IteratorGetNext]

Some additional notes:

sayakpaul commented 2 years ago

@lhoestq I think I was able to make it work in the way you were envisioning. Here's the PoC: https://gist.github.com/sayakpaul/f7d5cc312cd01cb31098fad3fd9c6b59#file-feather-tf-poc-bytes-ipynb

Some details:

Cc: @Rocketknight1 @gante

lhoestq commented 2 years ago

Cool, thanks! Too bad the Arrow binary type doesn't seem to be supported in arrow_io.ArrowFeatherDataset :/ We would also need it to support the Arrow struct type. Indeed, images in datasets are represented using the Arrow type

pa.struct({"path": pa.string(), "bytes": pa.binary()})

and I'm not sure yet how hard it is to support this.

Changing the typing on our side would introduce worrying breaking changes, that's why it would be awesome if it could work using these types.
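For illustration (not from the thread), here is what one image cell looks like under that storage type in pyarrow; the paths and bytes are dummy values:

import pyarrow as pa

# The struct type datasets uses to back the Image/Audio feature types
storage_type = pa.struct({"path": pa.string(), "bytes": pa.binary()})
images = pa.array(
    [{"path": "img0.jpg", "bytes": b"\x89PNG..."}],  # dummy values
    type=storage_type,
)
table = pa.table({"image": images, "label": pa.array([0])})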

sayakpaul commented 2 years ago

If the ArrowFeatherDataset doesn't yet support it, I guess our hands are a bit tied at the moment.

IIUC, in my latest PoC notebook, you wanted each entry in the feather file to be represented like so?

pa.struct({"path": pa.string(), "bytes": pa.binary()})

In that case, pa.binary() isn't yet supported.

lhoestq commented 2 years ago

IIUC, in my latest PoC notebook, you wanted each entry in the feather file to be represented like so?

pa.struct({"path": pa.string(), "bytes": pa.binary()})

Yeah, because that's the data format we're using. If we were to use base64, then we would have to process the full dataset to convert it, which can take some time. Converting to TFRecords would be simpler than converting to base64 in Feather files.

Maybe it would take too much time to be worth exploring, but according to https://github.com/tensorflow/io/issues/1361#issuecomment-819029002 it's possible to add support for binary type in ArrowFeatherDataset. What do you think? Any other alternative in mind?

sayakpaul commented 2 years ago

Maybe it would take too much time to be worth exploring, but according to https://github.com/tensorflow/io/issues/1361#issuecomment-819029002 it's possible to add support for binary type in ArrowFeatherDataset.

Should be possible as per the comment but there hasn't been any progress and it's been more than a year.

If we were to use base64, then we would have to process the full dataset to convert it, which can take some time.

I don't understand this. I would think TFRecords would also need something similar, but I'm missing the context you're coming from.

What do you think? Any other alternative in mind?

TFRecords, since the TensorFlow ecosystem has developed good support for them over the years.

lhoestq commented 2 years ago

I don't understand this. I would think TFRecords would also need something similar, but I'm missing the context you're coming from.

Users already have a copy of the dataset in Arrow format (we can change this to Feather). So to load the Arrow/Feather files into a TF dataset we need TF IO or something like that. Otherwise the user has to convert all the files from Arrow to TFRecords to use tf.data efficiently. But the conversion needs resources: CPU, disk, time. Converting the images to base64 requires the same sort of resources.

So the issue we're trying to tackle is how to load the Arrow data in TF without having to convert anything ^^

Rocketknight1 commented 2 years ago

Yeah, it looks like in its current state the tfio support for Feather is incomplete, so we'd end up having to write a lot of it, or do a conversion that defeats the whole point (because if we're going to convert the whole dataset we might as well convert to TFRecord).

sayakpaul commented 2 years ago

Understood @lhoestq. Thanks for explaining!

Agreed with @Rocketknight1.

Rocketknight1 commented 2 years ago

@lhoestq Although I think this is unfortunately a dead end for now because of the limitations at TF's end, we could still explore automatic conversion to TFRecord, or I could dive into refining to_tf_dataset() to yield unbatched samples and/or load samples with multiprocessing to improve throughput. Do you have any preferences there?

sayakpaul commented 2 years ago

@lhoestq Although I think this is unfortunately a dead end for now because of the limitations at TF's end, we could still explore automatic conversion to TFRecord, or I could dive into refining to_tf_dataset() to yield unbatched samples and/or load samples with multiprocessing to improve throughput. Do you have any preferences there?

Happy to take part there @Rocketknight1.

lhoestq commented 2 years ago

If to_tf_dataset can be unbatched, then it should be fairly easy for users to convert the TF dataset to TFRecords, right?

sayakpaul commented 2 years ago

@lhoestq why would one convert to TFRecords after unbatching?

Rocketknight1 commented 2 years ago

If to_tf_dataset can be unbatched, then it should be fairly easy for users to convert the TF dataset to TFRecords, right?

Sort of! A tf.data.Dataset is more like an iterator, and does not support sample indexing. to_tf_dataset() creates an iterator, but to convert that to TFRecord, the user would have to iterate over the whole thing and manually save the stream of samples to files.
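For concreteness, a rough sketch (not an existing datasets API) of that manual conversion for (image_bytes, label) pairs; unbatched_ds stands in for a hypothetical unbatched to_tf_dataset() output:

import tensorflow as tf

def to_example(image_bytes, label):
    # Wrap one sample in the tf.train.Example protobuf
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes.numpy()])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
    }))

with tf.io.TFRecordWriter("/tmp/shard-00000.tfrecord") as writer:
    for image_bytes, label in unbatched_ds:  # hypothetical unbatched dataset
        writer.write(to_example(image_bytes, label).SerializeToString())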

lhoestq commented 2 years ago

Would someone like to try diving into tfio to fix this? It sounds like a good opportunity to learn about the best ways to load a dataset for TF, and about the connections between Arrow and TF.

If we can at least get the Arrow binary type working for TF, that would already be awesome (issue https://github.com/tensorflow/io/issues/1361)

also cc @nateraw in case you'd be interested ;)

sayakpaul commented 2 years ago

It sounds like a good opportunity to learn about the best ways to load a dataset for TF

The recommended way would likely be a combination of TFRecords and tf.data.

Exploring the connection between Arrow and TensorFlow is definitely worth pursuing though. But I am not sure about the implications of storing images in a format supported by Arrow. I guess we'll know more once we have at least figured out support for the binary type in TFIO. I will spend some time on it and keep this thread updated.

sayakpaul commented 2 years ago

I am currently working on a fine-tuning notebook for the TFSegFormer model (Semantic Segmentation). The resolution is high for both the input images and the labels - (512, 512, 3). Here's the Colab Notebook (it's a WIP so please bear that in mind).

I think the current implementation of to_tf_dataset() does create a bottleneck here since the GPU utilization is quite low.

sayakpaul commented 2 years ago

Here's a notebook showing the performance difference: https://colab.research.google.com/gist/sayakpaul/d7ca67c90beb47e354942c9d8c0bd8ef/scratchpad.ipynb.

Note that I acknowledge it's not an apples-to-apples comparison in many aspects (the dataset isn't the same, the data serialization format isn't the same, etc.), but this is the best I could do.

lhoestq commented 2 years ago

Thanks! I think the speed difference can be partly explained: you use ds.shuffle in your dataset, which is an exact shuffle (whereas TFDS does buffered shuffling): it slows down query time by 2x to 10x since it has to work with data that are not contiguous.

The rest of the speed difference seems to be caused by image decoding (from 330µs/image to 30ms/image)
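To make the two shuffling strategies concrete, a small sketch (not from the thread; the dataset name and buffer size are illustrative):

import tensorflow as tf
from datasets import load_dataset

# datasets: exact shuffle -> every row is reachable, so reads from the
# Arrow file become random and non-contiguous
hf_ds = load_dataset("beans", split="train").shuffle(seed=42)

# tf.data / TFDS: buffered shuffle -> approximate, but the underlying
# data is still read sequentially
tf_ds = tf.data.Dataset.range(1_000).shuffle(buffer_size=100, seed=42)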

sayakpaul commented 2 years ago

Fair enough. I can do one without shuffling too. But shuffling is an important factor to consider, I guess.

sayakpaul commented 2 years ago

@lhoestq the latest tfio-nightly (https://pypi.org/project/tensorflow-io-nightly/) supports the bytes type directly. Here's a notebook demonstrating that: https://gist.github.com/sayakpaul/f7d5cc312cd01cb31098fad3fd9c6b59#file-feather-v2-tfio-ipynb.

The notebook does the following:

Cc: @Rocketknight1 @gante

sayakpaul commented 2 years ago

@lhoestq I also incorporated the Arrow struct type you suggested in https://github.com/huggingface/datasets/issues/4542/#issuecomment-1166588367.

This is the updated notebook: https://gist.github.com/sayakpaul/f7d5cc312cd01cb31098fad3fd9c6b59#file-feather-v2-struct-tfio-ipynb. It doesn't store the data in dictionaries, so I need some assistance here to make it happen.

lhoestq commented 2 years ago

Awesome!

Regarding dictionaries: is it because the ArrowFeatherDataset can't read a struct column? It would be nice if struct were supported, or if we could at least specify a nested column in ArrowFeatherDataset(..., columns=...) (we could reconstruct the dictionary afterwards using map anyway).
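A sketch of that reconstruction idea, assuming a hypothetical flat_ds that yields the struct's leaf columns as separate flat tensors:

import tensorflow as tf

def rebuild(path, image_bytes, label):
    # Rebuild the nested dictionary structure from flat columns
    return {"image": {"path": path, "bytes": image_bytes}, "label": label}

# nested_ds = flat_ds.map(rebuild, num_parallel_calls=tf.data.AUTOTUNE)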

sayakpaul commented 2 years ago

columns only supports specifying column indices, so I'm not sure how a nested column name would be interpreted there.

I could not figure out a way to serialize the data in the form of dictionaries created using the Arrow struct type. If you have any suggestions, please let me know.

lhoestq commented 2 years ago

Ok, I see! You can create a Feather file with struct data this way:

# Inside the writer loop from the earlier snippet; note that the struct
# column is named "audio" here, but the layout is the same for images.
batch_image_paths = image_paths[step * batch_size : (step + 1) * batch_size]
batch_image_labels = all_integer_labels[step * batch_size : (step + 1) * batch_size]
table = pa.table({
    "label": batch_image_labels,
    "audio": [{"path": path, "bytes": read_image(path)} for path in batch_image_paths]
})
write_feather(table, f"/tmp/flowers_feather_{step}.feather", chunksize=chunk_size)

sayakpaul commented 2 years ago

Okay. Let me look into it and get back to you. Let's discuss the next steps then?

lhoestq commented 2 years ago

I tried creating an ArrowFeatherDataset from a dummy Feather file with one column with struct data and it seems to fail:

>>> ds = ArrowFeatherDataset("tmp.feather", columns=(0,), output_types=({"audio": {"bytes": tf.string, "path": tf.string}},))
>>> list(ds)
InternalError: Invalid: INVALID_ARGUMENT: arrow data type 0x7f8c2cce83b8 is not supported: Type error: Arrow data type is not supported [Op:IteratorGetNext]

(I hope I got the output_types right though)

Maybe let's ask the TFIO team?

sayakpaul commented 2 years ago

Could you provide a link to your notebook? I would like to give it a try.

lhoestq commented 2 years ago

I simply ran this:

import pyarrow as pa
import tensorflow as tf
import tensorflow_io.arrow as arrow_io
from pyarrow.feather import write_feather

table = pa.table({
    "audio": [{"path": "foo", "bytes": b"bar"}]
})
write_feather(table, "tmp.feather")
ds = arrow_io.ArrowFeatherDataset(
    "tmp.feather", columns=(0,), output_types=({"audio": {"bytes": tf.string, "path": tf.string}},)
)

sayakpaul commented 2 years ago

@lhoestq please refer to this notebook: https://gist.github.com/sayakpaul/f7d5cc312cd01cb31098fad3fd9c6b59#file-feather-v2-struct-tfio-ipynb.

I serialized the tables like so:

    table = pa.table(
        {"path": batch_image_paths, "image_bytes": data, "label": batch_image_labels}
    )

From the documentation of ArrowFeatherDataset:

output_types => Tensor dtypes of the output tensors

I am not sure if it's possible to pass raw dictionaries to output_types like you did here.

sayakpaul commented 2 years ago

As per https://github.com/tensorflow/io/issues/1705, having the struct in the output_type is a requirement. I am not sure why this needs to be explicitly enforced.

lhoestq commented 2 years ago

I'm just not sure how to properly specify the output_types in this case tbh ^^'

sayakpaul commented 2 years ago

If we follow:

    table = pa.table(
        {"path": batch_image_paths, "image_bytes": data, "label": batch_image_labels}
    )

then (full code here):

output_types=(tf.string, tf.string, tf.int64)

The user just needs to know the schema (in particular the dtypes) of the struct.

lhoestq commented 2 years ago

I see! The thing is: what output_types do we set when the data is nested in a struct like {"audio": [{"path": "foo", "bytes": b"bar"}]}? I couldn't find a way to make it work.

sayakpaul commented 2 years ago

Okay. I am unsure why having such a nested structure is required. Doesn't the structure shown in https://github.com/huggingface/datasets/issues/4542#issuecomment-1226996068 suffice?

lhoestq commented 2 years ago

That's because the Image and Audio types in datasets are actually backed by an Arrow struct type containing one field for the bytes and one field for the file name. So currently you would have to do an extra conversion step to get a flat structure, and this requires copying the dataset, which can take lots of time and disk space.
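For illustration, a minimal pyarrow sketch (dummy values) of the flattening step being avoided; note that it rewrites the whole table:

import pyarrow as pa

table = pa.table({
    "image": pa.array([{"path": "a.jpg", "bytes": b"..."}]),
    "label": [0],
})
flat = table.flatten()  # -> columns "image.path", "image.bytes", "label"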

sayakpaul commented 2 years ago

Hmm got it. Then I guess we're back to the TFIO team.

sayakpaul commented 1 year ago

If I understand correctly, in your PoC you store the flattened array of pixels in the feather file. This will take a lot of disk space.

Which encoding scheme would you suggest? TFRecord (tf.train.Example) solves this problem quite nicely.