Open lhoestq opened 2 years ago
This has so much potential to be great! Also I think you tagged some poor random dude on the internet whose name is also Joao, lol, edited that for you!
cc @sayakpaul here too, since he was interested in our new approaches to converting datasets!
Noted and I will look into the thread in detail tomorrow once I log back in.
@lhoestq I have used TFRecords with tf.data
for both vision and text and I can say that they are quite performant. I haven't worked with Feather yet as similarly as I have with TFRecords. If you haven't started the benchmarking script yet, I can prepare a Colab notebook that loads Feather files, converts them into a tf.data
pipeline, and does some basic preprocessing.
But in my limited understanding, Feather might be better suited for CSV files. Not yet sure if it's good for modalities like images.
Not yet sure if it's good for modalities like images.
We store images pretty much the same way as tensorflow_datasets (i.e. storing the encoded image bytes, or a path to the local image, so that the image can be decoded on-the-fly), so as long as we use something similar as TFDS for image decoding it should be ok
So for image datasets, we could potentially store the paths in the feather format and decode and read them on the fly? But it introduces an I/O redundancy of having to read the images every time.
With caching it could be somewhat mitigated but it's not a good solution for bigger image datasets.
So for image datasets, we could potentially store the paths in the feather format and decode and read them on the fly?
hopefully yes :)
I double-checked the TFDS source code and they always save the bytes actually, not the path. Anyway we'll see if we run into issues or not (as a first step we can require the bytes to be in the feather file)
Yes. For images, TFDS actually prepares TFRecords first for encoding and then reuses them for every subsequent call.
@lhoestq @Rocketknight1 I worked on this PoC that
tf_flowers
).I haven't benchmarked those different options yet. There's also a gotcha that I have noted in the PoC. I hope it gets us started but I'm sorry if this is redundant.
Cool thanks ! If I understand correctly in your PoC you store the flattened array of pixels in the feather file. This will take a lot of disk space.
Maybe we could just save the encoded bytes and let users apply a map
to decode/transform them into the format they need for training ? Users can use tf.image to do so for example
@lhoestq this is what I tried:
def read_image(path):
with open(path, "rb") as f:
return f.read()
total_images_written = 0
for step in tqdm.tnrange(int(math.ceil(len(image_paths) / batch_size))):
batch_image_paths = image_paths[step * batch_size : (step + 1) * batch_size]
batch_image_labels = all_integer_labels[step * batch_size : (step + 1) * batch_size]
data = [read_image(path) for path in batch_image_paths]
table = pa.Table.from_arrays([data, batch_image_labels], ["data", "labels"])
write_feather(table, f"/tmp/flowers_feather_{step}.feather", chunksize=chunk_size)
total_images_written += len(batch_image_paths)
print(f"Total images written: {total_images_written}.")
del data
I got the feather files done (no resizing required as you can see):
ls -lh /tmp/*.feather
-rw-r--r-- 1 sayakpaul wheel 64M Jun 24 09:28 /tmp/flowers_feather_0.feather
-rw-r--r-- 1 sayakpaul wheel 59M Jun 24 09:28 /tmp/flowers_feather_1.feather
-rw-r--r-- 1 sayakpaul wheel 51M Jun 24 09:28 /tmp/flowers_feather_2.feather
-rw-r--r-- 1 sayakpaul wheel 45M Jun 24 09:28 /tmp/flowers_feather_3.feather
Now there seems to be a problem with tfio.arrow
:
import tensorflow_io.arrow as arrow_io
dataset = arrow_io.ArrowFeatherDataset(
["/tmp/flowers_feather_0.feather"],
columns=(0, 1),
output_types=(tf.string, tf.int64),
output_shapes=([], []),
batch_mode="auto",
)
print(dataset.element_spec)
Prints:
(TensorSpec(shape=(None,), dtype=tf.string, name=None),
TensorSpec(shape=(None,), dtype=tf.int64, name=None))
But when I do sample = next(iter(dataset))
it goes into:
InternalError Traceback (most recent call last)
Input In [30], in <cell line: 1>()
----> 1 sample = next(iter(dataset))
File ~/.local/bin/.virtualenvs/jax/lib/python3.8/site-packages/tensorflow/python/data/ops/iterator_ops.py:766, in OwnedIterator.__next__(self)
764 def __next__(self):
765 try:
--> 766 return self._next_internal()
767 except errors.OutOfRangeError:
768 raise StopIteration
File ~/.local/bin/.virtualenvs/jax/lib/python3.8/site-packages/tensorflow/python/data/ops/iterator_ops.py:749, in OwnedIterator._next_internal(self)
746 # TODO(b/77291417): This runs in sync mode as iterators use an error status
747 # to communicate that there is no more data to iterate over.
748 with context.execution_mode(context.SYNC):
--> 749 ret = gen_dataset_ops.iterator_get_next(
750 self._iterator_resource,
751 output_types=self._flat_output_types,
752 output_shapes=self._flat_output_shapes)
754 try:
755 # Fast path for the case `self._structure` is not a nested structure.
756 return self._element_spec._from_compatible_tensor_list(ret) # pylint: disable=protected-access
File ~/.local/bin/.virtualenvs/jax/lib/python3.8/site-packages/tensorflow/python/ops/gen_dataset_ops.py:3017, in iterator_get_next(iterator, output_types, output_shapes, name)
3015 return _result
3016 except _core._NotOkStatusException as e:
-> 3017 _ops.raise_from_not_ok_status(e, name)
3018 except _core._FallbackException:
3019 pass
File ~/.local/bin/.virtualenvs/jax/lib/python3.8/site-packages/tensorflow/python/framework/ops.py:7164, in raise_from_not_ok_status(e, name)
7162 def raise_from_not_ok_status(e, name):
7163 e.message += (" name: " + name if name is not None else "")
-> 7164 raise core._status_to_exception(e) from None
InternalError: Invalid: INVALID_ARGUMENT: arrow data type 0x7ff9899d8038 is not supported: Type error: Arrow data type is not supported [Op:IteratorGetNext]
Some additional notes:
I can actually decode an image encoded with read_image()
(shown earlier):
sample_image_path = image_paths[0]
encoded_image = read_image(sample_image_path)
image = tf.image.decode_png(encoded_image, 3)
print(image.shape)
If the above tf.data.Dataset
object would have succeeded my plan was to just map the decoder like so:
autotune = tf.data.AUTOTUNE
dataset = dataset.map(lambda x, y: (tf.image.decode_png(x, 3), y), num_parallel_calls=autotune)
@lhoestq I think I was able to make it work in the way you were envisioning. Here's the PoC: https://gist.github.com/sayakpaul/f7d5cc312cd01cb31098fad3fd9c6b59#file-feather-tf-poc-bytes-ipynb
Some details:
base64
). In comparison to the flattened arrays as before, the size of the individual feather files has reduced (144 MB -> 85 MB, largest).tf.io.decode_base64
) as an image with tf.image.decode_png()
. tf.data
pipeline) involves the following libraries: pyarraow
, tensorflow-io
, and tensorflow
. Cc: @Rocketknight1 @gante
Cool thanks ! Too bad the Arrow binary type doesn't seem to be supported in arrow_io.ArrowFeatherDataset
:/ We would also need it to support Arrow struct type. Indeed images in datasets
are represented using an Arrow type
pa.struct({"path": pa.string(), "bytes": pa.binary()})
not sure yet how hard it is to support this though.
Changing the typing on our side would create concerning breaking changes, that's why it would be awesome if it could work using these types
If the ArrowFeatherDataset doesn't yet support it, I guess our hands are a bit tied at the moment.
IIUC, in my latest PoC notebook, you wanted to see each entry in the feather file to be represented like so?
pa.struct({"path": pa.string(), "bytes": pa.binary()})
In that case, pa.binary()
isn't yet supported.
IIUC, in my latest PoC notebook, you wanted to see each entry in the feather file to be represented like so?
pa.struct({"path": pa.string(), "bytes": pa.binary()})
Yea because that's the data format we're using. If we were to use base64, then we would have to process the full dataset to convert it, which can take some time. Converting to TFRecords would be simpler than converting to base64 in Feather files.
Maybe it would take too much time to be worth exploring, but according to https://github.com/tensorflow/io/issues/1361#issuecomment-819029002 it's possible to add support for binary type in ArrowFeatherDataset. What do you think ? Any other alternative in mind ?
Maybe it would take too much time to be worth exploring, but according to https://github.com/tensorflow/io/issues/1361#issuecomment-819029002 it's possible to add support for binary type in ArrowFeatherDataset.
Should be possible as per the comment but there hasn't been any progress and it's been more than a year.
If we were to use base64, then we would have to process the full dataset to convert it, which can take some time.
I don't understand this. I would think TFRecords would also need something similar but I need the context you're coming from.
What do you think ? Any other alternative in mind ?
TFRecords since the TensorFlow ecosystem has developed good support for it over the years.
I don't understand this. I would think TFRecords would also need something similar but I need the context you're coming from.
Users already have a copy of the dataset in Arrow format (we can change this to Feather). So to load the Arrow/feather files to a TF dataset we need TF IO or something like that. Otherwise the user has to convert all the files from Arrow to TFRecords to use TF data efficiently. But the conversion needs resources: CPU, disk, time. Converting the images to base64 require the same sort of resources.
So the issue we're trying to tackle is how to load the Arrow data in TF without having to convert anything ^^
Yeah, it looks like in its current state the tfio support for Feather
is incomplete, so we'd end up having to write a lot of it, or do a conversion that defeats the whole point (because if we're going to convert the whole dataset we might as well convert to TFRecord
).
Understood @lhoestq. Thanks for explaining!
Agreed with @Rocketknight1.
@lhoestq Although I think this is a dead-end for now unfortunately, because of the limitations at TF's end, we could still explore automatic conversion to TFRecord, or I could dive into refining to_tf_dataset()
to yield unbatched samples and/or load samples with multiprocessing to improve throughput. Do you have any preferences there?
@lhoestq Although I think this is a dead-end for now unfortunately, because of the limitations at TF's end, we could still explore automatic conversion to TFRecord, or I could dive into refining
to_tf_dataset()
to yield unbatched samples and/or load samples with multiprocessing to improve throughput. Do you have any preferences there?
Happy to take part there @Rocketknight1.
If to_tf_dataset
can be unbatched, then it should be fairly easy for users to convert the TF dataset to TFRecords right ?
@lhoestq why one would convert to TFRecords after unbatching?
If to_tf_dataset can be unbatched, then it should be fairly easy for users to convert the TF dataset to TFRecords right ?
Sort of! A tf.data.Dataset
is more like an iterator, and does not support sample indexing. to_tf_dataset()
creates an iterator, but to convert that to TFRecord
, the user would have to iterate over the whole thing and manually save the stream of samples to files.
Someone would like to try to dive into tfio to fix this ? Sounds like a good opportunity to learn what are the best ways to load a dataset for TF, and also the connections between Arrow and TF.
If we can at least have the Arrow binary
type working for TF that would be awesome already (issue https://github.com/tensorflow/io/issues/1361)
also cc @nateraw in case you'd be interested ;)
Sounds like a good opportunity to learn what are the best ways to load a dataset for TF
The recommended way would likely be a combination of TFRecords and tf.data
.
Exploring the connection between Arrow and TensorFlow is definitely worth pursuing though. But I am not sure about the implications of storing images in a format supported by Arrow. I guess we'll know more once we have at least figured out the support for binary
type for TFIO. I will spend some time on it and keep this thread updated.
I am currently working on a fine-tuning notebook for the TFSegFormer model (Semantic Segmentation). The resolution is high for both the input images and the labels - (512, 512, 3). Here's the Colab Notebook (it's a WIP so please bear that in mind).
I think the current implementation of to_tf_dataset()
does create a bottleneck here since the GPU utilization is quite low.
Here's a notebook showing the performance difference: https://colab.research.google.com/gist/sayakpaul/d7ca67c90beb47e354942c9d8c0bd8ef/scratchpad.ipynb.
Note that I acknowledge that it's not an apples-to-apples comparison in many aspects (the dataset isn't the same, data serialization format isn't the same, etc.) but this is the best I could do.
Thanks ! I think the speed difference can be partly explained: you use ds.shuffle in your dataset, which is an exact shuffling (compared to TFDS which does buffer shuffling): it slows down query time by 2x to 10x since it has to play with data that are not contiguous.
The rest of the speed difference seems to be caused by image decoding (from 330µs/image to 30ms/image)
Fair enough. Can do one without shuffling too. But it's an important one to consider I guess.
@lhoestq the latest tfio-nightly
(https://pypi.org/project/tensorflow-io-nightly/) supports direct bytes type. Here's a notebook demonstrating that: https://gist.github.com/sayakpaul/f7d5cc312cd01cb31098fad3fd9c6b59#file-feather-v2-tfio-ipynb.
The notebook does the following:
tf.data.Dataset
object out of those feather files.Cc: @Rocketknight1 @gante
@lhoestq I also incorporated the ArrowType you suggested in https://github.com/huggingface/datasets/issues/4542/#issuecomment-1166588367.
This is the updated notebook: https://gist.github.com/sayakpaul/f7d5cc312cd01cb31098fad3fd9c6b59#file-feather-v2-struct-tfio-ipynb. It doesn't store the data in dictionaries, so I need some assistance here to make it happen.
Awesome !
Regarding dictionaries: is it because the ArrowFeatherDataset can't read a struct column ? It would be nice if struct was supported, or if we could at least specify a nested column in ArrowFeatherDataset(..., columns=...)
(we could reconstruct the dictionary afterward using map
anyway).
columns
supports specifying column indices, so not too sure about the column name interpretation there.
I could not figure out a way to serialize the data in form of dictionaries created using the ArrowType. If you have any suggestions, please let me know.
Ok I see ! You can create a Feather file with struct data this way:
batch_image_paths = image_paths[step * batch_size : (step + 1) * batch_size]
batch_image_labels = all_integer_labels[step * batch_size : (step + 1) * batch_size]
table = pa.table({
"label": batch_image_labels,
"audio": [{"path": path, "bytes": read_image(path)} for path in batch_image_paths]
})
write_feather(table, f"/tmp/flowers_feather_{step}.feather", chunksize=chunk_size)
Okay. Let me look into it and get back to you. Let's discuss the next steps then?
I tried creating an ArrowFeatherDataset
from a dummy Feather file with one column with struct data and it seems to fail:
>>> ds = ArrowFeatherDataset("tmp.feather", columns=(0,), output_types=({"audio": {"bytes": tf.string, "path": tf.string}},))
>>> list(ds)
InternalError: Invalid: INVALID_ARGUMENT: arrow data type 0x7f8c2cce83b8 is not supported: Type error: Arrow data type is not supported [Op:IteratorGetNext]
(I hope I got the output_types right though)
Maybe let's ask the TFIO team ?
Could you provide a link to your notebook? I would like to give it a try.
I simply ran this
import pyarrow as pa
import tensorflow as tf
import tensorflow_io.arrow as arrow_io
from pyarrow.feather import write_feather
table = pa.table({
"audio": [{"path": "foo", "bytes": b"bar"}]
})
write_feather(table, "tmp.feather")
ds = arrow_io.ArrowFeatherDataset(
"tmp.feather", columns=(0,), output_types=({"audio": {"bytes": tf.string, "path": tf.string}},)
)
@lhoestq refer to this notebook once: https://gist.github.com/sayakpaul/f7d5cc312cd01cb31098fad3fd9c6b59#file-feather-v2-struct-tfio-ipynb.
I serialized the tables like so:
table = pa.table(
{"path": batch_image_paths, "image_bytes": data, "label": batch_image_labels}
)
From the documentation of ArrowFeatherDataset
:
output_types => Tensor dtypes of the output tensors
I am not sure if it's possible to pass raw dictionaries to it output_types
like you did here.
As per https://github.com/tensorflow/io/issues/1705, having the struct in the output_type
is a requirement. I am not sure why this needs to be explicitly enforced.
I'm just not sure how to properly specify the output_types in this case tbh ^^'
If we follow:
table = pa.table(
{"path": batch_image_paths, "image_bytes": data, "label": batch_image_labels}
)
then (full code here):
output_types=(tf.string, tf.string, tf.int64)
The user just needs to know the schema (in particular the dtypes) of the struct.
I see ! The thing what output_types do we set when the data is nested in a struct like {"audio": [{"path": "foo", "bytes": b"bar"}]}
? I couldn't find a way to do make it work
Okay. I am unsure why having such a nested structure is required. The structure shown in https://github.com/huggingface/datasets/issues/4542#issuecomment-1226996068 doesn't suffice?
That's because the Image
and Audio
type in datasets are actually backed by an Arrow struct
type containing one field for the bytes
and one field for the file name. So currently you would have to do an extra conversion step to have a flat structure, and this requires to copy the dataset which can take lots of time and disk space.
Hmm got it. Then I guess we're back to the TFIO team.
Which encoding scheme you'd suggest? TFRecord (tf.train.Example) solves this problem quite nice.
On Thu, Jun 23, 2022, 18:42 Quentin Lhoest @.***> wrote:
If I understand correctly in your PoC you store the flattened array of pixels in the feather file. This will take a lot of disk space.
Maybe we could just save the encoded bytes and let users apply a map to decode/transform them into the format they need for training ? Users can use tf.image to do so for example
— Reply to this email directly, view it on GitHub https://github.com/huggingface/datasets/issues/4542#issuecomment-1164392223, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFPE2TFGKRMQLINHC4MWZXLVQRPEPANCNFSM5ZQPZVKA . You are receiving this because you were mentioned.Message ID: @.***>
To have better performance in TensorFlow, it is important to provide lists of data files in supported formats. For example sharded TFRecords datasets are extremely performant. This is because tf.data can better leverage parallelism in this case, and load one file at a time in memory.
It seems that using
tensorflow_io
we could have something similar forto_tf_dataset
if we provide sharded Feather files: https://www.tensorflow.org/io/api_docs/python/tfio/arrow/ArrowFeatherDatasetFeather is a format almost equivalent to the Arrow IPC Stream format we're using in
datasets
: Feather V2 is equivalent to Arrow IPC File format, which is an extension of the stream format (it has an extra footer). Therefore we could store datasets as Feather instead of Arrow IPC Stream format without breaking the whole library.Here are a few points to explore
We would also need to implement sharding when loading a dataset (this will be done anyway for #546)
cc @Rocketknight1 @gante feel free to comment in case I missed anything !
I'll share some files and scripts, so that we can benchmark performance of Feather files with tf.data