Open NumesSanguis opened 5 years ago
In this case, because you don't need to apply any transformation to your input, you can plug the ds directly into Keras if you use as_supervised. It will return a tuple (input, target) instead of a dict:
ds_train = tfds.load('dataset', split='train', as_supervised=True)
model.fit_generator(ds_train, epochs=5)
For images, you would additionally have to cast/normalize the image to tf.float32:
def _normalize_img(img, label):
    img = tf.cast(img, tf.float32) / 255.
    return (img, label)

ds_train = tfds.load('mnist', split='train', as_supervised=True)
ds_train = ds_train.map(_normalize_img)
model.fit_generator(ds_train, epochs=5)
The as_supervised=True did the trick. Thank you!
Let me know if the following is better handled in a separate issue.
The data I'm dealing with is audio data, and every file has a different length. Therefore, I want to apply a sliding window on every sample.
Something like ds_train.window(size=64, shift=32).flat_map(lambda x: x.batch(128)) won't work, because that's a sliding window over batches and I need a sliding window over every element.
I wrote a sliding_window function in NumPy before, so I thought I would use it with your suggested ds.map(). However, I noticed that ds_train = tfds.load('dataset', split='train', as_supervised=True) doesn't provide Tensors as input for the map function?
Normal conversion to and from Tensor <--> NumPy:
tensor_test = tf.convert_to_tensor(np.array([.3, .6, .2, .7]).reshape(2, 2), dtype=tf.float32)
display(tensor_test)
display(np.array(tensor_test)) # or tensor_test.numpy()
Output:
<tf.Tensor: id=3789, shape=(2, 2), dtype=float32, numpy=
array([[0.3, 0.6],
[0.2, 0.7]], dtype=float32)>
[[0.3 0.6]
[0.2 0.7]]
When using .map():
def _map_test(x, y):
    print(x)
    print(np.array(x))  # or print(x.numpy())
    return x, y

ds_train = ds_train.map(_map_test)
Output:
Tensor("args_0:0", shape=(None,), dtype=float32)
Tensor("args_0:0", shape=(None,), dtype=float32)
Therefore, it seems that the output of tfds.builder().load(as_supervised=True) is somewhat different from normal Tensors?
Questions:
GeneratorBasedBuilder?
Tensor("args_0:0", shape=(None,), dtype=float32) as a normal Numpy array and then convert it back to a Tensor in the .map() function?
All functions executed inside ds.map are executed in graph mode, even in TF 2.0 with eager execution. This keeps the input pipeline efficient.
I think there was a decorator @py_func or something similar to convert into python and numpy array.
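For reference, the deprecated tf.py_func has two TF 2.x successors, tf.py_function and tf.numpy_function. The following is only a minimal sketch of calling NumPy code from inside ds.map through tf.numpy_function. The helper names np_double and map_fn and the toy data are hypothetical, and the static shape is restored by hand because it is lost across the Python call.

```python
import numpy as np
import tensorflow as tf

def np_double(x):
    # Inside tf.numpy_function, x is a real NumPy array, not a symbolic Tensor.
    return (x * 2).astype(np.float32)

def map_fn(x, y):
    # Wrap the NumPy function so it can run inside the graph-mode map.
    out = tf.numpy_function(np_double, [x], tf.float32)
    # The wrapped call returns a tensor of unknown shape; restore it manually.
    out.set_shape([None])
    return out, y

# Toy (input, label) dataset standing in for the as_supervised=True output.
ds = tf.data.Dataset.from_tensor_slices(([[1., 2.], [3., 4.]], [0, 1]))
ds = ds.map(map_fn)

result = [(x.numpy().tolist(), int(y.numpy())) for x, y in ds]
print(result)
```

The Python call runs outside the graph, so it may limit parallelism compared with pure TensorFlow ops.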
Otherwise, I haven't really understood what you want to accomplish with the sliding window.
Note that you have padded_batch, which allows batching tensors of different lengths:
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#padded_batch
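A small sketch of what padded_batch does with variable-length elements, assuming toy integer sequences built with tf.range rather than real audio. Passing padded_shapes=[None] asks for padding up to the longest element in each batch.

```python
import tensorflow as tf

# Three sequences of lengths 1, 2 and 3: [0], [0, 1], [0, 1, 2].
ds = tf.data.Dataset.range(1, 4).map(lambda n: tf.range(n))

# Pad every sequence in the batch to the length of the longest one.
ds = ds.padded_batch(3, padded_shapes=[None])

batch = next(iter(ds)).numpy().tolist()
print(batch)
```

Shorter sequences are filled with zeros by default, so downstream layers may need masking.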
Thank you for your advice, I will take a look at @py_func.
Sorry for not being clear. One audio sample can range from 1 to 10 minutes. Even with a low sample rate of 8000 Hz, a 1-minute recording is 8000 * 60 = 480,000 float samples as input. That is too big for an input layer, and padding it to 10 minutes makes it even less feasible.
What I want to do is retrieve 1 audio sample from the Dataset, e.g. 2 minutes, meaning an array of shape (960000,). Then, with a .map(sliding_window) function, reshape it to e.g. (3750, 256). That means that e.g. an LSTM sees an array with a fixed input size of 256 at a time.
From my pre-TF 2.0 (Keras) experience, different batch sizes work if it's only 1 sample per batch?
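To make the arithmetic above concrete, here is a NumPy-only sketch of the non-overlapping case. It assumes an 8000 Hz sample rate and a hypothetical helper frame_signal, which drops any trailing samples that do not fill a whole frame.

```python
import numpy as np

def frame_signal(signal, width):
    # Number of complete frames that fit in the signal.
    n_frames = len(signal) // width
    # Drop the incomplete tail, then view the rest as (n_frames, width).
    return signal[: n_frames * width].reshape(n_frames, width)

# 2 minutes of silence at 8000 Hz: 8000 * 120 = 960000 samples.
audio = np.zeros(8000 * 120, dtype=np.float32)
frames = frame_signal(audio, 256)
print(frames.shape)  # (3750, 256)
```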
Oh I see, thanks for the explanation. If I understand correctly, you could wrap the window inside a flat_map. Here is a proof of concept:
# Ds producing 3 long sequences of different lengths
def generator_sequence():
    yield list(range(10))
    yield list(range(13))
    yield list(range(6))

ds = tf.data.Dataset.from_generator(generator_sequence, output_types=tf.int64)

def split_sequence(x):
    # Reshape the sequence in smaller sequences of 5 tokens
    window_size = 5
    sub_ds = tf.data.Dataset.from_tensor_slices(x)
    sub_ds = sub_ds.window(size=window_size, shift=window_size)
    sub_ds = sub_ds.flat_map(lambda x: x.batch(window_size))
    return sub_ds

ds = ds.flat_map(split_sequence)

for ex in ds:
    print(ex)
Output:
tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int64)
tf.Tensor([5 6 7 8 9], shape=(5,), dtype=int64)
tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int64)
tf.Tensor([5 6 7 8 9], shape=(5,), dtype=int64)
tf.Tensor([10 11 12], shape=(3,), dtype=int64)
tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int64)
tf.Tensor([5], shape=(1,), dtype=int64)
To add the labels, you could use tf.data.Dataset.zip((sub_ds, ds_labels)) with ds_labels = tf.data.Dataset.from_tensors(label).repeat(-1) inside the split_sequence function.
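The label suggestion could look roughly like the sketch below. The toy (sequence, label) dataset is an assumption standing in for a tfds dataset loaded with as_supervised=True, and tf.data.Dataset.from_tensors is used to repeat the label for every window of its sequence.

```python
import tensorflow as tf

# Toy (sequence, label) pairs: lengths 7 and 5, labels 0 and 1.
ds = tf.data.Dataset.from_tensor_slices(([7, 5], [0, 1]))
ds = ds.map(lambda n, lbl: (tf.range(n), lbl))

def split_sequence(x, label):
    window_size = 5
    # A Tensor has no window(); convert it to a Dataset of scalars first.
    sub_ds = tf.data.Dataset.from_tensor_slices(x)
    sub_ds = sub_ds.window(size=window_size, shift=window_size)
    sub_ds = sub_ds.flat_map(lambda w: w.batch(window_size))
    # Repeat the label endlessly; zip stops at the shorter dataset.
    ds_labels = tf.data.Dataset.from_tensors(label).repeat()
    return tf.data.Dataset.zip((sub_ds, ds_labels))

ds = ds.flat_map(split_sequence)

pairs = [(x.numpy().tolist(), int(y.numpy())) for x, y in ds]
print(pairs)
```

Each window keeps the label of the recording it came from, so the result can be batched and fed to Keras as (input, target) pairs.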
@Conchylicultor Thank you for your code example! It works, but in TF 2.0 it throws the warning:
W0520 02:13:22.157514 140508707882752 deprecation.py:323] From /usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/dataset_ops.py:410: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
options available in V2.
- tf.py_function takes a python function which manipulates tf eager
tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
an ndarray (just call tensor.numpy()) but having access to eager tensors
means `tf.py_function`s can use accelerators such as GPUs as well as
being differentiable using a gradient tape.
- tf.numpy_function maintains the semantics of the deprecated tf.py_func
(it is not differentiable, and manipulates numpy arrays). It drops the
stateful argument making all functions stateful.
Taking the GeneratorBasedBuilder approach I have less luck, however, as it throws an AttributeError: 'Tensor' object has no attribute 'window' error.
Standalone example:
# pip install tensorflow_datasets
import tensorflow_datasets as tfds
import tensorflow as tf
import numpy as np
# should see 2.x.x-x, not 1.x
print(tf.__version__)
# TODO(emp_dataset_tmj): BibTeX citation
_CITATION = """
"""
# TODO(emp_dataset_tmj):
_DESCRIPTION = """Sliding Window on audio samples
"""
class SlidingDataset(tfds.core.GeneratorBasedBuilder):  # TEST: number in class to prevent "name already registered" error
    """TODO(sliding_dataset): Short description of my dataset."""

    # TODO(sliding_dataset): Set up version.
    VERSION = tfds.core.Version('0.1.0')

    def _info(self):
        # TODO(sliding_dataset): Specifies the tfds.core.DatasetInfo object
        return tfds.core.DatasetInfo(
            builder=self,
            # This is the description that will appear on the datasets page.
            description=_DESCRIPTION,
            # tfds.features.FeatureConnectors
            features=tfds.features.FeaturesDict({
                # These are the features of your dataset like images, labels ...
                "audio_description": tfds.features.Text(),
                "audio": tfds.features.Tensor(dtype=tf.float32, shape=(None,)),
                "label": tfds.features.ClassLabel(names=["F", "M"]),  # num_classes=2 # M / F (testing)
            }),
            # If there's a common (input, target) tuple from the features,
            # specify them here. They'll be used if as_supervised=True in
            # builder.as_dataset.
            supervised_keys=('audio', 'label'),  # TODO set with config
            # Homepage of the dataset for documentation
            urls=[],
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        # TODO(sliding_dataset): Downloads the data and defines the splits
        return [
            tfds.core.SplitGenerator(
                name=tfds.Split.TRAIN,
                # TODO(sliding_dataset): Tune the number of shards such that each shard
                # is < 4 GB.
                num_shards=2,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
                    "foo": "spam",
                },
            ),
            tfds.core.SplitGenerator(
                name=tfds.Split.TEST,
                # TODO(sliding_dataset): Tune the number of shards such that each shard
                # is < 4 GB.
                num_shards=1,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
                    "foo": "eggs",
                },
            ),
        ]

    def _generate_examples(self, foo):
        """Yields examples."""
        # TODO(sliding_dataset): Yields examples from the dataset
        for length in [10, 13, 5]:
            if length % 2 == 0:  # even female
                lbl = "F"
            else:  # odd male
                lbl = "M"
            yield {
                "audio_description": foo,
                "audio": np.arange(length, dtype=np.float32) / 100,
                "label": lbl,
            }

builder = tfds.builder("sliding_dataset")
builder.download_and_prepare()

def split_sequence(ds_train, label):
    # Reshape the sequence in smaller sequences of 5 tokens
    window_size = 5
    #sub_ds = tf.data.Dataset.from_tensor_slices(x)
    ds_train = ds_train.window(size=window_size, shift=window_size)
    ds_train = ds_train.flat_map(lambda x: x.batch(window_size))
    return ds_train, label

ds_train = ds_train.flat_map(split_sequence)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
In my numpy sliding window I was using (sorry, not sure if I'm allowed to share more code):
# Inspired by: https://stackoverflow.com/questions/45730504/how-do-i-create-a-sliding-window-with-a-50-overlap-with-a-numpy-array
view = np.lib.stride_tricks.as_strided(soundwave, strides=strides, shape=shape)[0::self.step_width]
This creates a 2D view of the 1D array. The view shares memory with the original array, so no data is copied, which helps performance.
The closest I could find in TF is tf.strided_slice, but I couldn't manage to get the same result as in NumPy.
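One TensorFlow op that may come close to the strided NumPy view is tf.signal.frame, which cuts a signal into overlapping frames. The sketch below is an assumption about what the NumPy helper computes (frame length 4, step 2, so 50% overlap, on a toy 10-sample signal) and compares both results. Unlike as_strided, tf.signal.frame is not documented as returning a memory-sharing view.

```python
import numpy as np
import tensorflow as tf

signal = np.arange(10, dtype=np.float32)
frame_length, frame_step = 4, 2  # 50% overlap

# TensorFlow: overlapping frames, incomplete trailing frame dropped by default.
frames = tf.signal.frame(tf.constant(signal), frame_length, frame_step)

# NumPy: the equivalent memory-sharing strided view.
n_frames = 1 + (len(signal) - frame_length) // frame_step
item = signal.strides[0]
view = np.lib.stride_tricks.as_strided(
    signal, shape=(n_frames, frame_length), strides=(frame_step * item, item))

print(frames.shape)
print(np.array_equal(frames.numpy(), view))
```

Because it is a regular TensorFlow op, it can be used directly inside ds.map without converting to NumPy.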
While continuing to explore this problem, I came across the groove.py example. They reshape the data in the _split_generators() function itself, meaning it can be done in pure NumPy before it is converted to a Tensor.
However, a for loop with copy.deepcopy(pm) is used, which probably hurts performance for large arrays.
Taking the same idea, I integrated the SlidingWindow function, which I had already made before, in _split_generators() and return a 2D view of the audio array.
_info(self): the audio shape has to be defined with tfds.features.Tensor(dtype=tf.float32, shape=(None, sliding_window_width)), because two None's are not allowed. That means I e.g. have to create a tfds.core.BuilderConfig for every sliding window width, which means I will get multiple copies of the dataset that differ only in the output shape.
.map() / .flat_map is more flexible. Will I suffer performance issues if, in this function, I convert the Tensor to NumPy, modify it, and it gets converted back to a Tensor? If this is a smart approach, do you know how I solve the issue of np.array(Tensor) not converting the Tensor to a NumPy array (instead giving a Tensor("args_0:0", shape=(None,), dtype=float32))?
tf.data.Dataset.from_generator() approach, but I heard that no Graph is constructed then, losing some advantages of TF 2.0?
tfds.core.GeneratorBasedBuilder, completely putting an interface between data and training. Everything is contained in this class and another user only has to choose the config that is right for him/her. If I use tf.data.Dataset.from_generator(), it would require some custom-made solution for modifying the generator function, taking away the advantage of specifying everything in a DataSet class. Is it possible to combine GeneratorBasedBuilder with tf.data.Dataset.from_generator(), which would basically be GeneratorBasedBuilder minus the creation of .tfrecord files (preventing the creation of multiple versions of the same data, with only the shape different)?
map()?
Sorry for taking up so much of your time, but I really appreciate your insights and I hope it will help others dealing with variable-length audio data!
In the code you posted, you get an AttributeError: 'Tensor' object has no attribute 'window' because you commented out the sub_ds = tf.data.Dataset.from_tensor_slices(x) line.
You cannot apply window() on a tensor, only on a dataset object, so you need to convert the Tensor to a Dataset first.
tf.data.Dataset.from_generator() was just for demonstration purposes. You can plug in any tf.data.Dataset object, including the ds returned by tfds: ds = tfds.load('xyz').
Thank you for your explanation. For some reason I assumed from_tensor_slices(x) was for converting NumPy to Tensor, therefore I commented it out. I'll play around a bit more to get the output to my liking. If no problems related to this topic come up in the next few days, I'll close it.
A question related to the original title: How to use a dataset for training that hasn't specified any supervised key, such as the "nsynth" database?
(on colab): ds_train, ds_test = tfds.load(name="nsynth", split=["train", "test"], batch_size=4, as_supervised=True, data_dir="gs://tfds-data/datasets")
ValueError Traceback (most recent call last)
<ipython-input-26-02f5a1f7fa0c> in <module>()
----> 1 ds_train, ds_test = tfds.load(name="nsynth", split=["train", "test"], batch_size=4, as_supervised=True, data_dir="gs://tfds-data/datasets")
7 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py in _build_single_dataset(self, split, shuffle_files, batch_size, as_supervised)
386 raise ValueError(
387 "as_supervised=True but %s does not support a supervised "
--> 388 "(input, label) structure." % self.name)
389 input_f, target_f = self.info.supervised_keys
390 dataset = dataset.map(lambda fs: (fs[input_f], fs[target_f]),
ValueError: as_supervised=True but nsynth does not support a supervised (input, label) structure.
If you remove as_supervised and try to use model.fit_generator(ds_train.take(16), epochs=2), you'll get: ValueError: Output of generator should be a tuple `(x, y, sample_weight)` or `(x, y)`. Found: {'audio': <tf.Tensor: id=1357, ...
Full colab notebook: https://colab.research.google.com/drive/1h4G_I6Zp9rc1Ib0m1SUi3oxa0iCETN-D
Would it be possible to add an example for these cases? No tutorial has an actual use-case:
If so, you should not use as_supervised=True, and the ds will return the original dict:
for ex in tfds.as_numpy(tfds.load(...)):
    ex['audio']
    print(ex.keys())  # Print the available fields
You can see the fields names which will be returned in the doc: https://www.tensorflow.org/datasets/datasets#nsynth
Ah yes, but how do I pass them to fit_generator Keras / beginner style, not expert style (https://www.tensorflow.org/beta/tutorials/quickstart/beginner)?
It's not as if any of the following work:
model.fit_generator((ds_train["audio"], ds_train["pitch"]), epochs=2)
model.fit_generator([ds_train["audio"], ds_train["pitch"]], epochs=2)
model.fit(ds_train["audio"], ds_train["pitch"], epochs=2)
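One way that may work for dict-style datasets is to map each example dict to an (input, target) tuple before handing the dataset to Keras, which is what as_supervised=True does internally according to the traceback above. This is only a sketch under assumptions: the toy dataset with "audio" and "pitch" keys stands in for nsynth, and the tiny model and its shapes are hypothetical.

```python
import tensorflow as tf

# Toy stand-in for a tfds dataset that yields feature dicts.
ds_train = tf.data.Dataset.from_tensor_slices({
    "audio": tf.random.uniform((8, 16)),
    "pitch": tf.constant([0, 1, 2, 3, 0, 1, 2, 3], dtype=tf.int64),
})

# Pick the input and target fields, then batch for Keras.
ds_xy = ds_train.map(lambda ex: (ex["audio"], ex["pitch"])).batch(4)

# Hypothetical minimal classifier over 4 pitch classes.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

history = model.fit(ds_xy, epochs=1, verbose=0)
print(history.history["loss"])
```

Indexing the dataset object itself (ds_train["audio"]) fails because the dict only exists per element, inside map or iteration.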
@NumesSanguis Hi. What if I have to use dict based datasets with keras model's fit function? For example, I have several inputs and several outputs, so I prepare the dataset as what tf.estimator used (feature dict and label dict).
inputs = tf.data.Dataset.from_tensor_slices({"input1": input1, "input2": input2, "input3": input3})
labels = tf.data.Dataset.from_tensor_slices({"label1": label1, "label2": label2})
@npuichigo I think you mentioned the wrong person. I was also just trying to get it to work. In this chat, @Conchylicultor is one of the maintainers of this repository and therefore more knowledgeable.
And what is actually your question? You want to know if, and how, a dict can be turned into a DatasetBuilder?
@NumesSanguis Thank you for reminding me. I want to know how to use dict like Dataset with keras.Model.fit.
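A possible sketch for the dict case, under the assumption that the installed TensorFlow accepts dicts for a functional model's inputs and outputs (this was not verified against TF 2.0 alpha). The feature and label datasets are zipped into (features_dict, labels_dict) elements, and the dict keys are assumed to match the names of the Keras inputs and outputs. Shapes and data are made up.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: two inputs and one label.
input1 = np.random.rand(8, 3).astype("float32")
input2 = np.random.rand(8, 2).astype("float32")
label1 = np.random.rand(8, 1).astype("float32")

inputs = tf.data.Dataset.from_tensor_slices({"input1": input1, "input2": input2})
labels = tf.data.Dataset.from_tensor_slices({"label1": label1})
# Each element becomes a (features_dict, labels_dict) tuple.
ds = tf.data.Dataset.zip((inputs, labels)).batch(4)

# Input and output names are chosen to match the dict keys above.
in1 = tf.keras.Input(shape=(3,), name="input1")
in2 = tf.keras.Input(shape=(2,), name="input2")
x = tf.keras.layers.Concatenate()([in1, in2])
out = tf.keras.layers.Dense(1, name="label1")(x)
model = tf.keras.Model(inputs={"input1": in1, "input2": in2},
                       outputs={"label1": out})
model.compile(optimizer="adam", loss="mse")

history = model.fit(ds, epochs=1, verbose=0)
print(history.history["loss"])
```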
@Conchylicultor
I have a TensorFlow dataset emitting: {"char_ids": Tensor(...), "word_ids": Tensor(...)}, Tensor(...)
By passing the features as they are, I am able to access the dict inside my model's call function; however, I lose the batch size info completely, even if I set drop_remainder to True in the dataset APIs.
If I use this with Keras fit() or fit_generator(), the batch size information, which is much needed for the model calculation, is erased.
What I need help with / What I was wondering
I've created a Dataset by using tfds.core.GeneratorBasedBuilder and want to train a model with it in Keras style. I followed the tutorial: https://github.com/tensorflow/datasets/blob/master/docs/add_dataset.md#datasetbuilder
I was wondering how to use this dataset to train a model Keras style with TF 2.0. I want to make the model work with minimal code that looks like this:
With Keras, I was used to using model.fit_generator() for this purpose; however, that will result (not unexpectedly) in the error:
Full traceback:
```python
Epoch 1/5
---------------------------------------------------------------------------
ValueError Traceback (most recent call last) in
----> 1 model.fit_generator(ds_train, epochs=5)
/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
1513 shuffle=shuffle,
1514 initial_epoch=initial_epoch,
-> 1515 steps_name='steps_per_epoch')
1516
1517 def evaluate_generator(self,
/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training_generator.py in model_iteration(model, data, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch, mode, batch_size, steps_name, **kwargs)
211 step = 0
212 while step < target_steps:
--> 213 batch_data = _get_next_batch(generator, mode)
214 if batch_data is None:
215 if is_dataset:
/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training_generator.py in _get_next_batch(generator, mode)
363 raise ValueError('Output of generator should be '
364 'a tuple `(x, y, sample_weight)` '
--> 365 'or `(x, y)`. Found: ' + str(generator_output))
366
367 if len(generator_output) < 1 or len(generator_output) > 3:
ValueError: Output of generator should be a tuple `(x, y, sample_weight)` or `(x, y)`. Found: {'audio_description': , 'audio': , 'label': }
```
Question: How to use a DatasetBuilder with model.fit_generator in a single/few lines of code, Keras style?
What I've tried so far
Following the expert introduction to TF 2.0, I got this to work with minimal changes:
However, that defeats the new philosophy of easy-to-use TF 2.0 with Keras.
It would be nice if...
Please provide an example, Keras style, on how to easily use DatasetBuilder / GeneratorBasedBuilder to train a model, not stopping at for features in ds_train:.
Environment information (if applicable)
docker run -it --runtime=nvidia --rm -v /home/notebooks:/tf/notebooks -p 8889:8888 tensorflow/tensorflow:2.0.0a0-gpu-py3-jupyter
tensorflow-datasets version: '1.0.2' (tensorflow_datasets.version.__version__)
tensorflow-gpu version: 2.0.0-alpha