tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0

tfio.IOTensor in parallelized audio tf.data.Dataset pipeline #1647

Open Lamomal opened 2 years ago

Lamomal commented 2 years ago

I have started using tfio.audio for data loading because its features (such as lazy loading) fit my use case well. However, I have encountered an issue when loading audio inside a tf.data.Dataset. To incorporate the tfio loading into a mapping function, I followed the solution described here. Everything works as expected when the data-loading map function uses a single parallel call (num_parallel_calls=1). The problem emerges with parallelization (num_parallel_calls>=2): the output of the data loading is then wrong, in a non-deterministic way. The number of produced examples is correct, but consecutive output signals may be identical (in both length and samples) even though they correspond to different file names and different underlying data.

The enclosed code snippet demonstrates the issue. To reproduce it with the provided audio examples (from Voxceleb 2), copy audio.zip to a directory of your choice and unzip it there, then run the code from that same directory. Thank you for looking into this issue.

TensorFlow version: 2.8.0
TensorFlow I/O version: 0.24.0
Python version: 3.9

import os
import glob
import tensorflow as tf
import tensorflow_io as tfio

num_parallel_calls = 2 # 1 -> correct, >=2 -> wrong
audio_dir = 'audio'
file_names = glob.glob(os.path.join(audio_dir, '*.wav'))

def load_audio(file_name):
    raw = tfio.IOTensor.graph(tf.int16).from_audio(file_name)
    return raw.to_tensor()

dataset = tf.data.Dataset.from_tensor_slices(file_names)
dataset = dataset.map(load_audio, num_parallel_calls=num_parallel_calls, deterministic=True)

for example in dataset:
    print(example.shape)

Expected output:

(150528, 1)
(244736, 1)
(159744, 1)
(80896, 1)
(68608, 1)
(88064, 1)
(166912, 1)

One of the outputs for num_parallel_calls = 2:

(244736, 1)
(244736, 1)
(159744, 1)
(68608, 1)
(68608, 1)
(166912, 1)
(166912, 1)

retunelars commented 1 year ago

I'm seeing similar issues when using num_parallel_calls > 1, which makes it impossible to use in an input pipeline. Maybe there is something in the underlying implementation that is not thread-safe somewhere...
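One way to confirm that the outputs are actually mismatched with their files (rather than only reordered) is to compare each output length with the frame count read independently from the WAV header. The sketch below is only an illustration: `expected_frame_counts` and `find_mismatches` are hypothetical helper names, and it assumes the files are uncompressed PCM WAV that Python's standard `wave` module can open.

```python
import wave


def expected_frame_counts(file_names):
    """Read the number of frames per file straight from each WAV header."""
    counts = []
    for name in file_names:
        with wave.open(name, "rb") as wav:
            counts.append(wav.getnframes())
    return counts


def find_mismatches(file_names, output_lengths):
    """Return (index, file_name, expected, actual) for every wrong length.

    output_lengths must be in the same order as file_names, which is what
    deterministic=True in Dataset.map is supposed to guarantee.
    """
    expected = expected_frame_counts(file_names)
    return [
        (i, name, exp, act)
        for i, (name, exp, act) in enumerate(
            zip(file_names, expected, output_lengths)
        )
        if exp != act
    ]
```

With the reproduction script above, this could be called as `find_mismatches(file_names, [int(x.shape[0]) for x in dataset])`. An empty list means every example has the length its file header declares.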

kenders2000 commented 2 months ago

Ahaha, I have been pulling my hair out over the same problem. I have a data loader using this, and I find that when I increase the number of parallel calls I get truncated audio in a non-deterministic way.

I see this post is from 2023, and the issue is still open, so I might be better off avoiding tfio for audio data loading. Unless anyone has a solution?
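A possible workaround, not verified against this particular bug, is to bypass tfio.IOTensor in the map function and decode with core TensorFlow ops instead. The sketch below uses `tf.io.read_file` and `tf.audio.decode_wav`; the function names `load_audio_core` and `build_dataset` are hypothetical. Note the trade-offs: lazy loading is lost because the whole file is read, only 16-bit PCM WAV is supported, and the samples come back as float32 scaled to [-1.0, 1.0] rather than int16.

```python
import tensorflow as tf


def load_audio_core(file_name):
    """Decode a 16-bit PCM WAV file using core TensorFlow ops only."""
    contents = tf.io.read_file(file_name)
    # decode_wav returns float32 samples in [-1.0, 1.0] with shape
    # (num_samples, num_channels), plus the sample rate (unused here).
    audio, _sample_rate = tf.audio.decode_wav(contents)
    return audio


def build_dataset(file_names, num_parallel_calls=2):
    """Same pipeline shape as the reproduction script, different loader."""
    dataset = tf.data.Dataset.from_tensor_slices(file_names)
    return dataset.map(
        load_audio_core,
        num_parallel_calls=num_parallel_calls,
        deterministic=True,
    )
```

Iterating `build_dataset(file_names)` and printing `example.shape` as in the original script gives a direct comparison with the expected output listed above.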