rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

OggZipDataset too slow #1256

Closed JackTemaki closed 1 year ago

JackTemaki commented 1 year ago

I did some tests on faster GPUs (Nvidia 3090) with current pipelines, and even if the feature extraction is done in the network, the data loading blocks the training (GPU utilization in nvidia-smi only at 60-70%, computing time at 90% or lower).

This means we urgently need to look for an alternative. Some people are already using HDFs containing the raw waveform, but I need the compression capabilities in my setups. I think the easiest approach would be a custom HDFDataset derivation which stores the raw ogg data inside and decodes it on the fly. If this is still too slow because we only have single-threaded data loading, we need to think about how to completely replace the current data loading approach.
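A minimal sketch of what such on-the-fly decoding could look like, assuming the raw OGG bytes are stored as variable-length byte arrays in the HDF file (the file and dataset names here are made up for illustration):

```python
# Hypothetical sketch: decode OGG bytes stored in an HDF5 file on the fly.
# Assumes h5py and soundfile; "corpus.hdf" and "inputs" are placeholder names.
import io
import h5py
import soundfile as sf

with h5py.File("corpus.hdf", "r") as f:
    ogg_bytes = bytes(f["inputs"][0])  # raw OGG file content for sequence 0
    # soundfile accepts file-like objects, so no temp file is needed
    audio, sample_rate = sf.read(io.BytesIO(ogg_bytes))
```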

JackTemaki commented 1 year ago

If only the .ogg decoding is blocking, we could also introduce some /var/tmp-based caching into the OggZip dataset, storing the raw waveform.
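A rough sketch of such a cache, assuming the decoded waveform is stored as a .npy file keyed by a hash of the sequence tag (all names here are illustrative):

```python
# Hypothetical /var/tmp cache: decode OGG only on cache miss, otherwise
# load the previously decoded waveform from disk.
import hashlib
import os
import numpy as np

def load_audio_cached(seq_tag, decode_fn, cache_dir="/var/tmp/oggzip-cache"):
    os.makedirs(cache_dir, exist_ok=True)
    fn = hashlib.sha1(seq_tag.encode("utf8")).hexdigest() + ".npy"
    path = os.path.join(cache_dir, fn)
    if os.path.exists(path):
        return np.load(path)
    audio = decode_fn(seq_tag)  # the expensive OGG decode
    np.save(path, audio)
    return audio
```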

albertz commented 1 year ago

We should do some investigation into what exactly the bottleneck is.

albertz commented 1 year ago

Whatever the bottleneck is, we probably can find some solution.

One quite generic solution is also to introduce multi-threading (or multi-processing) to make use of multiple CPUs.
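For instance, a generic multi-processing sketch along these lines (the decode function and file names are placeholders):

```python
# Decode several OGG files in parallel worker processes,
# sidestepping the Python GIL.
from concurrent.futures import ProcessPoolExecutor
import soundfile as sf

def decode_one(path):
    audio, sample_rate = sf.read(path)
    return audio

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        audios = list(pool.map(decode_one, ["a.ogg", "b.ogg", "c.ogg"]))
```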

If some bottleneck is just due to Python, we can also replace it with some native implementation. E.g. I have seen that PyTorch has some C++ support for reading ZIP files.

JackTemaki commented 1 year ago

> We should do some investigation into what exactly the bottleneck is.

* ZIP reading

* OGG decoding

* Feature extraction

* Other stuff

My current testing setup does not include feature extraction; I already confirmed it is a heavy bottleneck and removed it. We got speed increases of up to 50% afterwards.

albertz commented 1 year ago

> I think the easiest approach would be a custom HDFDataset derivation which stores the raw ogg data inside and decodes it on the fly.

The decoding can also be done by TF. I think even on GPU. Also batched.
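One way this could look, using the tensorflow-io package for in-graph OGG/Vorbis decoding (that this package fits the RETURNN pipeline is an assumption; the file name is a placeholder):

```python
# Decode an OGG/Vorbis file inside the TF graph via tensorflow-io.
import tensorflow as tf
import tensorflow_io as tfio

ogg_bytes = tf.io.read_file("example.ogg")
audio = tfio.audio.decode_vorbis(ogg_bytes)  # float32 tensor [samples, channels]
```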

JackTemaki commented 1 year ago

> I think the easiest approach would be a custom HDFDataset derivation which stores the raw ogg data inside and decodes it on the fly.

> The decoding can also be done by TF. I think even on GPU. Also batched.

Yes, but we cannot have that as the final solution if we want easy integration of data augmentation methods.

I completely forgot about the speed perturbation; it is active in 2/3 of the sequences, and on rough average, 30% of the time is ogg decoding and 70% is speed perturbation in my setup. The rest (text->bpe, zip file opening) was an order of magnitude or more below that.

So using HDF will not solve the problem; only multi-threaded data preparation does.

albertz commented 1 year ago

> Yes, but we cannot have that as the final solution if we want easy integration of data augmentation methods.

I don't understand. All data augmentation can (and should) also be part of the computation graph?

albertz commented 1 year ago

> speed perturbation

How do you do this? You are aware of the rnd_zoom_order option? (#1119)
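For reference, a sketch of how that option might appear in an OggZipDataset config (the exact nesting of the audio/random_permute options here is my assumption, not a verified config; values are illustrative only):

```python
train = {
    "class": "OggZipDataset",
    "path": "corpus.zip",
    "audio": {
        "features": "raw",
        # rnd_zoom_order selects the spline order of the random zoom used
        # for speed perturbation; lower orders should be cheaper to compute.
        "random_permute": {"rnd_zoom_order": 0},
    },
}
```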

JackTemaki commented 1 year ago

> Yes, but we cannot have that as the final solution if we want easy integration of data augmentation methods.

> I don't understand. All data augmentation can (and should) also be part of the computation graph?

No, they should not. In research you want fast exploration of new methods, and you do not want to spend a lot of time re-implementing existing functions from librosa or scipy as a TensorFlow graph (tf.signal is still very limited).

albertz commented 1 year ago

> Yes, but we cannot have that as the final solution if we want easy integration of data augmentation methods.

> I don't understand. All data augmentation can (and should) also be part of the computation graph?

> No, they should not. In research you want fast exploration of new methods, and you do not want to spend a lot of time re-implementing existing functions from librosa or scipy as a TensorFlow graph (tf.signal is still very limited).

I don't understand. You don't need to reimplement anything, and you can still easily wrap any existing implementation inside the TF computation graph. For example, there is py_function or py_func as one way to do this.
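A minimal sketch of that wrapping, assuming some NumPy-based augmentation function (some_augment here is a hypothetical stand-in for e.g. a librosa or scipy call):

```python
import tensorflow as tf
import numpy as np

def some_augment(x: np.ndarray) -> np.ndarray:
    return x * 0.9  # placeholder for an arbitrary NumPy/librosa/scipy transform

def augment_in_graph(x: tf.Tensor) -> tf.Tensor:
    # tf.py_function runs the Python code eagerly from inside the graph
    y = tf.py_function(lambda t: some_augment(t.numpy()), inp=[x], Tout=x.dtype)
    y.set_shape(x.shape)  # py_function drops static shape information
    return y
```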

JackTemaki commented 1 year ago

> Yes, but we cannot have that as the final solution if we want easy integration of data augmentation methods.

> I don't understand. All data augmentation can (and should) also be part of the computation graph?

> No, they should not. In research you want fast exploration of new methods, and you do not want to spend a lot of time re-implementing existing functions from librosa or scipy as a TensorFlow graph (tf.signal is still very limited).

> I don't understand. You don't need to reimplement anything, and you can still easily wrap any existing implementation inside the TF computation graph. For example, there is py_function or py_func as one way to do this.

How can you do it easily if you already have a batch? This seems to be much more effort than just calling `x = some_lib.some_function(x)`.

albertz commented 1 year ago

You can easily just loop over the sequences in the batch. And if you think that is too much boilerplate, you can also introduce some helper wrapper to do that for you. This is all very easy. There is no magic here.
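A sketch of such a helper wrapper, applying a per-sequence Python function over the batch axis (map_per_seq is a made-up name; output shapes must match across sequences for the re-stacking to work):

```python
import tensorflow as tf

def map_per_seq(fn, batched: tf.Tensor) -> tf.Tensor:
    # Unstack along the batch axis, apply the wrapped Python function to
    # each sequence, and re-stack the results.
    return tf.map_fn(
        lambda seq: tf.py_function(fn, inp=[seq], Tout=batched.dtype),
        batched)
```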

Having everything in the computation graph has so many advantages that we shouldn't really need to discuss this. E.g. when you have some Python-based implementation (then wrapped), you can replace it at some later point with a more efficient TF implementation. And you can do other things before the augmentation also in the graph, such as feature extraction.

JackTemaki commented 1 year ago

> You can easily just loop over the sequences in the batch. And if you think that is too much boilerplate, you can also introduce some helper wrapper to do that for you. This is all very easy. There is no magic here.

> Having everything in the computation graph has so many advantages that we shouldn't really need to discuss this. E.g. when you have some Python-based implementation (then wrapped), you can replace it at some later point with a more efficient TF implementation. And you can do other things before the augmentation also in the graph, such as feature extraction.

tf.py_function is not parallelizable (that's what the documentation says), so this will give the same speed disadvantage. And can this then be used in a serialized graph? I do not think so. I would definitely not go with that approach.

I am not sure if it is even worth going further in that direction; maybe we should just switch to the PyTorch backend for all new things we do (off-topic: mixed precision is much, much better documented for PyTorch than for TF, especially if you do not use a Keras-only setup). And then we can use the DataLoader from PyTorch and just have to make sure the OggZip dataset is thread-safe.

albertz commented 1 year ago

> tf.py_function is not parallelizable (that's what the documentation says), so this will give the same speed disadvantage. And can this then be used in a serialized graph? I do not think so. I would definitely not go with that approach.

I don't understand those arguments. So py_function has no disadvantage compared to your suggestion, but only advantages like those I described? So, what's the problem then? You can apply multi-threading just the same with or without py_function; py_function is not relevant there. Having the data augmentation code outside of your graph also implies that your data augmentation is not part of the serialized graph. What's the argument now? Still, py_function allows having it all in the graph, which is a huge advantage, to have it all unified, and potentially allows for many other things which are not possible otherwise, e.g. having everything on GPU.

> we can use the DataLoader from PyTorch and just have to make sure the OggZip dataset is thread-safe.

I'm not sure I understand. OggZipDataset does not need to be thread-safe for that. If you want to wrap RETURNN datasets as a PyTorch Dataset and then use DataLoader, this will not use any multi-threading. If you use the DataLoader with multiple workers, this will use multi-processing, i.e. thread-safety is also not needed.
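A rough sketch of such a wrapper, using the standard RETURNN Dataset API (the wrapper class name and the exact set of data keys are assumptions):

```python
import torch

class ReturnnDatasetWrapper(torch.utils.data.IterableDataset):
    """Iterates over a RETURNN dataset sequence by sequence."""

    def __init__(self, returnn_dataset, epoch=1):
        self.dataset = returnn_dataset
        self.epoch = epoch

    def __iter__(self):
        self.dataset.init_seq_order(epoch=self.epoch)
        seq_idx = 0
        while self.dataset.is_less_than_num_seqs(seq_idx):
            self.dataset.load_seqs(seq_idx, seq_idx + 1)
            yield {key: self.dataset.get_data(seq_idx, key)
                   for key in self.dataset.get_data_keys()}
            seq_idx += 1

# With num_workers > 0, DataLoader forks worker processes (multi-processing),
# so no thread-safety is required of the wrapped dataset.
# loader = torch.utils.data.DataLoader(ReturnnDatasetWrapper(ds), num_workers=2)
```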

So basically nothing needs to be changed.

Btw, we can also do the same for the TF backend, i.e. supporting multi-processing for the datasets. This could be implemented in many ways. Maybe just as another separate meta dataset, like MultiProcDataset. This was already discussed before. Again, for this, nothing needs to be changed on the datasets, like OggZipDataset.
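Such a meta dataset might be configured along these lines (a sketch only; MultiProcDataset did not exist at this point, and the parameter names here are assumptions):

```python
train = {
    "class": "MultiProcDataset",
    # The wrapped dataset runs in worker subprocesses.
    "dataset": {"class": "OggZipDataset", "path": "corpus.zip"},
    "num_workers": 4,   # number of worker processes
    "buffer_size": 10,  # sequences buffered per worker
}
```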

However, I would still like to unify more. In TF terminology, having everything in the TF computation graph. But you can also have the equivalent in PyTorch, i.e. operating on PT tensors all the time, potentially batched tensors, potentially everything on GPU, including the OGG decoding.

I'm just arguing that if we want to rethink our dataset pipeline, including data augmentation, we should make it generic, and design it in a way that this is easily possible.

JackTemaki commented 1 year ago

> tf.py_function is not parallelizable (that's what the documentation says), so this will give the same speed disadvantage. And can this then be used in a serialized graph? I do not think so. I would definitely not go with that approach.

> I don't understand those arguments. So py_function has no disadvantage compared to your suggestion, but only advantages like those I described? So, what's the problem then? You can apply multi-threading just the same with or without py_function; py_function is not relevant there. Having the data augmentation code outside of your graph also implies that your data augmentation is not part of the serialized graph. What's the argument now? Still, py_function allows having it all in the graph, which is a huge advantage, to have it all unified, and potentially allows for many other things which are not possible otherwise, e.g. having everything on GPU.

From the documentation:

> Calling tf.py_function will acquire the Python Global Interpreter Lock (GIL) that allows only one thread to run at any point in time. This will preclude efficient parallelization and distribution of the execution of the program.

> The body of the function (i.e. func) will not be serialized in a GraphDef. Therefore, you should not use this function if you need to serialize your model and restore it in a different environment.

Yes, you can replace it at some point, but before that it will give disadvantages (export impossible, more implementation effort), and there is no speed gain.

> we can use the DataLoader from PyTorch and just have to make sure the OggZip dataset is thread-safe.

> I'm not sure I understand. OggZipDataset does not need to be thread-safe for that. If you want to wrap RETURNN datasets as a PyTorch Dataset and then use DataLoader, this will not use any multi-threading. If you use the DataLoader with multiple workers, this will use multi-processing, i.e. thread-safety is also not needed.

Timur did that in his PyTorch-Vocoder setup, and exactly this did not work yet.

> So basically nothing needs to be changed.

> Btw, we can also do the same for the TF backend, i.e. supporting multi-processing for the datasets. This could be implemented in many ways. Maybe just as another separate meta dataset, like MultiProcDataset. This was already discussed before. Again, for this, nothing needs to be changed on the datasets, like OggZipDataset.

> However, I would still like to unify more. In TF terminology, having everything in the TF computation graph. But you can also have the equivalent in PyTorch, i.e. operating on PT tensors all the time, potentially batched tensors, potentially everything on GPU, including the OGG decoding.

One more reason to switch to PyTorch right away, to be able to use torchaudio.
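For illustration, loading and speed-perturbing audio with torchaudio could look roughly like this (the sox "speed" effect exists in torchaudio.sox_effects; the file name is a placeholder):

```python
import torchaudio

waveform, sample_rate = torchaudio.load("example.ogg")
# Speed perturbation via sox effects; "rate" resamples back to the
# original sample rate after the speed change.
perturbed, new_rate = torchaudio.sox_effects.apply_effects_tensor(
    waveform, sample_rate,
    effects=[["speed", "0.9"], ["rate", str(sample_rate)]])
```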

> I'm just arguing that if we want to rethink our dataset pipeline, including data augmentation, we should make it generic, and design it in a way that this is easily possible.

I do not understand why we should design anything, if with PyTorch we do not need to do anything because it is already there (no matter if we do it in the data loader or in the graph). All we need to do is maybe update some Datasets.

albertz commented 1 year ago

What you quote from the py_function doc is a fundamental restriction of Python itself, not of py_function. It does not matter where you have that Python code, whether it is somewhere independent or inside a py_function. It just says that Python always has the GIL and does not really support multi-threading.

> it will give disadvantages (export impossible, more implementation effort)

Both are not true, compared to your alternative.

For your alternative, the export is just as impossible. And strictly speaking, "impossible" is also not quite correct. Surely it is possible; you can always wrap it in a TF op, although this might require a bit more effort, but it is certainly possible. But it's actually totally impossible for your approach, as this is totally outside the graph, and thus cannot be serialized, unless you make it part of the graph, but then you have again exactly what I proposed.

But why do you even care about doing that anyway? We normally do not serialize the training graph. And for the recognition graph, this would anyway not be part of the graph, so it does not matter. But the feature extraction, maybe even OGG decoding, could be part of the graph, even for recognition. It might speed up the recognition on GPU, as you have less CPU->GPU memory copying.

More implementation effort is also wrong. It is exactly the same implementation effort, no matter if it is inside a py_function or not.

> I'm not sure I understand. OggZipDataset does not need to be thread-safe for that. If you want to wrap RETURNN datasets as a PyTorch Dataset and then use DataLoader, this will not use any multi-threading. If you use the DataLoader with multiple workers, this will use multi-processing, i.e. thread-safety is also not needed.

> Timur did that in his PyTorch-Vocoder setup, and exactly this did not work yet.

I'm not sure what he did. In any case, it's not about multi-threading. Probably he did something wrong; I assume his PyTorch dataset wrapper implementation was incorrect.

> One more reason to switch to PyTorch right away, to be able to use torchaudio.

I'm not sure I understand your argument, or whether you understood my argument. My argument was that TF and PT are just the same in this respect. So this is not a reason to switch to PT.

> I do not understand why we should design anything, if with PyTorch we do not need to do anything because it is already there (no matter if we do it in the data loader or in the graph). All we need to do is maybe update some Datasets.

I don't understand what you mean. In any case (no matter if TF or PT), you need to think about how to implement things: what the dataflow is, what the container file format is (HDF? ZIP?), what the audio file format is (OGG? FLAC? WAV? or features?), and how to do the processing, feature extraction, and data augmentation (multi-proc, multi-threaded, CPU, GPU, batched or not). This needs to be designed. And it doesn't really matter whether it is for TF or PT; that's the point. I'm just saying it should be unified, i.e. this design should be flexible, for example such that the user can easily move everything to GPU if they want. Your suggested approach would make this impossible.

albertz commented 1 year ago

In any case, this whole discussion here is quite off-topic from the original issue. The original issue was just about OggZipDataset, or not?

For discussing the data pipeline, there is actually issue #292. The issue #292 discussion revolves around tf.data and other TF functions; however, all the discussion there is just as valid for PT: for everything said there, you can do the equivalent thing in PT.

albertz commented 1 year ago

More on topic: As said, we should do some actual profiling here. Most relevant are the questions: how fast is ZIP reading, and how fast is OGG decoding? OGG decoding is already native, and I assume fast enough (but I don't know). ZIP reading, I don't know; I think this is partly pure Python, which makes it slow, so it could be improved by a native implementation. But we should just test this and know the numbers.
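A minimal timing sketch that separates the two (the file name is a placeholder; soundfile as the decoder is an assumption):

```python
import io
import time
import zipfile
import soundfile as sf

with zipfile.ZipFile("corpus.zip") as z:
    name = z.namelist()[0]
    t0 = time.perf_counter()
    raw = z.read(name)                    # ZIP read + decompress (pure-Python path)
    t1 = time.perf_counter()
    audio, sr = sf.read(io.BytesIO(raw))  # native OGG decode
    t2 = time.perf_counter()

print(f"zip read: {t1 - t0:.4f}s, ogg decode: {t2 - t1:.4f}s")
```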

But then again a bit more meta:

It could also turn out that ZIP is just a bad container format for this in any case, and HDF is just better. In that case, there is no point in further tuning the OggZipDataset. Or maybe not HDF but some other format, e.g. Apache Parquet (used by HuggingFace), or so.

Maybe it also turns out that OGG decoding is too slow, and other audio formats allow for faster decoding. Or we just decide against having audio compressed.

JackTemaki commented 1 year ago

As the problem and the discussion deviated into general data processing issues and are not related directly to the OggZipDataset itself, we can close this issue. When using the "bare" pipeline (only loading ogg audio), the zip opening was (as written earlier) irrelevant in its time consumption, and the .ogg decoding in my setup takes ~0.15 seconds for 7 sequences, while processing the batch takes 0.4 to 0.5 seconds.

albertz commented 1 year ago

Sorry, I did not get this. You say you already profiled, and ZIP reading (not opening) is extremely fast, thus negligible? And OGG decoding is also very fast, thus negligible? Where can I see the times for ZIP reading?

albertz commented 1 year ago

Also, to further clarify:

This whole issue here then never really was about OggZipDataset, but actually about ExtractAudioFeatures and about _get_random_permuted_audio? Or actually only about _get_random_permuted_audio, since, as I understand you, you don't use the feature extraction and operate on the raw audio?

As I asked initially, are you aware of rnd_zoom_order? Do you use that correctly?

albertz commented 4 months ago

For reference: