keunwoochoi / torchaudio-contrib

A test bed for updates and new features | pytorch/audio

conversation: fast, general audio read #31

Open drscotthawley opened 5 years ago

drscotthawley commented 5 years ago

I love the benchmarks by @faroit on audio read speeds. I too have tried a number of different libraries, and have generally been using scipy.io.wavfile.read() for its speed whenever possible, but switching to the slower librosa.core.load() when more general reading capabilities are needed (e.g. resampling, or 24-bit WAV files).

And I've just today noticed that sometimes scipy throws warnings (e.g. when reading 24-bit files); the operation completes and the data "looks ok", but it is in fact slightly corrupted, e.g. off by 1e-5 on a -1...1 scale.

I think these warnings should be trapped as errors, and I want to share some of my work so that others can benefit.
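
To illustrate the trapping idea, here is a minimal sketch using scipy's own WavFileWarning (the file path is just a placeholder):

```python
import warnings
from scipy.io import wavfile

# Promote scipy's WavFileWarning to an error so a suspect read fails
# loudly instead of silently returning slightly corrupted samples.
with warnings.catch_warnings():
    warnings.simplefilter("error", wavfile.WavFileWarning)
    rate, data = wavfile.read("some_24bit_file.wav")  # placeholder path
```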

So, I'm submitting a PR with a little 'generic read' utility, which is adequate for my personal needs but probably not as sophisticated as what you guys are currently using. Nevertheless I'm contributing it to at least "get the conversation started". Feel free to include it or not, and/or overwrite it with your own, better utilities.

I put it in a new file, io.py: https://github.com/drscotthawley/torchaudio-contrib/blob/master/torchaudio_contrib/io.py

The PR is here: https://github.com/keunwoochoi/torchaudio-contrib/pull/30

Presumably someone will want to write a from-scratch utility that uses pure torch operations and doesn't go through any intermediary libraries and/or numpy. This is just what I've got right now.

hagenw commented 5 years ago

Thanks for raising this issue; I have struggled with this for some time now. Our goal should be:

Read multi-channel audio from different formats like WAV, FLAC, MP3, OGG, MP4 (including movies) as fast as possible and without corrupting the data

For WAV, FLAC, and OGG this can be solved by soundfile which is fast and can handle 24-bit files as well. For other formats I tried librosa first (which uses audioread internally), but it is one of the slowest libraries at hand. In addition, it has also problems with corrupting the data: https://github.com/librosa/librosa/issues/811

In order to tackle that problem I created the audiofile package, which is a wrapper around soundfile, sox, and the external programs ffmpeg and mediainfo. The fastest and most reliable way I found for reading non-WAV/FLAC/OGG files was to convert them first to a temporary WAV file using sox or ffmpeg and read them with soundfile afterwards.
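
A rough sketch of that convert-then-read strategy (assuming an ffmpeg binary on PATH; this is not audiofile's actual code):

```python
import subprocess
import tempfile

import soundfile as sf

def read_via_tmp_wav(path):
    """Decode any ffmpeg-supported format (e.g. MP4) to a temporary WAV,
    then read it with soundfile. Returns (samples, sample_rate)."""
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        subprocess.run(
            ["ffmpeg", "-y", "-loglevel", "error", "-i", path, tmp.name],
            check=True,
        )
        return sf.read(tmp.name)
```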

To check for the correct number of samples, channels, and amplitude values, I added lots of tests to make sure the output is as desired.

That is what I use at the moment to handle audio files. My proposal would be to use something similar here, or at least adapt the testing policy for anything we implement here regarding audio file handling. I also think that reading of MP3 and MP4 might still be improved.

Finally, some benchmarking results. These were obtained by adding audiofile to faroit/python_audio_loading_benchmark.

Reading into a NumPy array:

[plot: benchmark_np]

Accessing metadata for duration, samples, channels, sampling rate. Note: for cases where you see no bar but only a white overlay on the y-axis, the needed time was around 0 s. Note further that audioread is faster for MP4 files than audiofile, but audioread didn't return the correct duration/number of samples for me.

[plot: benchmark_metadata]

drscotthawley commented 5 years ago

@hagenw I'm fine with you or someone else overwriting my proposed load & save routine in io.py with calls to your audiofile (although your dependence on sox would seem to violate what was proposed in the README).

It just seemed strange to me that, with all the talk of loading & saving in the README, this package wasn't actually offering any read or write capabilities, or even any links to any. So that's why I added a couple of routines (which would be modified).

I'd propose that the load() and save() from torchaudio_contrib still exist, even if they simply delegate to whatever the latest and greatest library out there is, with that library added as a dependency.

hagenw commented 5 years ago

@drscotthawley I agree that it is not completely clear if we should only discuss solutions and API designs here and implement them in https://github.com/pytorch/audio or if we should implement them here.

Regarding the sox dependency, this is indeed still open for discussion. We have to find a solution that on the one hand works as fast as possible and on the other hand requires the fewest possible external dependencies.

So far, I was not able to find a solution for audiofile without external programs. But even if we need external programs we could design it in a way that the external dependencies are not required during installation and import, but only if you actually want to load the corresponding file format (e.g. MP4).

faroit commented 5 years ago

@drscotthawley Thanks for your initial file support.

And I've just today noticed that sometimes scipy throws warnings (e.g. when reading 24-bit files); the operation completes and the data "looks ok", but it is in fact slightly corrupted, e.g. off by 1e-5 on a -1...1 scale.

Regarding 24bit, it seems not that difficult to actually manually read 24bit wav files, as seen in pydub.

It just seemed strange to me that, with all the talk of loading & saving in the README, this package wasn't actually offering any read or write capabilities, or even any links to any. So that's why I added a couple of routines (which would be modified)

Yes, so we planned to do it in three steps (sorry for not being clearer in the repo):

  1. Discuss whether we want to remove libsox as the default audio I/O library, considering it's not that fast, doesn't support a lot of formats, and is currently not that easy to install cross-platform.
  2. If we want to replace it, how would we do that? I would be in for a modular loading backend that would allow users to hook in their own I/O backend easily.
  3. Implement backends

@drscotthawley @hagenw if you agree with 2, we should start working on this in a PR, probably superseding #33, and once this is done decide which backends we want to include as defaults.

faroit commented 5 years ago

@hagenw Package looks great, I will have a look later...

So far, I was not able to find a solution for audiofile without external programs. But even if we need external programs we could design it in a way that the external dependencies are not required during installation and import, but only if you actually want to load the corresponding file format (e.g. MP4).

External programs are not bad per se. For example, I still think ffmpeg is one of the most reliable libraries for I/O. I think we should also talk a bit about packaging. Since many people use anaconda, we should probably look at the availability of the libs there. E.g. sox-dev is not there, which results in many people currently not being able to easily install torchaudio. In contrast, ffmpeg has good builds that include a lot of codecs for decoding.

faroit commented 5 years ago

Another thing: before trying to implement python based i/o ops, we should really consider the options of C interfaces, as torchaudio is currently doing with sox. So ideally we would interface soundfile or ffmpeg from libtorch directly. That way we might get a significant speed-up as well.

I found that torchaudio might not be faster than soundfile for iterating through a linear dataset like in my benchmarks, but at least I was able to get better loading performance when using it together with multiprocessing.

I haven't evaluated this systematically, but my guess is that the pytorch dataloader expects the dataset's __getitem__ to be threadsafe. Therefore OS calls to ffmpeg are probably not optimal if we rely on multiple workers? @hagenw do you have some experience here?

f0k commented 5 years ago

It just seemed strange to me that, with all the talk of loading & saving in the README, this package wasn't actually offering any read or write capabilities,

Yes, it's something that needs to be discussed as well. For now this is more of a construction site and discussion board than a package meant to be used.

Regarding 24bit, it seems not that difficult to actually manually read 24bit wav files

A fun trick is to view them as an int32 array with a stride of 3 bytes and then mask the lowest byte (I prefer this over shifting, since this way you'll cover the full value range of the dtype and don't have to remember how many bits you had). Works well, but you cannot simply memory-map the samples any more.
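
In numpy the trick could look roughly like this (a sketch assuming a little-endian host; the one-byte pad it copies up front is also why the samples can no longer be memory-mapped):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def pcm24_to_int32(data: bytes) -> np.ndarray:
    """View little-endian 24-bit PCM as int32 with a 3-byte stride, then
    mask the garbage low byte, so each sample covers the full int32
    range without any shifting."""
    n = len(data) // 3
    # One leading pad byte makes every 4-byte window [3i, 3i+4) end
    # exactly on a sample boundary.
    buf = np.frombuffer(b"\x00" + data, dtype=np.uint8)
    i32 = as_strided(buf[:4].view(np.int32), shape=(n,),
                     strides=(3,), writeable=False)
    return i32 & ~0xFF  # mask the lowest byte
```

Dividing the result by 2**31 then yields floats in [-1, 1).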

Therefore OS calls to ffmpeg are probably not optimal if we rely on multiple workers?

Umm, do you mean a subprocess.check_output() call? What would be problematic?

Another thing: before trying to implement python based i/o ops, we should really consider the options of C interfaces

Yes, there was a brief discussion on slack suggesting that we should interface with ffmpeg's libav libraries rather than call any binary. In addition, we raised the question of whether we should include our own code for resampling and channel downmixing, so as not to depend on the ffmpeg version and to share the same resampling algorithm between multiple backends.

hagenw commented 5 years ago

@f0k I would be in favor of excluding resampling and downmixing from the actual audio file loading methods and implementing them independently, e.g. I do it with transforms at the moment. It might not be ideal for datasets that contain mixed sampling rates, but from a design perspective I'm very much in favor of having separate methods for separate things.

drscotthawley commented 5 years ago

@hagenw Perhaps two different load() methods would be in order: one that is bare-bones, and one that is a larger wrapper around that, which provides the necessary resampling or downmixing.

I think being able to "load at a specific sample rate" (i.e. read and resample if necessary) and at a certain number of channels needs to be a standard feature for a load method, whether this means calling external libraries like ffmpeg or supplying our own code.

Currently my code only triggers the resampling if it's necessary. But it will save the user from thinking they're loading a 44.1kHz file when in fact it was 48kHz. (Probably a slightly faster method would be to check the metadata first rather than reading the whole file as in my method.)

Often audio datasets are inhomogeneous, and/or sometimes I just want to try something at a reduced sample rate for the sake of speed or memory. Ideally users should pre-process the dataset, but in my experience with Panotti, users so frequently failed to do this that I got sick of fielding support "Issues" from people who intermingled mono & stereo files, or 44.1kHz and 48kHz, etc., and just wrote the conversions into the code. In my own (hopefully more careful) research, I still think it's better to build these into the loader than to find out later that the file loader wasn't giving you what you wanted.

keunwoochoi commented 5 years ago

I'm very much in favor of having separate methods for separate things.

I still think it's better to build these into the loader than to find out later that the file loader wasn't giving you what you wanted.

Actually, both approaches could be quite problematic. A lot of librosa issues are about why librosa's load resamples to 22050 Hz by default, because users (including myself) at first expected it to load at the file's native sampling rate (which is done by setting sr=None, and which I think could be improved unless there's a backward compatibility issue). So, I'd be down for making things more explicit, but we also want to make things as convenient as possible.

Maybe this can be addressed by having one load() that does the resampling if needed, but making the argument names very clear and required? E.g.,

def load_audio(path:str, resample_to:int, downmix: bool)

i.e., resample_to= rather than sr=, and all of them are required, so users will notice unexpected behavior. This would ask people to think about the sampling rate and the number of channels (so they can't go the dumb way), and I think that's what I'd encourage people to do. Well, to configure the network they should be aware of how many channels they'll use anyway, and if they missed what sampling rate to use, it's a good reminder.
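
For instance (a minimal sketch; the keyword-only signature is the point, while the soundfile-based body and the omitted resampler are just placeholders):

```python
import soundfile as sf
import torch

def load_audio(path: str, *, resample_to: int, downmix: bool) -> torch.Tensor:
    data, rate = sf.read(path, dtype="float32", always_2d=True)  # (frames, channels)
    x = torch.from_numpy(data.T.copy())                          # (channels, frames)
    if downmix and x.shape[0] > 1:
        x = x.mean(dim=0, keepdim=True)
    if rate != resample_to:
        raise NotImplementedError("resampling is left out of this sketch")
    return x
```

Calling it as load_audio("x.wav", resample_to=44100, downmix=False) makes the caller's expectations explicit at the call site.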

hagenw commented 5 years ago

I see the convenience point of adjusting the signals during loading. Using more explicit argument names would be a welcome improvement over librosa.

Short comment on downmix: to be fair, in comparison to resampling it should maybe be more like this:

def load_audio(path:str, resample_to:int, remix_to:int)

as with multi-channel audio you might also want to upmix some files.

A related problem is what additional arguments we want to offer here. For example, in the remixing and resampling transforms we use internally (BTW we are working on making that code available as well) we provide several arguments to specify the used methods. Do we want to provide all those arguments for audio loading as well, or just use some default resampling and remixing algorithms?

keunwoochoi commented 5 years ago

@hagenw Yeah, I also thought the same. It would depend on what functionality we'd have though - if it's only N->1 downmix, maybe keep it as downmix or downmix_to_mono, otherwise remix_to would fit better.

faroit commented 5 years ago

Just want to add that if we talk about spectrogram models, you would want to do the downmix there, and not in the time domain, to maximize the channel energy.

hagenw commented 5 years ago

Interesting, but does it really affect what we decide for load_audio, or will it have some implications?

faroit commented 5 years ago

I would generally vote against any pure python or external tool based loading or pre-processing code for torchaudio. It may be convenient, but training loops need to be efficient to maximize GPU utilisation. Also, we will end up with too many installation and packaging issues. Currently torchaudio already has the problem that libsox-dev is not that easy to install on some platforms, as it isn't available on conda-forge.

We could possibly look into CUDA-enabled resampling, I guess, but I would propose to handle efficient basic audio loading for different codecs first.

faroit commented 5 years ago

Interesting, but does it really affect what we decide for load_audio, or will it have some implications?

No, it's fine. It's just a remark that we might add a mono flag for Spectrogram.

hagenw commented 5 years ago

@faroit do you think it is really necessary to write our own libsndfile interface instead of using soundfile?

faroit commented 5 years ago

@faroit do you think it is really necessary to write our own libsndfile interface instead of using soundfile?

It's not necessary, but:

  1. it would make it easier to call from libtorch (which will be important for production/deployment)
  2. we have fewer python requirements
  3. it will probably be a bit faster

f0k commented 5 years ago

def load_audio(path:str, resample_to:int, remix_to:int)

Oh, a name discussion :heart:! I like resample_to as a clearer alternative to sample_rate. In some of my code, I used sample_rate=None to load as-is, and sample_rate=x to bail out if the file does not have the expected sample rate. downmix is definitely too limited; we want users to be able to specify the number of channels. We may want to go from 4 to 2 or from 1 to 3. The term remix_to seems a bit too broad to me. It doesn't only call for a number of channels, it looks like I could do remix_to='dubstep'. Hmm.

Of course it should still be possible to load the file as-is.

Well, to configure the network they should be aware of how many channels they'll use

The network may also be defined to deal with an arbitrary number of channels. It could even change its behaviour depending on the number of channels.

if they missed what sampling rate to use, it's a good reminder

Again, the pipeline (STFT, filterbank) could adapt to the sample rate of the input instead of resampling the signal before.

(This is actually another point to be discussed: Can we attach metadata to the tensors we pass around? The number of channels and the length is encoded in the shape. But can we attach the rate for the time dimension, which would be the sample rate of a signal or the frame rate of a spectrogram? And the frequency range of a spectrogram? This would allow the pipeline to properly handle inhomogeneous datasets.)

making the argument names very clear and required

I'd tend to have it return the audio as-is by default, but I'm also intrigued by the idea to just make the keywords required. It will annoy some people, but avoid headaches for others.

For example, in the remixing and resampling transforms we use internally (BTW we are working on making that code available as well) we provide several arguments to specify the used methods. Do we want to provide all those arguments for audio loading as well, or just use some default resampling and remixing algorithms?

I'd say we will want users to be able to customize this. One way could be to have a resampler and mixer argument that you can pass a Resampler or Mixer instance to. If you don't, it will use the builtin resampler or mixer of the audio loading backend. This would encapsulate whatever other options the resampler or mixer takes. Alternatively, if we're afraid to bloat the torchaudio.load function with this, we can tell people to chain up the loading, mixing and resampling themselves if they want full control, and just provide the building blocks.
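
Sketching the "pass an instance" option (class and argument names here are hypothetical):

```python
from typing import Optional

import numpy as np

class Mixer:
    """Encapsulates its own channel-mixing options, so load() doesn't
    need to mirror them; only N->N and N->1 in this sketch."""
    def __init__(self, out_channels: int):
        self.out_channels = out_channels

    def __call__(self, x: np.ndarray) -> np.ndarray:  # x: (channels, samples)
        if x.shape[0] == self.out_channels:
            return x
        if self.out_channels == 1:
            return x.mean(axis=0, keepdims=True)
        raise NotImplementedError

def load(path: str, *, mixer: Optional[Mixer] = None):
    ...  # decode, then apply `mixer` if given, else the backend's default mixdown
```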

As we discussed in #29, we also want the loader to be able to access a part of a file only. The backend should implement this efficiently, if it can, by only accessing and decoding the part that is needed. At the very least, if it cannot seek in the file, it should not decode beyond the end of the required segment. This could be handled by two further arguments seek_to/start_at/start and length that default to None. This would also require a second function that just returns the metadata of a file -- sample rate, number of channels, and length. And on top, for some use cases, it should be possible to open a file and then read consecutive chunks. E.g., if I have a 2-hour 4-channel 192 kHz FLAC file I want to run inference on, I don't want to decode it fully into memory, nor do I want to call load again and again with different start arguments, since that will be slow.
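
As a point of reference, soundfile already covers both access patterns, which could inform the API (file name and chunk handling here are illustrative):

```python
import soundfile as sf

# Read only a segment: a seekable backend decodes just what is needed.
segment, rate = sf.read("long.flac", start=48000, frames=96000)

# Stream consecutive chunks without decoding the whole file into memory.
with sf.SoundFile("long.flac") as f:
    for block in f.blocks(blocksize=f.samplerate):  # one-second chunks
        pass  # run inference on `block` here
```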

Side question: Are positions and lengths given in seconds or in samples? Seconds would be more convenient, but at some point we will lose sample precision. We could make sure to support fractions.Fraction wherever we deal with times, often it won't need any change in the code.

Speaking of the backend, @faroit had the idea to provide several audio reading backends. There should be a backend for wav files that is as fast as it gets and does not require any additional dependencies besides those needed by pytorch. We could then start adding simple and slow backends (such as calling ffmpeg over a pipe, or using existing Python libraries), and add better ones when we finish them. At no point should users have trouble installing torchaudio -- if some dependencies are not available, some backends will be missing. Backends may be chosen automagically based on the file type, but users should be encouraged to specify the backend explicitly, in order to enhance reproducibility.

What do you think of the proposed requirements / use cases? Is anything missing? If we agree on what we want users to be able to do, we can start discussing the API.

hbredin commented 5 years ago

Side question: Are positions and lengths given in seconds or in samples? Seconds would be more convenient, but at some point we will lose sample precision. We could make sure to support fractions.Fraction wherever we deal with times, often it won't need any change in the code.

In the speech community, most datasets come with annotations in seconds (e.g. who speaks when): therefore, I would go with seconds.

Sample precision is indeed an issue: we want to make sure that cropping a waveform between t=1.234s and t=4.234s (3 seconds long) results in exactly the same number of samples as cropping it between t=0s and t=3s. With respect to this issue, an API with start and length is probably better than one with start and end (as rounding errors could lead to (end - start) * sample_rate != length * sample_rate).
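
A small illustration of both points (exact rational times, and deriving the sample count from the length rather than from two independently rounded endpoints):

```python
from fractions import Fraction

sr = 44100
start = Fraction(1234, 1000)   # 1.234 s, represented exactly
length = Fraction(3)           # 3 s, represented exactly

offset = round(start * sr)     # a single rounding step
num = int(length * sr)         # 3 s is always exactly 132300 samples
# With (start, end), round(end * sr) - round(start * sr) can differ by a
# sample from round((end - start) * sr) once floats are involved.
```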

keunwoochoi commented 5 years ago

an API with start and length is probably better than one with start and end

I agree with it.

The term remix_to seems a bit too broad to me. It doesn't only call for a number of channels, it looks like I could do remix_to='dubstep'. Hmm.

Exactly, that's why I suggested keeping downmix as it is and only changing sr to resample_to. remix_to is difficult - it's hard to define what kinds of output channel numbers we're gonna allow, and how to implement it. I'd be up for leaving it for the users in those cases.

f0k commented 5 years ago

it's hard to define what kinds of output channel numbers we're gonna allow, and how to implement it

ffmpeg can do all combinations, they select something more or less reasonable (but not always, I've seen it output an empty channel when going from 4 channels to 4 channels). And other backends do not need to support all combinations of input and output channels. But all should be able to go from 1 to K by replicating (or expand), and then downmix is not a good name. Maybe go back and just use sample_rate and num_channels/channels instead of verbs? Or resample_to and channels_to/mix_channels_to/mix_to_channels? Or resample_to, downmix_to and allow_upmix?
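
The 1-to-K replication case is indeed trivial with expand, e.g.:

```python
import torch

mono = torch.randn(1, 48000)             # (channels, samples)
multi = mono.expand(3, -1)               # 1 -> K upmix by replication, no copy
back = multi.mean(dim=0, keepdim=True)   # naive K -> 1 downmix
```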

dansuh17 commented 5 years ago

Who would be able to write a libsndfile and/or FFMPEG C interface for libtorch?

I have experience using ffmpeg libraries (libavformat.h etc.) to implement bottleneck processes like resampling/decoding in C and then attaching golang or python interfaces. I could work on it if we settle on the API. However, before considering this option, we need to keep in mind that ffmpeg is a HUGE set of libraries (shared libraries amounting to ~200MB in total), targeted mainly at video processing, not audio processing. A huge chunk of the code is useless if we're dealing with audio only. I don't think users would want this.

ffmpeg can do all combinations

I agree, but I have a rather conservative opinion of supporting only .wav files or raw .pcm files. I think TF is taking this option as well (tf.audio.decode_wav). Decoding other formats like .mp3 or .aac is extremely difficult to implement from scratch. The decoding process should be done by the user (who, if dealing with audio data, most probably has ffmpeg or sox installed on their workstation), while we instead provide fast and simple operations only for PCM or WAV.

If this is the case, one remaining concern is audio formats with more than 2 channels (5.1, B-format, etc.); I have no idea how we should support those, as I have no experience processing them.

f0k commented 5 years ago

I agree, but I have a rather conservative opinion of supporting only .wav files or raw .pcm files. I think TF is taking this option as well (tf.audio.decode_wav).

I think that rather than comparing to TF, we should compare to torchvision, which also doesn't require users to decode to raw or bmp files.

If this is the case, one remaining concern is audio formats with more than 2 channels (5.1, B-format, etc.); I have no idea how we should support those

Wave files can have an arbitrary number of channels, and can include information on the channel mapping (using the WAVEFORMATEXTENSIBLE header).

faroit commented 5 years ago

I think that rather than comparing to TF, we should compare to torchvision, which also doesn't require users to decode to raw or bmp files.

This is true, but decoding a JPEG or PNG is a lot simpler than an HE-AAC decoder. And when it comes to other image formats like webp or jpeg2000, things also get messy on the PIL/pillow side.

Now, after all this discussion, I think we should decide/vote on what to do. I think we have three camps:

A

Abandon the sox interface and start from scratch with a clean libtorch implementation that at first only supports wav, but will be fast and efficient. Probably the best would be to use libsndfile here, as there is also a slight chance that they will integrate mp3 support. If we think we need more formats, we could then think of ffmpeg/libav later.

B

We stick with sox and propose to change the API to support things like resample_to and to utilize the built-in effects for data augmentation (#29). Further, we address the installation issues by providing torchaudio conda builds.

C

Remove any dependencies and rely on a pure python or pure torch based loader, potentially only supporting WAV decoding. This is doable, but maybe this would then make torchaudio less useful.

I would be okay with both... so what do you think?

hagenw commented 5 years ago

Both A and B require including an external C library. Are there any advantages of redoing the current sox implementation with libsndfile? Will it be easier to install with libsndfile-dev than sox-dev?

faroit commented 5 years ago

Both A and B require including an external C library

Yes, that's right. Maybe then there is C (updated the post above), which would remove any dependencies and rely on a pure python or pure torch based loader, which potentially would only support decoding wav files. This is doable, but maybe this would then make torchaudio less useful. I would actually vote for C, as I think the cleaner and faster, the better.

Are there any advantages of redoing the current sox implementation with libsndfile?

Speed, and less complexity. But I see your point: redoing a C interface would probably make more sense if we switched to a lib with support for many codecs, such as ffmpeg. But so far I haven't seen anyone willing to write such an interface.

Will it be easier to install with libsndfile-dev than sox-dev?

Yes, since for the additional codecs one would also need to install the libsox-fmt-* packages. Also, libsndfile is on conda-forge and seems to work (not tested) on Windows.

In any case, if we get torchaudio and sox on conda we could leave things as they are. @soumith @deNsuh what do you think about this?

f0k commented 5 years ago

Both A and B will require compiling something against external dependencies on installing torchaudio. How easy is it to conditionally compile only what we have dependencies for? To me it seems that whatever interface (ffmpeg, sox, sndfile) we choose, this would be more generic than torchaudio, and should be a separate module. We don't really need a compiled module that directly produces a torch tensor, we can use a compiled module that produces a numpy array and convert it. So this would mean going for C. And then I'd still suggest what I wrote in the last paragraph of https://github.com/keunwoochoi/torchaudio-contrib/issues/31#issuecomment-477551144: We can have multiple backends that are written in pure Python and make use of compiled third-party modules to read various file formats, but none of the third-party modules should be a hard dependency of torchaudio. If none are available, we should still be able to read .wav files.
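
A sketch of that layering (names are hypothetical): third-party readers register themselves only if importable, while a stdlib-only wav path is always available:

```python
import wave

import numpy as np

_BACKENDS = {}

try:  # optional dependency, never required at import time
    import soundfile as sf
    _BACKENDS["soundfile"] = lambda path: sf.read(path, dtype="float32")
except ImportError:
    pass

def _wav_fallback(path):
    """Always-available stdlib reader (16-bit PCM wav only in this sketch)."""
    with wave.open(path, "rb") as f:
        data = np.frombuffer(f.readframes(f.getnframes()), dtype="<i2")
        return data.astype(np.float32) / 32768.0, f.getframerate()

_BACKENDS["wav"] = _wav_fallback

def load(path, backend="wav"):
    # Asking users to name the backend explicitly aids reproducibility.
    return _BACKENDS[backend](path)
```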

faroit commented 5 years ago

If none are available, we should still be able to read .wav files.

I totally support this. Again, we should drop sox for a simpler solution that just supports wav and handles all other formats in optional modules.

Just a few days ago the new version of tensorflow io was released. They now included a wav dataset and ops that support wav decoding. I think this is the way to go: making wav as fast as it can be and making everything else optional. Do we agree on this?

keunwoochoi commented 5 years ago

I think that's great. For other formats people can use madmom etc.; still, super fast wav loading directly into torch would be great.

dansuh17 commented 5 years ago

@faroit I agree. After supporting WAV, we can discuss how to handle other formats. But we should keep extensibility in mind while implementing audio reads, and design interfaces into which we can plug various other libraries later on.

f0k commented 5 years ago

I think this is the way to go: making wav as fast as it can be and making everything else optional. Do we agree on this?

Agree.

They now included a wav dataset and ops that support wav decoding.

Cool. (Not sure that's the fastest way though.)

Again, we should drop sox for a simpler solution that just supports wav and handles all other formats in optional modules.

Where I think that the "optional modules" should be existing third-party modules, and we can decide whether the necessary wrappers would be part of torchaudio or not.

keunwoochoi commented 5 years ago

I think we agree at a high level that fast, direct wav decoding is the way to go instead of the current sox-based wrapping and piping.

@cpuhrsch Have you guys thought about this before? It'd be another significant and breaking change, independent of the directions we've been discussing so far.

cpuhrsch commented 5 years ago

@keunwoochoi - quickly reading audio files will absolutely become more and more important as transforms and networks run on faster and faster GPUs. This is similar to the data-loading issues you can experience in computer vision if the data isn't being fed fast enough.

I'd want to tackle this actively if we have a benchmark that shows that reading audio files is keeping a network from running at full speed. Then there are various ways of trying to improve that by running reads asynchronously and loading data into a buffer first so that we become throughput bound. After that a faster read operator might be necessary to increase throughput.
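
For reference, a sketch of what the asynchronous, buffered reads look like with today's tooling (the dummy dataset stands in for real file decoding):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class DummyAudio(Dataset):  # stand-in for a dataset that decodes audio files
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        return torch.randn(1, 16000)

# Worker processes decode ahead of the training loop and fill a prefetch
# buffer; profiling this against GPU utilisation is the benchmark in question.
loader = DataLoader(DummyAudio(), batch_size=16, num_workers=4,
                    prefetch_factor=2, pin_memory=True)
```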

In general, I'm wondering what the stance on offline conversion of audio datasets into fast audio formats is. It's convenient to be able to read any format, but surely some formats have been designed with speed in mind. Here we might also strike a balance between the amount of memory we need to read and some kind of compression algorithm.

vincentqb commented 5 years ago

For reference: pytorch/pytorch#24915