pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

Merging plan from torchaudio-contrib #110

Open keunwoochoi opened 5 years ago

keunwoochoi commented 5 years ago

Hi all, I think it's good timing to discuss a potential merging plan from torchaudio-contrib to here, especially because there are going to be new features and changes by @jamarshon @cpuhrsch.

Main idea

A lot of things are well summarized in https://github.com/keunwoochoi/torchaudio-contrib. In short, we wanted to re-design torch-based audio processing so that

Review - layers

torchaudio-contrib already covers many of the functions that transform.py currently provides, but not all of them. That's why I feel it's time to discuss this here. Let me list the classes in transform.py one by one with some notes.

1. Already in torchaudio-contrib. Hoping we'd replace.

2. Wouldn't need these

3. To-be-added

4. Not sure about these

Review - argument and variable names

As summarised in https://github.com/keunwoochoi/torchaudio-contrib/issues/46, we'd like to use

Audio loading

@faroit has been working on replacing Sox with alternatives. But here in this issue, I'd like to focus on the topics above.

So,

ksanjeevan commented 5 years ago

class MelScale: we have it and would like to suggest to change the name to something more general.

Just making sure it's clear: as it currently stands, our MelFilterbank is a wrapper around the create_mel_filter functional and merely creates the filterbank tensor (either htk or Slaney or even custom). Then we have ApplyFilterbank taking it in and using apply_filterbank functional to actually apply it to the stft.
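To make the two-stage design concrete, here is a minimal sketch of the separation described above: one piece builds the filterbank tensor, another module applies it to a spectrogram. The names mirror the ones mentioned (create_mel_filter, ApplyFilterbank), but the filterbank construction here is a placeholder, not the real htk/Slaney math.

```python
import torch
import torch.nn as nn

def create_mel_filter(num_bins, num_bands):
    # Placeholder filterbank; the real implementation spaces triangular
    # filters on the mel scale (htk or Slaney formulas, or custom).
    fb = torch.rand(num_bins, num_bands)
    return fb / fb.sum(dim=0, keepdim=True)

class ApplyFilterbank(nn.Module):
    def __init__(self, filterbank):
        super().__init__()
        # register_buffer so the tensor follows .to(device) / .cuda()
        self.register_buffer("filterbank", filterbank)

    def forward(self, spec):
        # spec: (..., freq, time) -> (..., mel, time)
        return torch.matmul(
            spec.transpose(-2, -1), self.filterbank
        ).transpose(-2, -1)

fb = create_mel_filter(num_bins=1025, num_bands=40)
mel = ApplyFilterbank(fb)(torch.rand(1, 1025, 100))  # shape (1, 40, 100)
```

Keeping the filterbank in a buffer means it is created once at construction time rather than on every forward call.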

Personally I'd prefer to have separate classes, DownmixWaveform() and DownmixSpecgram().

:+1: :+1:

complex_specgrams

A bit verbose?

A couple of things that come to mind I think we should also discuss:

hagenw commented 5 years ago

2. Wouldn't need these

I think it is a little bit restrictive to force everything to be a layer and don't allow for composing transforms, which still might be useful during training.

keunwoochoi commented 5 years ago

@hagenw Right, actually I had confused Compose with torchvision's transforms. I'll probably edit it later. Compose is fairly simple and can do things that nn.Sequential() can't, so there's probably no need to remove it. One thing, though: it'd be nice to have a suggestion on when to use nn.Sequential() versus Compose(). For example, IMO, if every component is a Layer, the better practice would be to use nn.Sequential().
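The difference is easy to see in code. Below is a minimal torchvision-style Compose (a sketch, not the torchvision class itself): it chains arbitrary callables, whereas nn.Sequential only accepts nn.Module instances.

```python
import torch
import torch.nn as nn

class Compose:
    """Chain arbitrary callables, torchvision-style."""
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, x):
        for t in self.transforms:
            x = t(x)
        return x

# A plain lambda is fine inside Compose...
pipeline = Compose([lambda x: x * 2.0, torch.abs])
out = pipeline(torch.tensor([-1.0, 2.0]))  # tensor([2., 4.])

# ...whereas nn.Sequential(lambda x: x * 2.0) raises a TypeError,
# because every element must be an nn.Module subclass.
```

So when every component is already a Layer (nn.Module), nn.Sequential works and brings parameter/buffer handling for free; Compose is only needed for non-Module callables.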

cpuhrsch commented 5 years ago

Thanks for writing this up @keunwoochoi!

To me there are three categories of features for the sake of this discussion

1) Entirely new standalone features. We can merge those right away.
2) Feature extensions. We can also merge those right away after a bit of discussion on a case-by-case basis to prevent featuritis.
3) Redoing existing torchaudio features. Here it would be good to merge them on a case-by-case basis. Let's start with those that require the least global decisions.

For all of these we need to keep backwards compatibility in mind.

On compatibility

Then there are other libraries (HTK, librosa, Kaldi, etc.) we might want to support / be compatible with. One goal for us has been to allow users to get native PyTorch support for the preprocessing functions they need from these libraries. I think these steps might be a good approach.

a) Support reading and writing their common data formats or layouts to and from PyTorch objects (e.g. Kaldi's).
b) Faithfully reimplement some of their widely used functionality in plain PyTorch to get native support with all its benefits (e.g. GPU). We should use our core compute backend (functionals.py) here and return PyTorch objects that can be written out into data that bitwise matches their output.

I think we should create a module such as torchaudio.interface.kaldi, torchaudio.interface.htk etc. (maybe with a different name) that creates a separate namespace for this. For example we might add torchaudio.interface.htk.utils.read_mel_filterbank and return a filterbank that aligns with our conventions for class MelFilterbank, but we don't add htk support as another flag. We could also just add torchaudio.interface.layers.MelFilterbank. A lot of this namespacing should probably still be reviewed lest it gets too nested.

I suspect feature requests that fall under b) will mostly be those that users apply online. A lot of work is still offline. That's where the data format readers come in.

Releases and backwards compatibility

Now on our upcoming releases. The next release will focus on:
- Freezing current API behavior for backwards compatibility to ease migration.
- Adding orthogonal features / feature extensions from torchaudio-contrib (functionality of types 1 and 2).
- Adding some interfaces (e.g. explicit htk and kaldi support).

We might even do the above release as soon as we have some good release engineering support (pip and conda etc.) and will then ship whatever made it in (within reason).

We can then quickly follow up with another release that allows us to deprecate current conventions while providing documentation and backwards compatibility. That release can then include features from category 3).

On migrating from torchaudio-contrib

Functionals: We should add functionals first. This should include good test coverage.

Layers: For each feature added through the functionals we then need to look at their exposition via a layer (the state to be cached).

Containers: Compose and Sequential are a bit too opinionated to be part of the core transforms.py. I'd suggest we move this into an examples folder or make it part of a utilities function. This however falls under category 3). If someone wants to combine transforms I'd expect them to call them in sequence without any container or, if they want to, use nn.Sequential etc. explicitly from the core library.

That also means I'd rather we not return an nn.Sequential from MelSpectrogram. We can add that combination as an example, or, if it's very different, make it a separate transform that applies these in sequence rather than returning an object that does so.

Another side-effect of this, at a functional level, is that it makes it easier to write performant code. This will more likely than not require us to pull entire functionals into C++ and/or branch into other performance libraries, which would rule out returning something like nn.Sequential and would then break backwards compatibility.

On Scale, PadTrim, LC2CL, BLC2CBL

The LC/CL and BLC/CBL setup feels a bit like the image formats (NHWC/NCHW) that are often supported for various reasons (mostly performance). I don't mind us being opinionated here for the sake of standardization and requiring users to use transpose until performance reasons prevent this. We should nonetheless mention this as a precondition in the docs for all functionals and layers that have an explicit assumption here. This also falls under category 3).

In summary, the end state might be:
i) We have a core compute backend of performant functionals with native PyTorch support.
ii) We have our own layers with our own opinionated API and standards.
iii) We have interfaces for other common libraries, including data format support, to ease their usage as a torchaudio user.

EDIT: Addition from Slack:

On "Freezing current API behavior for backwards compatibility":

" What I meant is that we should soon have an 0.0.1 release (or so) where we freeze the current API behavior. Then, if someone has backwards compatibility issues they can install 0.0.1 instead of the newest version (let's say 0.1) and refer to earlier release notes as a guide on how to move to the newest API.

The three categories of features I mentioned in my reply are meant to guide the development that can go into 0.0.1 and that can go into 0.0.2, where we'll deprecate some of the API we consider bad. Then in a 0.1 (or maybe an earlier 0.0.3) release we can delete the old API. This won't leave users of the current API in the rain, but gives them a clear path on how to adapt their current codebase. Aside from that, I'm all in favor of redoing our current API and I 100% agree that e.g. Scale seems like an unnecessary / confusing abstraction. "

keunwoochoi commented 5 years ago

Sounds good, then let's keep discussing here. Thanks for your reply!

Adding orthogonal features / feature extensions from torchaudio-contrib (functionality of type 1 and 2).

We can certainly add something interesting. However, unless there's a need for specific functions in -contrib, I'd be inclined not to add a new feature for the sake of efficiency and stability. What do you think about it?

I'd suggest we move this into an examples folder or make it part of a utilities function. This however falls under category 3). If someone wants to combine transforms I'd expect them to call them in sequence without any container or, if they want to, use nn.Sequential etc. explicitly from the core library.

That also means I'd rather we not return an nn.Sequential from MelSpectrogram.

I don't disagree with this idea in general, and a good examples folder would be awesome. But, for Melspectrogram, I still believe it's better to provide a wrapper in the library because it's such a popular feature -- similarly to nn.CrossEntropyLoss, which combines nn.LogSoftmax() and nn.NLLLoss(). It'd be nice to make things convenient and reduce potential mistakes.

But..

Another side-effect of this, at a functional level, is also that it's easier to write performant code. ... something like nn.Sequential and then break backwards compatibility.

I'm not sure I understood it correctly. If the Melspectrogram layer is an nn.Sequential(), there won't be any need for a functional for it. Could you elaborate?

cpuhrsch commented 5 years ago

We can certainly add something interesting. However, unless there's a need for specific functions in -contrib, I'd be inclined not to add a new feature for the sake of efficiency and stability. What do you think about it?

If there is orthogonal functionality in torchaudio-contrib, we can move it into torchaudio immediately. I understood the endgoal to be a full merge of torchaudio-contrib into torchaudio. This could be a first phase, which we can start right away.

I don't disagree with this idea in general, and a good example folder would be awesome. But, for Melspectrogram, I still believe it's better to provide a wrapper from the library because it's such a popular feature -- as similarly done with nn.CrossEntropyLoss which combines nn.LogSoftmax() and nn.NLLLoss(), hence it'd be nice to make things convenient and reduce potential mistakes.

I'm not sure I understood it correctly. If the Melspectrogram layer is an nn.Sequential(), there won't be any need for a functional for it. Could you elaborate?

Your original sentence reads

class MelSpectrogram: we have it, which returns an nn.Sequential model consisting of Spectrogram and a mel-scale filter bank.

What I meant is that I don't think it's a good idea to return (as in, return from the forward() function) an instance of nn.Sequential. I see now that you meant "inherit from nn.Sequential" instead, which makes a lot more sense. I still think that any layer should only inherit from nn.Module; this way we're least dependent on changes in PyTorch's class structure.

We should definitely add Melspectrogram if it's very popular and implement the composition just as we do with nn.CrossEntropyLoss.

nn.CrossEntropyLoss calls into F.cross_entropy, which then calls the log_softmax and nll_loss functionals. Therefore, the combination happens at the functional level, but the layer itself calls into the specific functional. That abstraction also enables us to drop in a fully native implementation of cross_entropy if we think there is value in e.g. fusing log_softmax and nll_loss.
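Applied to audio, that same layer-calls-functional pattern might look like the sketch below. The names (spectrogram, melspectrogram, Melspectrogram) and the API are illustrative assumptions, not the final torchaudio design; the point is that the composition happens in the functional, and the layer is a thin wrapper that owns the cached state.

```python
import torch
import torch.nn as nn

def spectrogram(waveform, n_fft=2048, hop=512):
    # Power spectrogram built on torch.stft.
    window = torch.hann_window(n_fft)
    stft = torch.stft(waveform, n_fft, hop_length=hop,
                      window=window, return_complex=True)
    return stft.abs() ** 2

def melspectrogram(waveform, filterbank, n_fft=2048, hop=512):
    # The composition lives at the functional level,
    # mirroring F.cross_entropy = log_softmax + nll_loss.
    spec = spectrogram(waveform, n_fft, hop)
    return torch.matmul(filterbank.t(), spec)

class Melspectrogram(nn.Module):
    def __init__(self, filterbank, n_fft=2048, hop=512):
        super().__init__()
        # The layer only holds state (the filterbank buffer)...
        self.register_buffer("filterbank", filterbank)
        self.n_fft, self.hop = n_fft, hop

    def forward(self, waveform):
        # ...and forwards to the functional, like nn.CrossEntropyLoss does.
        return melspectrogram(waveform, self.filterbank, self.n_fft, self.hop)
```

With this split, a future fused C++ implementation could replace the body of melspectrogram() without touching the layer's API.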

Melspectrogram is also an example of a feature we can add right away because it doesn't exist at the moment. On that note, we'll also soon convert our transforms into instances of nn.Module instead of object. This adds functionality to the instances, but doesn't break backwards compatibility because nn.Modules support __call__ as well. This will then be more aligned with torchaudio-contrib, too.

What other tasks should we write up for this first phase of merging torchaudio-contrib into torchaudio?

ksanjeevan commented 5 years ago

I see now that you meant "inherit from nn.Sequential" instead, which makes a lot more sense. I still think that any layer should only inherit from nn.Module

Just a quick clarification: in torchaudio_contrib, Melspectrogram and Spectrogram are not classes but functions that return an nn.Sequential around the appropriate modules, i.e.:

In [4]: Melspectrogram()                                                                                                        
Out[4]: 
Sequential(
  (0): STFT(fft_len=2048, hop_len=512, frame_len=2048)
  (1): ComplexNorm(power=2.0)
  (2): ApplyFilterbank()
)

And it's the STFT, ComplexNorm, etc. that inherit from nn.Module.

So maybe our current design is closer to your original comment:

I'd suggest we move this into an examples folder or make it part of a utilities function. This however falls under category 3). If someone wants to combine transforms I'd expect them to call them in sequence without any container or, if they want to, use nn.Sequential etc. explicitly from the core library.

Since Spectrogram, Melspectrogram probably shouldn't be in layers.py.

cpuhrsch commented 5 years ago

@ksanjeevan - Is there an advantage for them to be functions that return an nn.Sequential as opposed to being layers that separate out the compute into a functional? That is, for Melspectrogram there will be a melspectrogram function that calls into the (possibly also to-be-created) functionals for stft, complexnorm and applyfilterbank?

ksanjeevan commented 5 years ago

Them being functions I think just reflected the fact that this was a useful/popular combination of the modules. As for having the composition of the modules instead of a call to a functional that does the composition: users may want to add additional intermediate layers, such as a time-stretched melspectrogram:

nn.Sequential(STFT(), StretchSpecTime(), ComplexNorm(), ApplyFilterbank())

Having Melspectrogram simply chain the layers means that adding time stretching, for example, is only a slight modification, since all the modules handle the creation and passing of the necessary buffers.

If Melspectrogram is a layer that calls a melspectrogram functional then it will have to create both the window for the stft and the filterbank and pass them? Unless it inherits from Spectrogram... That would circle back to earlier discussions #20 or even #35.

dhpollack commented 5 years ago

On the LC2CL stuff: generally, people from the ML world like to have (channels, length) because it's similar to how image channels are normally presented in PyTorch, and librosa defaults to channels-first for multi-channel audio. However, people from the signal processing world tend to like (length, channels) because that's traditionally how signals are represented and it's often how the audio is laid out in the audio container.

The original version of torchaudio returned (length, channels) because that's the default of SoX, so I kept the option in there for historical reasons.

cpuhrsch commented 5 years ago

@ksanjeevan - for modification purposes it's better, in my opinion, to provide an easily copy-paste-editable example that users can modify for something simple like this.

For example, a class whose forward is:

output = self.STFT(input)
output = self.StretchSpecTime(output)
output = self.ComplexNorm(output)
output = self.ApplyFilterbank(output)

I personally prefer this way of writing operators as a sequence of python functions. It also makes it easy to mix in various other functions with the outputs, whereas nn.Sequential is restricted to Modules.
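A self-contained version of that forward-as-a-sequence-of-calls style might look like the sketch below. The sub-module names are hypothetical stand-ins for the contrib layers; the module simply takes them as constructor arguments.

```python
import torch
import torch.nn as nn

class TimeStretchedMelspectrogram(nn.Module):
    """Copy-paste-editable pipeline: each step is an explicit call."""

    def __init__(self, stft, stretch, norm, filterbank):
        super().__init__()
        # Assigning Modules as attributes registers them as sub-modules.
        self.stft = stft
        self.stretch = stretch
        self.norm = norm
        self.filterbank = filterbank

    def forward(self, waveform):
        # A plain sequence of calls: easy to reorder, delete a step, or
        # interleave arbitrary Python, unlike a fixed nn.Sequential.
        output = self.stft(waveform)
        output = self.stretch(output)
        output = self.norm(output)
        output = self.filterbank(output)
        return output
```

Dropping the time-stretch step is then a one-line edit in forward(), which is exactly the kind of modification an examples folder would invite.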

Now, we might then still want to implement a function for Melspectrogram for performance reasons, if we decide to write a specialized C++ operator or want certain JIT guarantees.

keunwoochoi commented 5 years ago

Particularly on Melspectrogram, one of the reasons for not having a functional for it was so that it doesn't recompute the linear-to-mel matrix every time. This can be addressed by optionally passing in the filterbank.
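The "optionally pass the filterbank" idea could be sketched as below (hypothetical names and signature): the functional builds the linear-to-mel matrix only when the caller does not supply one, so a layer can construct it once and reuse it on every call.

```python
import torch

def melspectrogram(spec, filterbank=None, num_bands=40):
    """Apply a mel filterbank to a power spectrogram of shape (freq, time).

    If `filterbank` is None, a matrix is built on the fly (placeholder here;
    the real one would use the mel-scale formulas). Passing a precomputed
    filterbank avoids rebuilding it on every call.
    """
    if filterbank is None:
        filterbank = torch.rand(spec.shape[-2], num_bands)
    # (num_bands, freq) @ (freq, time) -> (num_bands, time)
    return torch.matmul(filterbank.t(), spec)
```

A Melspectrogram layer would then cache the filterbank as a buffer and pass it in, keeping the functional stateless.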

To keep nn.Module sanity, again, a functional for melspectrograms can be implemented, reusing other functionals like stft. That way we reuse the code, which is nice. For more flexible usage, toward the merits of using Compose or nn.Sequential(), the examples can work well.

On LC2CL (@dhpollack) - I see. For consistency with torch.stft, I'd like the API to expect (channels, time). Under such an API, when the user needs to use some spectrogram, there's no redundant overhead; it's just a matter of when the transpose happens.

cpuhrsch commented 5 years ago

@keunwoochoi - What do you think about my proposal of writing it as a sequence of python statements and provide it as an easily copy-paste-editable example and maybe also a highly efficient functional?

cpuhrsch commented 5 years ago

@dhpollack - Since we're creating building blocks in support of PyTorch users we should then probably standardize on the ML convention and transpose for functions that require (length, channels). We also recently did this for nn.LSTM via the batch_first flag, which simply triggers a transpose, since the efficient cudnn implementations require the second dimension to be the batch.

keunwoochoi commented 5 years ago

@cpuhrsch Sounds good to me overall.

maybe also a highly efficient functional?

Just to clarify, you mean doing the same for the implementation of the melspectrogram functional? Yeah, good. Although, in such a case, wouldn't it be better to have a melspectrogram() functional that does it and let the Melspectrogram() layer simply call melspectrogram()? (melspectrogram might be an exception, though, due to the filter bank generation, but I mean in general.)

cpuhrsch commented 5 years ago

@keunwoochoi - Yes, I think the approach of splitting the compute out into highly efficient functionals and then having the layers call into those functionals should yield a good separation and is in tune with PyTorch's approach to this topic.

Mistobaan commented 5 years ago

Is dropping the sox library still the plan? I was about to open an issue proposing to add a bunch of sox-like functions, as in carlthome/python-audio-effects/pysndfx/dsp.py, and a SoxContextLib so you don't have to manually initialize the library. I personally think having sox in the library is a good Swiss army knife to have.

keunwoochoi commented 5 years ago

I'm not sure what the current plan is. They are definitely useful, but I don't think they're best hosted as part of torchaudio. The installation (= dependency) issue will always be there and adds maintenance cost and potential risks. I can also hardly imagine their operations benefiting from GPUs. Due to these issues, I would avoid using sox as part of a system that requires reliability and efficiency. For a quick and hacky use-case, one can easily plug it into their - maybe - preprocessing stage.

Side note - maybe some of the filters (e.g., EQ) could be re-implemented in torch :)

Mistobaan commented 5 years ago

If we can replace the functionality of sox in pure torch style, then that is definitely the way to go. We can make a list of the sox utilities, rank them, and add a help-wanted tag. Once we have enough of them we could drop support for sox.

vincentqb commented 5 years ago

I agree we should reduce the dependency on sox. Making it optional would be a win. Replacing sox completely (and leveraging GPU where possible) would be even better, but then we need to bring in many of the features. I like the idea of a help-wanted issue :)

faroit commented 5 years ago

@vincentqb We had a long discussion about this in https://github.com/keunwoochoi/torchaudio-contrib/issues/31. I think many people agree that fast audio loading is really important, and for many users of torchaudio the load functionality is probably the only place where they touch the sox lib. Therefore it would make a lot of sense if only load and save were replaced by native torch code (probably interfacing libsndfile, or just reading the wav bits directly, as is done in tensorflow.io).

keunwoochoi commented 5 years ago

Recently we had a chat with Matt Larson from Nvidia, along with some other Spotify folks. I begged them for a low-level implementation of a wav file loader (and even an mp3 decoder). Hmm... @faroit, would you still be able to spend some time on a fast wav loader? What's our plan at the moment?