mthrok commented 4 years ago

tl;dr: how to migrate to new backend/interface in 0.7

If you are using torchaudio in Linux/macOS environments, please use torchaudio.set_audio_backend("sox_io") to adopt to the upcoming changes.
If you are in Windows environment, please set torchaudio.USE_SOUNDFILE_LEGACY_INTERFACE = False and reload backend to use the new interface.
Note that this ships with some bug-fixes for formats other than 16bit signed integer WAV, so you might experience some BC-breaking changes as described in the section below.

News [UPDATE] 2021/03/06

All the migration works have been completed on master branch.

[UPDATE] 2021/02/12

Added bits_per_sample and encoding argument (replaced dtype) to save function.

[UPDATE] 2021/01/29

Added encoding to AudioMetaData

[UPDATE] 2021/01/22

Added format argument to load/info/save function.
bits_per_sample to AudioMetaData

[UPDATE] 2020/10/21

Added Description of "soundfile" backend legacy interface.

[UPDATE] 2020/09/18

Added migration guide for "soundfile" backend.
Moved the phase when "soundfile" backend signatures change from 0.9.0 to 0.8.0 so that they match with "sox_io" backend, which becomes default in 0.8.0.

[UPDATE] 2020/09/17

Added information on deprecation of native libsox structures such as signalinfo_t and encoding_t.

Improving I/O for correct and consistent experience

This is an announcement for users that we are making backward-incompatible changes to I/O functions of torchaudio backends from 0.7.0 release throughout 0.9.0 release.

What is affected?

Public APIs
- torchaudio.load
- [Linux/macOS] By switching the default backend from "sox" backend to "sox_io" backend in 0.8.0, loading audio formats other than 16bit signed integer WAV returns the correct tensor.
- [Linux/macOS/Windows] The signature of "soundfile" backend will be change in 0.8.0 to match that of "sox_io" backend.
- torchaudio.save
- [Linux/macOS] By switching to "sox_io" backend, saving audio files will no longer degrade the data. The supported format will be restricted to the tested formats only. (please refer to the doc for the supported formats.)
- [Linux/macOS/Windows] The signature of "soundfile" backend will be change in 0.8.0 to match that of "sox_io" backend.
- torchaudio.info
- [Linux/macOS/Windows] The signature of "soundfile" backend will be change in 0.8.0 to match that of "sox_io" backend.
- torchaudio.load_wav
- will be removed in 0.9.0. (load function with normalize=False will provide the same functionality)
Internal APIs The following functions/classes of "sox" backend were accidentally exposed and will be removed in 0.9.0. There is no replacement for them. Please use save/load/info functions.
- torchaudio.save_encinfo
- will be removed in 0.9.0
- torchaudio.get_sox_signalinfo_t
- will be removed in 0.9.0
- torchaudio.get_sox_encodinginfo_t
- will be removed in 0.9.0
- torchaudio.get_sox_option_t
- will be removed in 0.9.0
- torchaudio.get_sox_bool
- will be removed in 0.9.0

The signatures of the other backends are not planned to be changed within this overhaul plan.

Classes
- torchaudio.SignalInfo and torchaudio.EncodingInfo
- will be replaced with AudioMetaData in 0.8.0 for "soundfile" backend
- will be removed in 0.9.0

Why

There are currently three backends in torchaudio. (Please refer to the documentation for the detail.)

"sox" backend is the original backend, which binds libsox with pybind11. The functionalities (load / save / info) of this backend are not well-tested and have number of issues. (See https://github.com/pytorch/audio/pull/726).

Fixing these issues in backward-compatible manner is not straightforward. Therefore while we were adding TorchScript-compatible I/O functions, we decided to deprecate this original "sox" backend and replace it with the new backend ("sox_io" backend), which is confirmed not to have those issues.

When we are switching the default backend for Linux/macOS from "sox" to "sox_io" backend, we would like to align the interface of "soundfile" backend, therefore, we introduced the new interface (not a new backend to reduce the number of public API) to "soundfile" backend.

When / What Changes

The following is the timeline for the planned changes;

Phase	Expected Release	Expected Changes
1	0.7.0 (Oct 2020)	`"sox"` backend issues deprecation warning. ~#904~ `"soundfile"` backend issues warning of expected signature change. ~#906~ Add the new interface to `"soubdfile"` backend. ~#922~ `load_wav` function of all backends are marked as deprecated. ~#905~
2	0.8.0 (March 2021)	[BC-Breaking] `"sox_io"` backend becomes default backend. Function signatures of `"soundfile"` backend are aligned with `"sox_io"` backend. ~#978~ `get_sox_XXX` functions issue deprecation warning. ~#975~
3	0.9.0	`"sox"` backend is removed. ~#1311~ The legacy interface of `"soundfile"` backend is removed. ~#1311~ [BC-Breaking] `load_wav` functions are removed from all backends. ~#1362~

Planned signature changes of `"soundfile"` backend in 0.8.0

The following is the planned signature change of "soundfile" backend functions in 0.8.0 release.

`info` function

AudioMetaData implementation can be found here. The placement of the AudioMetaData might be changed.

~0.7.0	0.8.0
```python def info( filepath: str, ) -> Tuple[SignalInfo, EncodingInfo] ```	```python def info( filepath: str, format: Optional[str], ) -> AudioMetaData ```

Migration

The values returned from info function will be changed. Please use the corresponding new attributes.

~0.7.0	0.8.0
```python si, ei = torchaudio.info(filepath) sample_rate = si.rate num_frames = si.length num_channels = si.channels precision = si.precision bits_per_sample = ei.bits_per_sample encoding = ei.encoding ```	```python metadata = torchaudio.info(filepath) sample_rate = metadata.sample_rate num_frames = metadata.num_frames num_channels = metadata.num_channels bits_per_sample = metadata.bits_per_sample encoding = metadata.encoding ```

Note If the attribute you are using is missing, file a Feature Request issue.

`load` function

~0.7.0

0.8.0

```python def load( filepath: str, # out: Optional[Tensor] = None, # To be removed. # Currently not used # Raise AssertionError if given normalization: Optional[bool] = True, # To be renamed to normalize. # Currently only accept True # Raise AssertionError if given channels_first: Optional[bool] = True, num_frames: int = 0, offset: int = 0, # To be renamed to frame_offset # signalinfo: SignalInfo = None, # To be removed # Currently not used # Raise AssertionError if given # encodinginfo: EncodingInfo = None, # To be removed # Currently not used # Raise AssertionError if given filetype: Optional[str] = None # To be removed # Currently not used ) -> Tuple[Tensor, int] ```

```python def load( filepath: str, frame_offset: int = 0, num_frames: int = -1, normalize: bool = True, channels_first: bool = True, format: Optional[str] = None, # only required for file-like object input ) -> Tuple[Tensor, int] ```

Migration

Please change the argument names;

normalization -> normalize
offset -> frame_offst

~0.7.0	0.8.0
```python waveform, sample_rate = torchaudio.load( filepath, normalization=normalization, channels_first=channels_first, num_frames=num_frames, offset=offset, ) ```	```python waveform, sample_rate = torchaudio.load( filepath, frame_offset=frame_offset, num_frames=num_frames, normalize= normalization, channels_first=channels_first, ) ```

`save` function

~0.7.0	0.8.0
```python def save( filepath: str, src: Tensor, sample_rate: int, precision: int = 16, # moved to `bits_per_sample` argument channels_first: bool = True ) ```	```python def save( filepath: str, src: Tensor, sample_rate: int, channels_first: bool = True, compression: Optional[float] = None, # Added only for compatibility. # soundfile does not support compression option # Raises Warning if not None format: Optional[str] = None, encoding: Optoinal[str] = None, bits_per_sample: Optional[int] = None, ) ```

Migration

~0.7.0	0.8.0
```python torchaudio.save( filepath, waveform, sample_rate, channels_first ) ```	```python torchaudio.save( filepath, waveform, sample_rate, channels_first, bits_per_sample=16, ) # You can also designate audio format with `format` and configure the encoding with `compression` and `encoding`. See https://pytorch.org/audio/master/backend.html#save for the detail ```

BC-breaking changes

Read and write operations on the formats other than WAV 16-bit signed integer were affected by small bugs.

snakers4 commented 3 years ago

Fixing these issues in backward-compatible manner is not straightforward. Therefore while we were adding TorchScript-compatible I/O functions, we decided to deprecate this original "sox" backend and replace it with the new backend ("sox_io" backend), which is confirmed not to have those issues.

When we are switching the default backend for Linux/macOS from "sox" to "sox_io" backend, we would like to align the interface of "soundfile" backend, therefore, we introduced the new interface (not a new backend to reduce the number of public API) to "soundfile" backend.

Just a quick question, does it mean that since 0.7 or 0.8 we can include torchaudio.load inside of our jit-traced modules? Are you planning to support only Linux, or will you also have a list of binaries for some other platforms (i.e. mobile, raspberry pi)? With soundfile backend?

mthrok commented 3 years ago

Hi @snakers4

does it mean that since 0.7 or 0.8 we can include torchaudio.load inside of our jit-traced modules?

Yes. Technically, you can do it already with 0.6, however, the corresponding library is not available in any form yet, so you cannot run it outside Python application. I have a prototype C++ app in my branch which depends on refactored torchaudio. The model I used can be found here

I plan to propose this to the team after the release work, but no fixed time frame for landing it yet or even I am not sure if I can land this. This was an exercise to learn how much we can do with TorchScript, and I have found that the I/O-capability is very limited. It can only load audio data from files. I intend to look into other ways to get tensor data (like pass memory objects to TorchScript) but it's not in the top priority in my list.

Are you planning to support only Linux, or will you also have a list of binaries for some other platforms (i.e. mobile, raspberry pi)?

We are considering the possibility to add an I/O module (not another backend but something like torchaudio.io), that works not just on Linux/macOS, but also on Windows. We are thinking to bind a correction of codecs libraries that are cross-platform. Mobile is not necessarily in our scope, because we do not have an infrastructure to test them, or we have not seen a demand for it yet. Hypothetically, if the refactored torchaudio is landed, the build-process will be CMake, so it will be easier for those familiar with CMake, but again, these plans are not finalized. We are trying to figure out a good "research to production" usecase.

With soundfile backend?

The Python "soudfile" package is not TorchScript compatible, so one of the thing we are considering as a part of the I/O module described above is to bind libsnd directly.

snakers4 commented 3 years ago

Nice! This is probably months from becoming actually useful by end users like us, but this increases the value of pytorch ecosystem quite a bit

Btw, currently a vad in torch audio seems to be a port of some energy based algorithm

We are planning to make a public general torch-scriptable noise / voise / music VAD pre-trained on large voice / noise / music corpora

Guess we could collaborate on that

mthrok commented 3 years ago

@snakers4

Nice! This is probably months from becoming actually useful by end users like us,

Ah, that's very optimistic view, although that's what I am aiming for. I am working on a RFC with example usage, so that community can respond. Then we will finalize the interface and will start working on the implementation.

but this increases the value of pytorch ecosystem quite a bit

Thanks, that's a nice reaction to have. One of the things we struggle is to get a signal from the community, so feedback like that is really helpful. (and motivating for me ;) )

Btw, currently a vad in torch audio seems to be a port of some energy based algorithm

The current VAD is basically, the port of sox implementation.

We are planning to make a public general torch-scriptable noise / voise / music VAD pre-trained on large voice / noise / music corpora

Guess we could collaborate on that

That's very interesting. Please keep us updated!

snakers4 commented 3 years ago

One of the things we struggle is to get a signal from the community, so feedback like that is really helpful. (and motivating for me ;) )

the current state of audio is that there are no go-to tools / components, that would work on all platforms there is record.js for browsers, but porting models to js is a pain now (looks like the only decent option is re-implementing from scratch in tf.js, onnx.js has very poor layer support) ofc, you can go low-level and compile everything for each platform, but usually you care about your algorithms working properly in real life first

in real projects you basically need a VAD + STT + some post-processing VAD ideally should be served on edge to improve user experience, whereas STT can be better served via an API (if you use OPUS e.g. traffic is negligible) there is nothing stopping us from making our own VAD in PyTorch, but the actual audio reading part will be outside as well

for edge deployments we still need 2-4x size reduction in model size (which is already achievable) but as I mentioned there still is no easy way to run a pytorch model in a browser

That's very interesting. Please keep us updated!

I will post an update here

tbazin commented 3 years ago

This is great news, this will definitely improve trust and adoption of torchaudio 🙂 !

expectopatronum commented 3 years ago

This might be a stupid question, but should the warning UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail. disappear after setting the backend?

I import torchaudio in the following way:

import torchaudio
torchaudio.set_audio_backend("sox_io")

but still get the above warning.

mthrok commented 3 years ago

Hi @expectopatronum

The warning is issued at the time import torchaudio is executed, where the default backend is set. I get that it's annoying and sorry for the confusion, but I really needed to raise a strong awareness as the sox backend was not handling data correctly.

mthrok commented 3 years ago

@expectopatronum

If you use nightly builds, then the default backend is already changed to the new one, and you won't see the warning.

expectopatronum commented 3 years ago

Hi @expectopatronum

The warning is issued at the time import torchaudio is executed, where the default backend is set. I get that it's annoying and sorry for the confusion, but I really needed to raise a strong awareness as the sox backend was not handling data correctly.

No worries, I just wanted to make sure I am doing it right! Thanks for the quick reply!

faroit commented 3 years ago

@mthrok I have a problem getting int16 saving to work on 0.7.2. What is the recommended procedure for this?

Furthermore, you mentioned above:

Convert the input Tensor to the type that corresponds to the precision you want to save.

just converting a [-1, 1] to(torch.int16) wouldn't create a valid PCM 16bit wav file since it still has to be denormalized. Is this supposed to be done by the user?

mthrok commented 3 years ago

@mthrok I have a problem getting int16 saving to work on 0.7.2. What is the recommended procedure for this?

Furthermore, you mentioned above:

Convert the input Tensor to the type that corresponds to the precision you want to save.

just converting a [-1, 1] to(torch.int16) wouldn't create a valid PCM 16bit wav file since it still has to be denormalized. Is this supposed to be done by the user?

@faroit

Yeah, one needs to denormalize the Tensor, that's what I meant there. I updated the description.

faroit commented 3 years ago

@mthrok

Yeah, one needs to denormalize the Tensor, that's what I meant there. I updated the description.

thanks. Give that by far the most likely use-case for audio-to-audio models are:

16bit PCM audio input -> 32bit float torch model -> 16bit PCM audio output

and I don't think users should write out 32bit float except for they really want to (its twice the file-size). As such, it would be nice if the denormalization is builtin in to make int16 as simple to use as possible.

mthrok commented 3 years ago

@faroit

and I don't think users should write out 32bit float except for they really want to (its twice the file-size). As such, it would be nice if the denormalization is builtin in to make int16 as simple to use as possible.

You are bringing up a very good point. Do you have a suggestion for API change? The followings are the things we want to keep in mind;

Correct (if saving float32, the same information should be recoverable up to the precision)
No subtlety/surprise (conversion that involves potential data loss should be explicit)
Convenient

I think adding a new argument of target dtype and default to int16 is one way.

snakers4 commented 3 years ago

@mthrok

Hi again,

Regarding this discussion

We are planning to make a public general torch-scriptable noise / voice / music VAD pre-trained on large voice / noise / music corpora

Guess we could collaborate on that

That's very interesting. Please keep us updated!

Basically we have released the bare-bones version here:

https://github.com/snakers4/silero-vad

We are planning to add a couple of network "heads", finish the docs and then submit to Torch hub:

Number detector (sometimes especially in enterprise people want to make data anonymous, and personal data is basically name + some numbers)
Spoken language classifier (low hanging fruit)
We can add some other easy heads like music detector (i.e. now we have voice vs noise + music, but we can have music vs voice + noise, music is kind of similar to noise)

Please do not hesitate to provide feedback We were mostly aiming at having small enough networks to be run on 1 core of any CPU, even mobile or IOT devices Turns our for VAD / number detector / language or music classifier you can have very high performance with quite tiny networks

faroit commented 3 years ago

You are bringing up a very good point. Do you have a suggestion for API change? I think adding a new argument of target dtype and default to int16 is one way.

I would be in favor of defaulting (and converting) to 16bit PCM except for when users would set a different dtype.

Correct (if saving float32, the same information should be recoverable up to the precision) No subtlety/surprise (conversion that involves potential data loss should be explicit) Convenient

if someone wants a "correct" - and recoverable - output, torch.save exist and is convenient to use also for audio tensors. Which is why i think torchaudio.save should be more closer to what is used in the audio domain.

aturahc13 commented 3 years ago

I have a question about migrating to 'sox_io' backend from 'sox'. I used >>> torchaudio.set_audio_backend("sox_io") after $ python3. It shows no error. However, it seems that the backend is not to be changed. For example, after exit() and doing python3 again, the warning message(The default backend will be changed to "sox_io" backend in 0.8.0) still comes out. How can I migrate correctly? Thank you.

mthrok commented 3 years ago

Hi @aturahc13

That's the correct way to set the backend for the active session. For example, help(torchaudio.load) should display the different help message before and after the set_audio_backend call. The thing is that we did not ship a way to persist the configuration, so the next time the Python is launched, it goes back to the default backend.

aturahc13 commented 3 years ago

The thing is that we did not ship a way to persist the configuration, so the next time the Python is launched, it goes back to the default backend.

Thank you. So should I do set_audio_backend every time when I use $ python3 ? Or should I wait for the update? Actually, all I want to do is "not showing the warning message for 'sox will be deprecated...'". I know there is a way to hidden warning messages themselves. But if there is a way to migrate the backend by hand, I will try it and asked this question. Thank you.

f0k commented 3 years ago

Sorry I'm late to the party, I'm not using torchaudio yet, but interested in using it, and came here because of the backend deprecation warning.

normalize: bool = True,

As a non-user, I would expect that this normalizes the waveform based on the maximum amplitude value. I would also be unsurprised if it actually just converts from integers to floats. Reading the current docstring for sox_io, it says that sample values are always normalized to [-1.0, 1.0]. It is ambiguous whether it normalizes based on the maximum amplitude found in the input, or based on the data type. If the latter, what about changing the parameter name to as_float, or floatify? This would also make clear why it only makes a difference for integer wave files. Alternatively, for more flexibility, it could take a dtype parameter which defaults to float32, and scales whenever converting from integers. dtype=None would return the original dtype. Of course this would mean extra work to support conversion from int8 to int16 and the like.

waveform = 128 (waveform < 0) + 127 * (waveform > 0)

I'm surprised -- is this the correct way to do it, using a different factor for the positive and negative part? All the code I've seen for converting from integers to floats just uses a single scaling factor for positive and negative parts (2**bits - 1), so the opposite direction should also use a single factor to not distort the audio.

mthrok commented 3 years ago

(note: I updated the save un-normalization code snippet based on the suggestion.)

Hi @f0k

Thanks for the comment. Those are very good points.

Let me first tell you the context. The design principle for the new I/O modules are

Correct Since I/O is the first part of data processing, these I/O modules must be returning the data as accurate (close to that in file format) as possible.
Easy to use We want our library to be easy to use. In this case, since it's common practice for DL application to work on floating point values within the range of[-1.0, 1.0].
Predictable/reversible behavior Since we want the library to be a good building block of research/real world application, we want our features to be well mannered.

For the normalization, it is because of the principle 2 and 3 that we return the normalized value by default, and the normalization is performed on fixed coefficients. (Determined by dtypes) If we normalize the resulting tensor with the value found in the tensor, users will have questions like "what was the normalization coefficient being used?", which they might never get an answer. Also it is because of the principle 1 we want to provide the option to return the uncompressed data without normalized. This design is influenced by spicy.io.wavfile.read function. If someone is working on non-DL application and wants to decode some audio data in the format other Python libraries do not support, they can use torchaudio as PyTorch provides zero-overhead conversion from Tensor to NumPy NDArray type. Now, for the parameter name "normalization", I get that it's confusing. (There were other users who had the same confusion.) This is kind of historical. The previous backend had similar argument and when I started workin on this module, we did not intend to introduce the BC-braking change. As of your suggestion of as_float or floatify, I think there is still an ambiguity, as for the range value of the resulting Tensor. It is more explicit about the data type, but none of them are perfect, so I am in favor of keeping it as-is. However I think the documentation should be updated so that normalization is based on data type. For the dtype argument, it would be nice to do but that's also something users can do easily. And since we expect floating type with [-1.0, 1.0] value range throughout the library (except kaldi module that was introduced without design review, which we plan to address), and the use of integer type is reserved for user-specific case, so I think the use-case is under defined from our perspective.

About the un-normalization process. I looked into some detail and now I think you are right. Let me give you why I suggested the formula. When I started writing the new loading function in C++, I wondered how I know my code is doing the right thing the resulting Tensor has right values. I ended up with this. Internally, libsox represents 32 bit signed integer so normalization was needed. At the time I did not know how libsox internally do the conversion, so I set up the test and change the normalization strategy until I found an acceptable one. (That is, values are close to what sox command generates, and there should be no overflow) I ended up with this normalization, which is the reverse of what you pointed out. This achieved about 4e-05 (or 3e-03 for mp3) closeness, which was the best.

Now, I understand the code base of libsox better and I digged into it to find how libsox does it and found the following. As you say it does normalization with single value and apply clipping.

https://github.com/dmkrepo/libsox/blob/b9dd1a86e71bbd62221904e3e59dfaa9e5e72046/src/sox.h#L994

I think I can update the implementation to do the same and that should yield the result even closer to sox.

For the saving part, as @faroit suggested above, I am thinking to include un-normalization inside of the save function and default to 16-bit signed integer. So that users are not bothered for un-normalization and to cover the most of real world use case with default.

mthrok commented 3 years ago

The thing is that we did not ship a way to persist the configuration, so the next time the Python is launched, it goes back to the default backend.

Thank you. So should I do set_audio_backend every time when I use $ python3 ? Or should I wait for the update? Actually, all I want to do is "not showing the warning message for 'sox will be deprecated...'". I know there is a way to hidden warning messages themselves. But if there is a way to migrate the backend by hand, I will try it and asked this question. Thank you.

Hi @aturahc13

In the next release, (expected early March) the default backend will be switched to "sox_io", so you will not need to do anything once you update to it. Until then sorry but you need to do set_audio_backend all the time.

mthrok commented 3 years ago

@mthrok

Hi again,

Regarding this discussion

We are planning to make a public general torch-scriptable noise / voice / music VAD pre-trained on large voice / noise / music corpora

Guess we could collaborate on that

That's very interesting. Please keep us updated!

Basically we have released the bare-bones version here:

https://github.com/snakers4/silero-vad

We are planning to add a couple of network "heads", finish the docs and then submit to Torch hub:

Number detector (sometimes especially in enterprise people want to make data anonymous, and personal data is basically name + some numbers)

Spoken language classifier (low hanging fruit)

We can add some other easy heads like music detector (i.e. now we have voice vs noise + music, but we can have music vs voice + noise, music is kind of similar to noise)

Please do not hesitate to provide feedback We were mostly aiming at having small enough networks to be run on 1 core of any CPU, even mobile or IOT devices Turns our for VAD / number detector / language or music classifier you can have very high performance with quite tiny networks

Hi @snakers4

Sorry for the late reply and thanks for the update. This is very cool. I have questions on mobile I/O situation. How are you feeding the audio in your use case? Did you work on real-time application?

snakers4 commented 3 years ago

Reading the current docstring for sox_io, it says that sample values are always normalized to [-1.0, 1.0]. It is ambiguous whether it normalizes based on the maximum amplitude found in the input, or based on the data type. If the latter, what about changing the parameter name to as_float, or floatify? This would also make clear why it only makes a difference for integer wave files. Alternatively, for more flexibility, it could take a dtype parameter which defaults to float32, and scales whenever converting from integers. dtype=None would return the original dtype. Of course this would mean extra work to support conversion from int8 to int16 and the like.

When we started writing our audio pipelines we essentially used just scipy wavread and pysoundfile reading just integers to avoid any bias inside of audio libraries. And there were some gotchas and insights that may seem relevant in this context:

While in general for DL it may hardly matter if you normalize to say 1 / abs(max(wav)) or to 1 / (2 ** 15 - 1) (NNs can even work better because of such "errors"), it will certainly matter for edge cases, like whispers or just audios with loud noise, high dynamic range, rapid change in volume, etc
The thing that you may want to avoid - is making your pre-processing work against you in these edge cases (or in different IO settings with different libraries);
So I suppose that the optimal strategy may be just to stick to reading audio in [-1, 1] (preferably just manually doing the normalization part), and then perform some form of STFT and then apply some dynamic normalization, so the "brightness" of loud and quiet parts of audio does not differ 10x or 100x. This may help disentangle the actual IO part from the logic part;

snakers4 commented 3 years ago

Hi @snakers4 Sorry for the late reply and thanks for the update. This is very cool. I have questions on mobile I/O situation. How are you feeding the audio in your use case? Did you work on real-time application?

@mthrok Hi,

Since that comment we basically released a more or less a final version. Despite the description the VAD itself (there are multiple heads there) may work fine with related languages (Slavic, Romance, Germanic).

The VAD itself can work with whole files and in a real-time / streaming application. Other heads (number detector, language classifier) work only with whole "files" (which are essentially just [-1, 1] normalized streams of floats), but they are meant to be used downstream after the VAD.

Our VAD is a neural network in PyTorch (JIT / ONNX), and obviously it benefits from batching. This may be a bit complex with streaming especially if you try streaming N streams at the same time. So we provided a few explanations and simple tools to help people integrate our VAD in their applications:

A colab with examples - single file, single stream, multiple streams - https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb - utils are just seamlessly loaded via torch hub, but they are simple enough to be reimplemented in other languages
Explanation how VAD works - https://github.com/snakers4/silero-vad#how-vad-works
VAD knobs - https://github.com/snakers4/silero-vad#vad-parameter-fine-tuning
Quality when we vary the chunk length - https://github.com/snakers4/silero-vad/issues/2#issuecomment-750840434

snakers4 commented 3 years ago

Also I hope that our torch.hub submission of the VAD gets approved soon!

Also also I understand that we may be kind of leaking in our validation metrics and that our validation approach may be too drastic (in real speech, web RTC usually has issues on speech start / end and it is difficult to tune) ... but just for lulz we applied our VAD to NASA's recordings of the Apollo program, and it worked. Web RTC did not really work there.

mthrok commented 3 years ago

@faroit

I made a plan for adding dtype in save function. https://github.com/pytorch/audio/issues/1197 I appreciate it if you can take a look at it.

mthrok commented 3 years ago

Hi @snakers4

When we started writing our audio pipelines we essentially used just scipy wavread and pysoundfile reading just integers to avoid any bias inside of audio libraries. And there were some gotchas and insights that may seem relevant in this context:

While in general for DL it may hardly matter if you normalize to say 1 / abs(max(wav)) or to 1 / (2 ** 15 - 1) (NNs can even work better because of such "errors"), it will certainly matter for edge cases, like whispers or just audios with loud noise, high dynamic range, rapid change in volume, etc

The thing that you may want to avoid - is making your pre-processing work against you in these edge cases (or in different IO settings with different libraries);

So I suppose that the optimal strategy may be just to stick to reading audio in [-1, 1] (preferably just manually doing the normalization part), and then perform some form of STFT and then apply some dynamic normalization, so the "brightness" of loud and quiet parts of audio does not differ 10x or 100x. This may help disentangle the actual IO part from the logic part;

I believe I share the similar view with you. The only concern is that the normalization should be following the standard approach so that model works with other audio sources and libraries, even though, as you pointed out, the impact would not be huge.

Also I hope that our torch.hub submission of the VAD gets approved soon!

I believe your request is approved. The team has invited us to work on these approval process so in the future hopefully your experience will be smooth.

Also also I understand that we may be kind of leaking in our validation metrics and that our validation approach may be too drastic (in real speech, web RTC usually has issues on speech start / end and it is difficult to tune) ... but just for lulz we applied our VAD to NASA's recordings of the Apollo program, and it worked. Web RTC did not really work there.

That's super cool. The recordings are very noisy, right? Did you use noisy train samples as well?

snakers4 commented 3 years ago

That's super cool. The recordings are very noisy, right? Did you use noisy train samples as well?

We used a lot of noise (we collected a proprietary database from a number of sources) when training our models as augmentation Nasa samples are noisy and low SNR We had to tweak 2 probabilities for Nasa though, but they are documented and their meaning is obvious

mthrok commented 3 years ago

Hi @faroit

Regarding the save function. I added bits_per_sample and encoding option in #1226. Unfortunately, I could not make it default to 16-bit for the feat of BC breaking behavior, but with the new parameters, you can just do encoding="PCM_S", bits_per_sample=16 to save tensor data to 16-bit signed integer PCM. You do not need to perform conversion by yourself. Let me know what you think.

mthrok commented 3 years ago

Hi @f0k

Turned out that libsox had the capability of converting numerical types (among float, uint8, int16, int32 etc) so now the save function can handle Tensors of dtype float, uint8, int16, int32 natively. You can do torchaudio.save(path, tensor, format="wav", encoding="PCM_S"|"PCM_U"|"PCM_F", bits_per_sample=8|16|32) without manually converting the Tensor.

faroit commented 3 years ago

Regarding the save function. I added bits_per_sample and encoding option in #1226. Unfortunately, I could not make it default to 16-bit for the feat of BC breaking behavior, but with the new parameters, you can just do encoding="PCM_S", bits_per_sample=16 to save tensor data to 16-bit signed integer PCM. You do not need to perform conversion by yourself. Let me know what you think.

@mthrok sounds good. What happens if you specify non-standard combinations such as encoding="PCM_F", bits_per_sample=8?

mthrok commented 3 years ago

@mthrok sounds good. What happens if you specify non-standard combinations such as encoding="PCM_F", bits_per_sample=8?

@faroit If the combination is allowed, it will be succeed, but if the combination is not supported, then it will cause an error. This was typical hard error vs fallback design decision, and we decided to start from hard error. If users find this behavior too inconvenient, then we can change that.

ketanhdoshi commented 3 years ago

With torchaudio.load() in v0.8, the sox_io backend does not support 24-bit signed PCM audio files. Right now the only workaround is to switch back to the sox backend using torchaudio.set_audio_backend("sox").

Is 24-bit signed going to be supported in 0.9 before removing sox? Thanks!

It is not possible to convert the dataset I'm using to 16-bit or 32-bit.

mthrok commented 3 years ago

With torchaudio.load() in v0.8, the sox_io backend does not support 24-bit signed PCM audio files. Right now the only workaround is to switch back to the sox backend using torchaudio.set_audio_backend("sox").

Is 24-bit signed going to be supported in 0.9 before removing sox? Thanks!

It is not possible to convert the dataset I'm using to 16-bit or 32-bit.

Hi @ketanhdoshi

Thanks for the report. If it's causing you the trouble, we will definitely support it. Since PyTorch does not have 24-bit int type. I need to think of a behavior when normalize=False. In your use case, are you loading data in float32 type? Also if you can tell us a command to generate the same type you are dealing with (with tools like ffmpeg or sox), that will be helpful.

ketanhdoshi commented 3 years ago

With torchaudio.load() in v0.8, the sox_io backend does not support 24-bit signed PCM audio files. Right now the only workaround is to switch back to the sox backend using torchaudio.set_audio_backend("sox"). Is 24-bit signed going to be supported in 0.9 before removing sox? Thanks! It is not possible to convert the dataset I'm using to 16-bit or 32-bit.

Hi @ketanhdoshi

Thanks for the report. If it's causing you the trouble, we will definitely support it. Since PyTorch does not have 24-bit int type. I need to think of a behavior when normalize=False. In your use case, are you loading data in float32 type? Also if you can tell us a command to generate the same type you are dealing with (with tools like ffmpeg or sox), that will be helpful.

Thanks @mthrok. Yes, data is being loaded as float32. Here's an example of a dataset that has many sound files that I'm using that are in 24-bit signed format.

aelimame commented 3 years ago

With torchaudio.load() in v0.8, the sox_io backend does not support 24-bit signed PCM audio files. Right now the only workaround is to switch back to the sox backend using torchaudio.set_audio_backend("sox"). Is 24-bit signed going to be supported in 0.9 before removing sox? Thanks! It is not possible to convert the dataset I'm using to 16-bit or 32-bit.

Hi @ketanhdoshi Thanks for the report. If it's causing you the trouble, we will definitely support it. Since PyTorch does not have 24-bit int type. I need to think of a behavior when normalize=False. In your use case, are you loading data in float32 type? Also if you can tell us a command to generate the same type you are dealing with (with tools like ffmpeg or sox), that will be helpful.

Thanks @mthrok. Yes, data is being loaded as float32. Here's an example of a dataset that has many sound files that I'm using that are in 24-bit signed format.

I'm running into the same issue. I'm loading some 24bit audio files and sox_io fails to load them. I can use sox backend for now but would appreciate if 24bit format can be supported too in sox_io.

A good way to handle the normalize=False is to make it unsupported for this specific format given most of the time people would use normalize=True (at least that's what I do almost always). Another idea would be to convert the 24bit format automatically/internally to 32bit even if normalize=False.

Thanks

aelimame commented 3 years ago

@ketanhdoshi 24-bit support seems to have been added a couple days ago to the master branch https://github.com/pytorch/audio/pull/1389 I tested it (Nightly build) and seems to work for me!

mthrok commented 3 years ago

@aelimame @ketanhdoshi Sorry I forgot to let you know but we added 24-bit support.

It's nice to learn that it is working for you @aelimame. @ketanhdoshi , please try the nightly build and see if it works. If not let us know.

mthrok commented 3 years ago

FYI: @ketanhdoshi @aelimame 24-bit support has been ported to release 0.8.1.

mthrok commented 3 years ago

Closing the issue as 0.9 is released which concludes the migration. Thank you for all the people who gave feedback.

pytorch / audio

[Announcement] Improving I/O for correct and consistent experience #903

Improving I/O for correct and consistent experience

What is affected?

Why

When / What Changes

Planned signature changes of `"soundfile"` backend in 0.8.0

`info` function

Migration

`load` function

Migration

`save` function

Migration

pytorch / audio

[Announcement] Improving I/O for correct and consistent experience #903

Improving I/O for correct and consistent experience

What is affected?

Why

When / What Changes

Planned signature changes of "soundfile" backend in 0.8.0

info function

Migration

load function

Migration

save function

Migration

Planned signature changes of `"soundfile"` backend in 0.8.0

`info` function

`load` function

`save` function