mthrok closed this issue 3 years ago.
Fixing these issues in a backward-compatible manner is not straightforward. Therefore, while we were adding TorchScript-compatible I/O functions, we decided to deprecate this original "sox" backend and replace it with the new backend (the "sox_io" backend), which is confirmed not to have those issues.

When we switch the default backend for Linux/macOS from "sox" to "sox_io", we would like to align the interface of the "soundfile" backend; therefore, we introduced the new interface (not a new backend, to reduce the number of public APIs) to the "soundfile" backend.
Just a quick question: does it mean that since 0.7 or 0.8 we can include `torchaudio.load` inside of our jit-traced modules? Are you planning to support only Linux, or will you also have a list of binaries for some other platforms (i.e. mobile, Raspberry Pi)? With the `soundfile` backend?
Hi @snakers4
> does it mean that since 0.7 or 0.8 we can include `torchaudio.load` inside of our jit-traced modules?
Yes. Technically, you can do it already with 0.6; however, the corresponding library is not available in any form yet, so you cannot run it outside of a Python application. I have a prototype C++ app in my branch which depends on refactored torchaudio. The model I used can be found here

I plan to propose this to the team after the release work, but there is no fixed time frame for landing it yet, and I am not even sure if I can land it. This was an exercise to learn how much we can do with TorchScript, and I have found that the I/O capability is very limited. It can only load audio data from files. I intend to look into other ways to get tensor data (like passing memory objects to TorchScript), but it's not a top priority on my list.
> Are you planning to support only Linux, or will you also have a list of binaries for some other platforms (i.e. mobile, raspberry pi)?
We are considering the possibility of adding an I/O module (not another backend, but something like `torchaudio.io`) that works not just on Linux/macOS but also on Windows. We are thinking of binding a collection of codec libraries that are cross-platform. Mobile is not necessarily in our scope, because we do not have the infrastructure to test it, and we have not seen demand for it yet. Hypothetically, if the refactored torchaudio lands, the build process will be CMake-based, so it will be easier for those familiar with CMake, but again, these plans are not finalized. We are trying to figure out a good "research to production" use case.
> With the `soundfile` backend?
The Python "soundfile" package is not TorchScript-compatible, so one of the things we are considering as a part of the I/O module described above is to bind `libsndfile` directly.
Nice! This is probably months from becoming actually useful to end users like us, but it increases the value of the PyTorch ecosystem quite a bit.
Btw, currently the VAD in torchaudio seems to be a port of some energy-based algorithm.

We are planning to make public a general torch-scriptable noise / voice / music VAD pre-trained on large voice / noise / music corpora.

Guess we could collaborate on that.
@snakers4

> Nice! This is probably months from becoming actually useful by end users like us,

Ah, that's a very optimistic view, although that's what I am aiming for. I am working on an RFC with example usage so that the community can respond. Then we will finalize the interface and start working on the implementation.

> but this increases the value of pytorch ecosystem quite a bit

Thanks, that's a nice reaction to have. One of the things we struggle with is getting a signal from the community, so feedback like that is really helpful (and motivating for me ;) ).

> Btw, currently a vad in torch audio seems to be a port of some energy based algorithm

The current VAD is basically a port of the sox implementation.

> We are planning to make a public general torch-scriptable noise / voice / music VAD pre-trained on large voice / noise / music corpora
>
> Guess we could collaborate on that

That's very interesting. Please keep us updated!
> One of the things we struggle with is getting a signal from the community, so feedback like that is really helpful (and motivating for me ;) )

The current state of audio is that there are no go-to tools / components that work on all platforms. There is record.js for browsers, but porting models to JS is a pain right now (it looks like the only decent option is re-implementing from scratch in tf.js; onnx.js has very poor layer support). Of course, you can go low-level and compile everything for each platform, but usually you care about your algorithms working properly in real life first.

In real projects you basically need a VAD + STT + some post-processing. The VAD ideally should be served on the edge to improve user experience, whereas STT can be better served via an API (if you use OPUS, for example, traffic is negligible). There is nothing stopping us from making our own VAD in PyTorch, but the actual audio-reading part will be outside as well.

For edge deployments we still need a 2-4x reduction in model size (which is already achievable), but as I mentioned, there is still no easy way to run a PyTorch model in a browser.
> That's very interesting. Please keep us updated!
I will post an update here
This is great news, this will definitely improve trust and adoption of torchaudio 🙂 !
This might be a stupid question, but should the warning

> UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.

disappear after setting the backend? I import torchaudio in the following way:

```python
import torchaudio
torchaudio.set_audio_backend("sox_io")
```

but still get the above warning.
Hi @expectopatronum

The warning is issued at the time `import torchaudio` is executed, which is when the default backend is set. I get that it's annoying, and sorry for the confusion, but I really needed to raise strong awareness, as the sox backend was not handling data correctly.

If you use nightly builds, the default backend is already changed to the new one, and you won't see the warning.
> The warning is issued at the time `import torchaudio` is executed, which is when the default backend is set. I get that it's annoying and sorry for the confusion, but I really needed to raise a strong awareness as the sox backend was not handling data correctly.
No worries, I just wanted to make sure I am doing it right! Thanks for the quick reply!
@mthrok I have a problem getting int16 saving to work on 0.7.2. What is the recommended procedure for this?

Furthermore, you mentioned above:

> Convert the input Tensor to the type that corresponds to the precision you want to save.

Just converting a `[-1, 1]` tensor with `.to(torch.int16)` wouldn't create a valid 16-bit PCM WAV file, since it still has to be denormalized. Is this supposed to be done by the user?
> Just converting a `[-1, 1]` tensor with `.to(torch.int16)` wouldn't create a valid 16-bit PCM WAV file since it still has to be denormalized. Is this supposed to be done by the user?
@faroit
Yeah, one needs to denormalize the Tensor, that's what I meant there. I updated the description.
@mthrok

> Yeah, one needs to denormalize the Tensor, that's what I meant there. I updated the description.
Thanks. Given that by far the most likely use case for audio-to-audio models is

16-bit PCM audio input -> 32-bit float torch model -> 16-bit PCM audio output

I don't think users should write out 32-bit float unless they really want to (it's twice the file size). As such, it would be nice if the denormalization were built in, to make int16 as simple to use as possible.
@faroit

> I don't think users should write out 32-bit float unless they really want to (it's twice the file size). As such, it would be nice if the denormalization were built in, to make int16 as simple to use as possible.

You are bringing up a very good point. Do you have a suggestion for an API change? The following are the things we want to keep in mind:

- Correct (if saving float32, the same information should be recoverable up to the precision)
- No subtlety/surprise (conversion that involves potential data loss should be explicit)
- Convenient

I think adding a new argument for the target dtype, defaulting to `int16`, is one way.
@mthrok Hi again,

Regarding this discussion:

> We are planning to make a public general torch-scriptable noise / voice / music VAD pre-trained on large voice / noise / music corpora
>
> Guess we could collaborate on that

> That's very interesting. Please keep us updated!

Basically we have released the bare-bones version here:

We are planning to add a couple of network "heads", finish the docs, and then submit to Torch Hub:

Please do not hesitate to provide feedback. We were mostly aiming at networks small enough to run on one core of any CPU, even mobile or IoT devices. It turns out that for a VAD / number detector / language or music classifier you can achieve very high performance with quite tiny networks.
> You are bringing up a very good point. Do you have a suggestion for an API change? The following are the things we want to keep in mind:
>
> - Correct (if saving float32, the same information should be recoverable up to the precision)
> - No subtlety/surprise (conversion that involves potential data loss should be explicit)
> - Convenient
>
> I think adding a new argument of target dtype and default to int16 is one way.

I would be in favor of defaulting (and converting) to 16-bit PCM, except when users set a different dtype.

If someone wants a "correct" and recoverable output, `torch.save` exists and is convenient to use for audio tensors too, which is why I think `torchaudio.save` should be closer to what is used in the audio domain.
I have a question about migrating to the 'sox_io' backend from 'sox'.

I ran `torchaudio.set_audio_backend("sox_io")` after starting `python3`, and it shows no error. However, it seems that the backend is not changed: after `exit()` and running `python3` again, the warning message (The default backend will be changed to "sox_io" backend in 0.8.0) still comes out.

How can I migrate correctly? Thank you.
Hi @aturahc13

That's the correct way to set the backend for the active session. For example, `help(torchaudio.load)` should display a different help message before and after the `set_audio_backend` call. The thing is that we did not ship a way to persist the configuration, so the next time Python is launched, it goes back to the default backend.
> The thing is that we did not ship a way to persist the configuration, so the next time Python is launched, it goes back to the default backend.

Thank you. So should I call `set_audio_backend` every time I start `python3`? Or should I wait for the update?

Actually, all I want to do is stop the "'sox' will be deprecated..." warning message from showing. I know there is a way to hide the warning messages themselves, but if there is a way to migrate the backend by hand, I wanted to try it, which is why I asked this question. Thank you.
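For reference, one narrow way to hide just that one warning (a sketch using only the standard library; the message text is copied from the warning quoted earlier in the thread) is to install a filter before the import, since the warning fires at import time:

```python
import warnings

# The deprecation warning is emitted when `import torchaudio` runs,
# so the filter must be installed before the import, not after.
warnings.filterwarnings(
    "ignore",
    message='"sox" backend is being deprecated.*',
    category=UserWarning,
)

# import torchaudio                        # would now import without that warning
# torchaudio.set_audio_backend("sox_io")   # still needed once per session
```

The `message` argument is a regular expression matched against the start of the warning text, so unrelated warnings are still shown.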
Sorry I'm late to the party. I'm not using `torchaudio` yet, but I'm interested in using it, and I came here because of the backend deprecation warning.
> `normalize: bool = True,`

As a non-user, I would expect that this normalizes the waveform based on the maximum amplitude value. I would also be unsurprised if it actually just converted from integers to floats.
Reading the current docstring for `sox_io`, it says that sample values are always normalized to `[-1.0, 1.0]`. It is ambiguous whether it normalizes based on the maximum amplitude found in the input or based on the data type.

If the latter, what about changing the parameter name to `as_float` or `floatify`? This would also make clear why it only makes a difference for integer wave files.

Alternatively, for more flexibility, it could take a `dtype` parameter which defaults to `float32` and scales whenever converting from integers; `dtype=None` would return the original dtype. Of course, this would mean extra work to support conversions from int8 to int16 and the like.
> `waveform = 128 * (waveform < 0) + 127 * (waveform > 0)`

I'm surprised -- is this the correct way to do it, using a different factor for the positive and negative parts? All the code I've seen for converting from integers to floats just uses a single scaling factor for both (`2**(bits - 1)`), so the opposite direction should also use a single factor, so as not to distort the audio.
(note: I updated the save un-normalization code snippet based on the suggestion.)
Hi @f0k

Thanks for the comment. Those are very good points.

Let me first give you the context. The design principles for the new I/O modules are as follows:
For the normalization: it is because of principles 2 and 3 that we return the normalized value by default, and the normalization is performed with fixed coefficients (determined by the dtype). If we normalized the resulting tensor by a value found in the tensor itself, users would have questions like "what normalization coefficient was used?", which they might never get an answer to. Also, it is because of principle 1 that we want to provide the option to return the uncompressed data without normalization. This design is influenced by the `scipy.io.wavfile.read` function. If someone is working on a non-DL application and wants to decode audio data in a format other Python libraries do not support, they can use `torchaudio`, as PyTorch provides zero-overhead conversion from a Tensor to a NumPy ndarray.
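That zero-overhead conversion is a one-liner; for a CPU tensor, `Tensor.numpy()` shares memory with the tensor rather than copying:

```python
import torch

# Tensor.numpy() returns a NumPy view over the same memory: no copy is
# made, so handing decoded audio to NumPy-based tooling is free.
waveform = torch.tensor([0.0, 0.25, -0.5])
array = waveform.numpy()

array[0] = 1.0             # writes through to the tensor as well
print(waveform[0].item())  # prints 1.0
```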
Now, for the parameter name `normalization`: I get that it's confusing (there were other users who had the same confusion). This is somewhat historical: the previous backend had a similar argument, and when I started working on this module, we did not intend to introduce a BC-breaking change.

As for your suggestion of `as_float` or `floatify`: I think there is still ambiguity about the value range of the resulting Tensor. Those names are more explicit about the data type, but none of them is perfect, so I am in favor of keeping it as is. However, I think the documentation should be updated to state that normalization is based on the data type.
For the `dtype` argument: it would be nice to have, but it's also something users can do easily themselves. Since we expect a floating type with a [-1.0, 1.0] value range throughout the library (except the kaldi module, which was introduced without a design review and which we plan to address), and the use of integer types is reserved for user-specific cases, the use case is under-defined from our perspective.
About the un-normalization process: I looked into some details and now I think you are right. Let me explain why I suggested that formula. When I started writing the new loading function in C++, I wondered how I would know my code was doing the right thing, i.e. that the resulting Tensor had the right values. Internally, `libsox` represents samples as 32-bit signed integers, so normalization was needed. At the time I did not know how `libsox` does the conversion internally, so I set up a test and changed the normalization strategy until I found an acceptable one (that is, values close to what the `sox` command generates, with no overflow). I ended up with this normalization, which is the reverse of what you pointed out. This achieved about `4e-05` (or `3e-03` for mp3) closeness, which was the best.
Now I understand the codebase of `libsox` better, and I dug into it to find how `libsox` does it. As you say, it normalizes with a single value and applies clipping:

https://github.com/dmkrepo/libsox/blob/b9dd1a86e71bbd62221904e3e59dfaa9e5e72046/src/sox.h#L994

I think I can update the implementation to do the same, which should yield results even closer to `sox`.
For the saving part, as @faroit suggested above, I am thinking of including un-normalization inside the save function and defaulting to 16-bit signed integer, so that users are not bothered with un-normalization and the default covers most real-world use cases.
> Thank you. So should I do `set_audio_backend` every time when I use `python3`? Or should I wait for the update? Actually, all I want to do is stop the "'sox' will be deprecated..." warning message from showing. I know there is a way to hide warning messages themselves. But if there is a way to migrate the backend by hand, I will try it, which is why I asked this question. Thank you.
Hi @aturahc13

In the next release (expected early March), the default backend will be switched to `"sox_io"`, so you will not need to do anything once you update to it. Until then, sorry, but you need to call `set_audio_backend` every time.
> @mthrok Hi again,
>
> Regarding this discussion:
>
> Basically we have released the bare-bones version here:
>
> We are planning to add a couple of network "heads", finish the docs, and then submit to Torch Hub:
>
> - Number detector (sometimes, especially in enterprise, people want to anonymize data, and personal data is basically a name + some numbers)
> - Spoken language classifier (low-hanging fruit)
> - We can add some other easy heads, like a music detector (i.e. now we have voice vs noise + music, but we could have music vs voice + noise; music is kind of similar to noise)
>
> Please do not hesitate to provide feedback. We were mostly aiming at networks small enough to run on one core of any CPU, even mobile or IoT devices. It turns out that for a VAD / number detector / language or music classifier you can achieve very high performance with quite tiny networks.
Hi @snakers4

Sorry for the late reply, and thanks for the update. This is very cool. I have questions about the mobile I/O situation. How are you feeding the audio in your use case? Did you work on a real-time application?
> Reading the current docstring for sox_io, it says that sample values are always normalized to [-1.0, 1.0]. It is ambiguous whether it normalizes based on the maximum amplitude found in the input, or based on the data type. If the latter, what about changing the parameter name to as_float, or floatify? This would also make clear why it only makes a difference for integer wave files. Alternatively, for more flexibility, it could take a dtype parameter which defaults to float32, and scales whenever converting from integers. dtype=None would return the original dtype. Of course this would mean extra work to support conversion from int8 to int16 and the like.

When we started writing our audio pipelines, we essentially used just scipy wavread and pysoundfile, reading plain integers, to avoid any bias inside the audio libraries. There were some gotchas and insights that may be relevant in this context:

- While in general for DL it may hardly matter whether you normalize to, say, `1 / abs(max(wav))` or to `1 / (2 ** 15 - 1)` (NNs can even work better because of such "errors"), it will certainly matter for edge cases, like whispers or audio with loud noise, high dynamic range, rapid changes in volume, etc.

> Hi @snakers4 Sorry for the late reply and thanks for the update. This is very cool. I have questions on the mobile I/O situation. How are you feeding the audio in your use case? Did you work on a real-time application?
@mthrok Hi,

Since that comment we have basically released a more-or-less final version. Despite the description, the VAD itself (there are multiple heads there) may work fine with related languages (Slavic, Romance, Germanic).

The VAD itself can work with whole files and in a real-time / streaming application. The other heads (number detector, language classifier) work only with whole "files" (which are essentially just [-1, 1]-normalized streams of floats), but they are meant to be used downstream after the VAD.

Our VAD is a neural network in PyTorch (JIT / ONNX), and obviously it benefits from batching. This may be a bit complex with streaming, especially if you try streaming N streams at the same time, so we provided a few explanations and simple tools to help people integrate our VAD into their applications:

Also, I hope that our `torch.hub` submission of the VAD gets approved soon!

Also, I understand that we may be kind of leaking in our validation metrics and that our validation approach may be too drastic (in real speech, WebRTC usually has issues at speech start / end and is difficult to tune)... but just for lulz we applied our VAD to NASA's recordings of the Apollo program, and it worked. WebRTC did not really work there.
@faroit

I made a plan for adding `dtype` to the `save` function: https://github.com/pytorch/audio/issues/1197. I would appreciate it if you could take a look.
Hi @snakers4

> When we started writing our audio pipelines we essentially used just scipy wavread and pysoundfile reading just integers to avoid any bias inside of audio libraries. And there were some gotchas and insights that may seem relevant in this context:
>
> - While in general for DL it may hardly matter if you normalize to say `1 / abs(max(wav))` or to `1 / (2 ** 15 - 1)` (NNs can even work better because of such "errors"), it will certainly matter for edge cases, like whispers or just audios with loud noise, high dynamic range, rapid change in volume, etc.
> - The thing that you may want to avoid is making your pre-processing work against you in these edge cases (or in different I/O settings with different libraries);
> - So I suppose that the optimal strategy may be just to stick to reading audio in [-1, 1] (preferably just manually doing the normalization part), and then perform some form of STFT and then apply some dynamic normalization, so the "brightness" of loud and quiet parts of the audio does not differ 10x or 100x. This may help disentangle the actual I/O part from the logic part;
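The recipe in the last bullet can be sketched as follows (all parameter values here are illustrative assumptions, not taken from the comment above):

```python
import torch

def normalized_log_spectrogram(waveform: torch.Tensor, n_fft: int = 400) -> torch.Tensor:
    """Map a [-1, 1] float waveform to a per-utterance-normalized log-STFT.

    The per-utterance mean/std normalization keeps loud and quiet
    recordings in a similar numeric range ("brightness"), decoupling
    the model input from I/O-level scaling differences.
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft=n_fft, window=window, return_complex=True).abs()
    log_spec = torch.log(spec + 1e-6)
    return (log_spec - log_spec.mean()) / (log_spec.std() + 1e-6)
```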
I believe I share a similar view. The only concern is that the normalization should follow the standard approach, so that the model works with other audio sources and libraries, even though, as you pointed out, the impact would not be huge.
> Also I hope that our `torch.hub` submission of the VAD gets approved soon!
I believe your request has been approved. The team has invited us to work on the approval process, so in the future hopefully your experience will be smoother.
> Also also I understand that we may be kind of leaking in our validation metrics and that our validation approach may be too drastic (in real speech, web RTC usually has issues on speech start / end and it is difficult to tune) ... but just for lulz we applied our VAD to NASA's recordings of the Apollo program, and it worked. Web RTC did not really work there.
That's super cool. The recordings are very noisy, right? Did you use noisy train samples as well?
> That's super cool. The recordings are very noisy, right? Did you use noisy train samples as well?
We used a lot of noise (we collected a proprietary database from a number of sources) as augmentation when training our models. The NASA samples are noisy with low SNR. We had to tweak two probabilities for NASA, though; they are documented and their meaning is obvious.
Hi @faroit

Regarding the `save` function: I added `bits_per_sample` and `encoding` options in #1226. Unfortunately, I could not make it default to 16-bit for fear of BC-breaking behavior, but with the new parameters you can just pass `encoding="PCM_S", bits_per_sample=16` to save tensor data as 16-bit signed integer PCM. You do not need to perform the conversion yourself. Let me know what you think.
Hi @f0k

It turned out that `libsox` has the capability of converting numerical types (among float, uint8, int16, int32, etc.), so now the `save` function can handle Tensors of dtype float, uint8, int16, and int32 natively. You can call `torchaudio.save(path, tensor, format="wav", encoding="PCM_S"|"PCM_U"|"PCM_F", bits_per_sample=8|16|32)` without manually converting the Tensor.
> Regarding the save function. I added bits_per_sample and encoding option in #1226. Unfortunately, I could not make it default to 16-bit for fear of BC-breaking behavior, but with the new parameters, you can just do encoding="PCM_S", bits_per_sample=16 to save tensor data to 16-bit signed integer PCM. You do not need to perform conversion by yourself. Let me know what you think.

@mthrok sounds good. What happens if you specify non-standard combinations such as `encoding="PCM_F", bits_per_sample=8`?
> @mthrok sounds good. What happens if you specify non-standard combinations such as `encoding="PCM_F", bits_per_sample=8`?
@faroit If the combination is allowed, it will succeed, but if the combination is not supported, it will raise an error. This was a typical hard-error-vs-fallback design decision, and we decided to start with a hard error. If users find this behavior too inconvenient, we can change it.
With `torchaudio.load()` in v0.8, the sox_io backend does not support 24-bit signed PCM audio files. Right now the only workaround is to switch back to the sox backend using `torchaudio.set_audio_backend("sox")`.

Is 24-bit signed going to be supported in 0.9 before removing sox? Thanks!

It is not possible to convert the dataset I'm using to 16-bit or 32-bit.
> With `torchaudio.load()` in v0.8, the sox_io backend does not support 24-bit signed PCM audio files. Right now the only workaround is to switch back to the sox backend using `torchaudio.set_audio_backend("sox")`. Is 24-bit signed going to be supported in 0.9 before removing sox? Thanks! It is not possible to convert the dataset I'm using to 16-bit or 32-bit.
Hi @ketanhdoshi

Thanks for the report. If it's causing you trouble, we will definitely support it. Since PyTorch does not have a 24-bit int type, I need to think about the behavior when `normalize=False`.

In your use case, are you loading data as `float32`?

Also, if you can tell us a command that generates the same type of file you are dealing with (with tools like `ffmpeg` or `sox`), that would be helpful.
> Thanks for the report. If it's causing you the trouble, we will definitely support it. Since PyTorch does not have 24-bit int type, I need to think of a behavior when `normalize=False`. In your use case, are you loading data in `float32` type? Also if you can tell us a command to generate the same type you are dealing with (with tools like `ffmpeg` or `sox`), that will be helpful.
Thanks @mthrok. Yes, the data is being loaded as float32. Here's an example of a dataset I'm using, many of whose sound files are in 24-bit signed format.
> With `torchaudio.load()` in v0.8, the sox_io backend does not support 24-bit signed PCM audio files. Right now the only workaround is to switch back to the sox backend using `torchaudio.set_audio_backend("sox")`. Is 24-bit signed going to be supported in 0.9 before removing sox?
I'm running into the same issue: I'm loading some 24-bit audio files and sox_io fails to load them. I can use the sox backend for now, but I would appreciate it if the 24-bit format could be supported in sox_io too.

A good way to handle `normalize=False` would be to make it unsupported for this specific format, given that most of the time people use `normalize=True` (at least that's what I do almost always). Another idea would be to convert the 24-bit format automatically/internally to 32-bit even when `normalize=False`.

Thanks
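For reference, the second idea (widening 24-bit samples to int32) can be sketched outside torchaudio with NumPy. Shifting the 24 payload bits into the top of the int32 range is one possible convention, not torchaudio's documented behavior:

```python
import numpy as np

def pcm24_to_int32(raw: bytes) -> np.ndarray:
    """Unpack little-endian 24-bit signed PCM bytes into int32 samples."""
    frames = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 3).astype(np.int32)
    # Assemble the three little-endian bytes into a 24-bit value.
    samples = frames[:, 0] | (frames[:, 1] << 8) | (frames[:, 2] << 16)
    # Sign-extend from 24 bits to 32 bits.
    samples[samples >= 1 << 23] -= 1 << 24
    # Widen into the full int32 range so downstream int32 code works unchanged.
    return samples << 8
```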
@ketanhdoshi 24-bit support seems to have been added to the master branch a couple of days ago: https://github.com/pytorch/audio/pull/1389. I tested it (nightly build) and it seems to work for me!
@aelimame @ketanhdoshi Sorry, I forgot to let you know, but we added 24-bit support.

It's nice to learn that it is working for you, @aelimame. @ketanhdoshi, please try the nightly build and see if it works. If not, let us know.
FYI @ketanhdoshi @aelimame: 24-bit support has been ported to release `0.8.1`.
Closing the issue, as 0.9 is released, which concludes the migration. Thank you to all the people who gave feedback.
## tl;dr: how to migrate to the new backend/interface in 0.7

If you are using `torchaudio` in Linux/macOS environments, please use `torchaudio.set_audio_backend("sox_io")` to adopt the upcoming changes.

If you are in a Windows environment, please set `torchaudio.USE_SOUNDFILE_LEGACY_INTERFACE = False` and reload the backend to use the new interface.

Note that this ships with some bug fixes for formats other than 16-bit signed integer WAV, so you might experience some BC-breaking changes, as described in the section below.
## News

- [UPDATE] 2021/03/06
- [UPDATE] 2021/02/12
  - Added `bits_per_sample` and `encoding` arguments (replacing `dtype`) to the `save` function.
- [UPDATE] 2021/01/29
  - Added `encoding` to `AudioMetaData`.
- [UPDATE] 2021/01/22
  - Added the `format` argument to the `load`/`info`/`save` functions.
  - Added `bits_per_sample` to `AudioMetaData`.
- [UPDATE] 2020/10/21
  - `"soundfile"` backend legacy interface.
- [UPDATE] 2020/09/18
  - The `"soundfile"` backend signature changes are moved from 0.9.0 to 0.8.0 so that they match the `"sox_io"` backend, which becomes the default in 0.8.0.
- [UPDATE] 2020/09/17
  - `libsox` structures such as `signalinfo_t` and `encoding_t`.

## Improving I/O for correct and consistent experience

This is an announcement for users that we are making backward-incompatible changes to the I/O functions of `torchaudio` backends from the 0.7.0 release through the 0.9.0 release.

## What is affected?
### Public APIs

`torchaudio.load`
- When the default backend switches from `"sox"` to `"sox_io"` in 0.8.0, loading audio formats other than 16-bit signed integer WAV returns the correct tensor.
- The interface of the `"soundfile"` backend will be changed in 0.8.0 to match that of the `"sox_io"` backend.

`torchaudio.save`
- With the `"sox_io"` backend, saving audio files will no longer degrade the data. The supported formats are restricted to tested formats only (please refer to the doc for the supported formats).
- The interface of the `"soundfile"` backend will be changed in 0.8.0 to match that of the `"sox_io"` backend.

`torchaudio.info`
- The interface of the `"soundfile"` backend will be changed in 0.8.0 to match that of the `"sox_io"` backend.

`torchaudio.load_wav`
- Deprecated. (The `load` function with `normalize=False` will provide the same functionality.)

### Internal APIs

The following functions/classes of the `"sox"` backend were accidentally exposed and will be removed in 0.9.0. There is no replacement for them. Please use the `save`/`load`/`info` functions.

- `torchaudio.save_encinfo`
- `torchaudio.get_sox_signalinfo_t`
- `torchaudio.get_sox_encodinginfo_t`
- `torchaudio.get_sox_option_t`
- `torchaudio.get_sox_bool`

The signatures of the other backends are not planned to be changed within this overhaul plan.

`torchaudio.SignalInfo` and `torchaudio.EncodingInfo`
- `AudioMetaData` in 0.8.0 for the `"soundfile"` backend.

## Why
backendWhy
There are currently three backends in
torchaudio
. (Please refer to the documentation for the detail.)"sox"
backend is the original backend, which bindslibsox
withpybind11
. The functionalities (load
/save
/info
) of this backend are not well-tested and have number of issues. (See https://github.com/pytorch/audio/pull/726).Fixing these issues in backward-compatible manner is not straightforward. Therefore while we were adding TorchScript-compatible I/O functions, we decided to deprecate this original
"sox"
backend and replace it with the new backend ("sox_io"
backend), which is confirmed not to have those issues.When we are switching the default backend for Linux/macOS from
"sox"
to"sox_io"
backend, we would like to align the interface of"soundfile"
backend, therefore, we introduced the new interface (not a new backend to reduce the number of public API) to"soundfile"
backend.When / What Changes
The following is the timeline for the planned changes;
"sox"
backend issues deprecation warning. ~#904~"soundfile"
backend issues warning of expected signature change. ~#906~"soubdfile"
backend. ~#922~load_wav
function of all backends are marked as deprecated. ~#905~"sox_io"
backend becomes default backend. Function signatures of"soundfile"
backend are aligned with"sox_io"
backend. ~#978~get_sox_XXX
functions issue deprecation warning. ~#975~"sox"
backend is removed. ~#1311~"soundfile"
backend is removed. ~#1311~load_wav
functions are removed from all backends. ~#1362~Planned signature changes of
"soundfile"
backend in 0.8.0The following is the planned signature change of
"soundfile"
backend functions in 0.8.0 release.info
functionAudioMetaData
implementation can be found here. The placement of theAudioMetaData
might be changed.Migration
The values returned from
info
function will be changed. Please use the corresponding new attributes.Note If the attribute you are using is missing, file a Feature Request issue.
### `load` function

**Migration**

Please change the argument names:

- `normalization` -> `normalize`
- `offset` -> `frame_offset`
### `save` function

**Migration**

**BC-breaking changes**

Read and write operations on formats other than 16-bit signed integer WAV were affected by small bugs.