pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

Room Impulse Response Simulation Support in TorchAudio #2624

Open nateanl opened 2 years ago

nateanl commented 2 years ago

For release 2.0, we plan to add support for multi-channel room impulse response (RIR) simulation methods under torchaudio.functional. The implementation is based on pyroomacoustics, which supports both the "image source" method and the hybrid "image source + ray tracing" method. We will support the two modes in two separate methods.

Diagram

Here is the diagram of how the code works:

*(diagram image)*

Both methods compute image sources as the first step. The difference is that the pure image source method uses only absorption_coefficient to estimate the attenuation for each reflection order, while the hybrid method uses both absorption_coefficient and scattering, if scattering is provided by the user. The image source locations are then used to estimate impulse responses (IRs) in the _build_rir method. The hybrid method additionally applies ray tracing to estimate IRs for late reverberation.
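The first step (computing image sources) can be sketched in pure Python for a shoebox room. The helper name, the single frequency-independent absorption coefficient, and the first-order-only loop are all illustrative assumptions, not the actual torchaudio implementation:

```python
import math

def first_order_images(room, source, absorption):
    """Mirror the source across each wall of a shoebox room.

    Hypothetical helper: returns (image_position, amplitude_factor) pairs
    for the 2*D first-order image sources of a D-dimensional room.
    """
    images = []
    # Convert the energy absorption coefficient to an amplitude
    # reflection coefficient for a single bounce.
    refl = math.sqrt(1.0 - absorption)
    for d, size in enumerate(room):
        for wall_pos in (0.0, size):
            img = list(source)
            img[d] = 2.0 * wall_pos - img[d]  # mirror across the wall
            images.append((img, refl))        # one bounce, one reflection factor
    return images

# A 6 m x 4 m x 3 m room yields 6 first-order image sources (one per wall).
images = first_order_images([6.0, 4.0, 3.0], [1.0, 2.0, 1.5], absorption=0.3)
print(len(images))  # 6
```

Higher orders are obtained by mirroring recursively; the attenuation of an order-n image is the product of its n reflection factors.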

Besides the above two methods, we plan to support an Array Transfer Function (ATF) based simulation method, here is the diagram:

*(diagram image)*

The first few steps are the same as in simulate_rir_ism or simulate_rir_hybrid, depending on the selected mode.

API Design

The API of simulate_rir_ism will be like:

simulate_rir_ism(
    room: Tensor,
    mic_array: Tensor,
    source: Tensor,
    sample_rate: int,
    max_order: int,
    wall_material: str = "",
    ceiling_material: str = "",
    floor_material: str = "",
    air_absorption: bool = False,
    temperature: Optional[float] = None,
    humidity: Optional[float] = None,
) -> Tensor

where room is a 1D Tensor with D values representing the room size, where D depends on whether the room is 2D or 3D. mic_array is a 2D Tensor with dimensions (channel, D), representing the coordinates of the microphones in the array. source is a 1D Tensor with D values representing the coordinates of the sound source. sample_rate is an integer that determines the sample rate of the simulated RIRs. max_order is the maximum order of wall reflections, used to bound the computation in the image source method. temperature and humidity are parameters used to compute the speed of sound; by default the speed of sound is 343 m/s.

The returned Tensor is a 2D Tensor with dimensions (channel, max_rir_length), where channel is the number of microphones in the array. Given max_order, we compute the maximum distance d_max from any qualified image source to the microphone array; max_rir_length is then computed as d_max / C * sample_rate + filter_len, where C is the speed of sound and filter_len is the filter length used in impulse response simulation.
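As a concrete sketch of the length formula above (all numeric values here are hypothetical, including the filter length):

```python
import math

C = 343.0           # speed of sound in m/s (the stated default)
sample_rate = 16000
filter_len = 81     # assumed fractional-delay filter length
d_max = 12.5        # assumed max image-source-to-mic distance for the given max_order

# max_rir_length = d_max / C * sample_rate + filter_len, rounded up to whole samples
max_rir_length = int(math.ceil(d_max / C * sample_rate)) + filter_len
print(max_rir_length)  # 665
```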

material is the trickiest argument. In pyroomacoustics, it can be a single floating-point value, assumed to be the same for all 6 walls (4 walls + ceiling + floor), or a dictionary of materials in which each wall has a different absorption coefficient. In the most general case, it is a dictionary of 6 materials, each with a list of absorption coefficients, one per center frequency; in that case, we must also provide the list of center frequencies to compute the attenuations.

Based on the above use cases, there are two possible APIs for the materials:

Option 1

Offer a limited set of str choices for the wall, ceiling, and floor. The input arguments will be wall_material, ceiling_material, and floor_material, respectively. The options can be found in https://github.com/LCAV/pyroomacoustics/blob/master/pyroomacoustics/data/materials.json, which records the coefficients of real materials. The shortcoming of this option is that it is not differentiable, should users want to estimate the absorption coefficients via a neural network.
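Option 1 amounts to a lookup table keyed by material name. A minimal sketch with made-up coefficients (the real values live in pyroomacoustics' materials.json and differ from these):

```python
# Hypothetical excerpt of a material database; coefficients are invented
# for illustration, one absorption value per octave band.
MATERIALS = {
    "brick_wall_rough":   [0.02, 0.03, 0.03, 0.04, 0.05, 0.07],
    "carpet_on_concrete": [0.02, 0.06, 0.14, 0.37, 0.60, 0.65],
}
CENTER_FREQS = [125, 250, 500, 1000, 2000, 4000]  # Hz, octave bands

def lookup_material(name):
    """Resolve a material name to its per-band absorption coefficients."""
    if name not in MATERIALS:
        raise ValueError(f"unknown material: {name!r}")
    return MATERIALS[name]

coeffs = lookup_material("carpet_on_concrete")
print(len(coeffs))  # 6, one coefficient per octave band
```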

Option 2

Use absorption and center_frequency as input arguments, with type Union[float, Tensor].

The shortcoming of this option is that it can accept unrealistic materials that don't exist (the best we can do is ensure the coefficients are smaller than 1). The advantage is that the module can be differentiable, i.e., the room size, source location, and coefficients can be passed as inputs, with the RIRs generated as output.
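The float-or-per-band input handling of Option 2 could look like the following sketch (plain Python lists stand in for Tensors; the function name, wall/band counts, and validation are assumptions):

```python
def normalize_absorption(absorption, num_bands=6, num_walls=6):
    """Expand a float, or pass through nested per-wall lists, into a
    (num_walls, num_bands) coefficient table, validating the range."""
    if isinstance(absorption, float):
        # One scalar: the same coefficient for every wall and every band.
        coeffs = [[absorption] * num_bands for _ in range(num_walls)]
    else:
        coeffs = [list(wall) for wall in absorption]
    for wall in coeffs:
        for c in wall:
            if not 0.0 <= c <= 1.0:
                raise ValueError("absorption coefficients must lie in [0, 1]")
    return coeffs

coeffs = normalize_absorption(0.25)
print(len(coeffs), len(coeffs[0]))  # 6 6
```

With Tensors instead of lists, the same expansion keeps the coefficients in the autograd graph, which is what makes the option differentiable.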

We would like to hear users' feedback to decide how to proceed with the API design.

nateanl commented 2 years ago

cc @fakufaku @mravanelli @sw005320

fakufaku commented 2 years ago

@nateanl thanks for starting the discussion.

Here are some initial thoughts.

Wall characteristics

@nateanl proposes two options. Is it acceptable to have both? This is what we try to do in pyroomacoustics: if the argument is a string, we pull the material from the database; if not, we follow something similar to option 2.

Another way to have both options would be to provide a helper function that takes as argument the material name (as option 1) and returns the appropriate tensor to pass to the simulation function following option 2.

I will try to ask around about what people use in pyroomacoustics, but I am not fully aware of how it is being used :) It seems to be used well beyond ML augmentation applications, by some acousticians and musicians.

Number of frequency bands and center frequencies

I think pyroomacoustics tries to be too flexible with respect to the number of bands and the choice of center frequencies: they have to be provided for every newly defined material, which is a bit heavy. In hindsight, I think a fixed set of octave bands (e.g., starting at 125 Hz) is sufficient. In that case, the number of bands is a function of the input sampling frequency, and so are the center frequencies. It would be reasonable to require the user to provide coefficients for the fixed bands and frequencies, and to perform interpolation/extrapolation offline.
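The fixed octave-band scheme can be sketched as follows, with the band centers derived from the sampling rate alone (function name and the strict below-Nyquist cutoff are assumptions):

```python
def octave_center_freqs(sample_rate, base=125.0):
    """Octave-band center frequencies starting at `base` Hz,
    doubling until the Nyquist frequency is reached."""
    freqs = []
    f = base
    while f < sample_rate / 2:
        freqs.append(f)
        f *= 2.0
    return freqs

print(octave_center_freqs(16000))  # [125.0, 250.0, 500.0, 1000.0, 2000.0, 4000.0]
```

At 16 kHz this yields 6 bands; at 48 kHz, 8 bands. The user would then only supply one coefficient per fixed band.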

Temperature and humidity

These parameters are necessary for computing the speed of sound and the air absorption coefficients.

Another option is to allow the user to manually set the speed of sound. Both have pros and cons.

Pros/cons of using temperature/humidity

A disadvantage of providing only the speed of sound is that we then need to work backward to what the temperature/humidity should be and choose the air absorption coefficients accordingly.

One solution would be to have a helper function that takes a speed of sound as input and returns a plausible pair of temp/hum for this speed of sound.
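Such a helper could invert the common linear approximation c ≈ 331.3 + 0.606·T (T in °C), treating humidity's comparatively small effect on c as negligible. This is only a sketch of the idea, not an exact acoustic model, and the function name and default humidity are assumptions:

```python
def temperature_from_speed(c, humidity=50.0):
    """Return a plausible (temperature, humidity) pair for a given speed
    of sound, inverting c ~= 331.3 + 0.606 * T (T in degrees Celsius).
    Humidity barely affects c, so a default value is returned as-is."""
    temperature = (c - 331.3) / 0.606
    return temperature, humidity

t, h = temperature_from_speed(343.0)
print(round(t, 1))  # 19.3 (degrees Celsius)
```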

Support fractional sampling rate

sample_rate is an integer to decide the sample rate of simulated RIRs.

I would suggest supporting a float sampling rate, or at least supporting both. Supporting a fractional sampling rate allows simulating minute variations of the sampling rate, which is useful in some niche applications like asynchronous array processing, where sampling rates may vary by a fraction of a Hertz.

I don't see any reason to use a purely integer sampling rate (except for consistency with the torchaudio API?), so the general case should be preferred.
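To illustrate why a float sampling rate matters: the (fractional) delay of each image source, measured in samples, depends on it directly, so a sub-Hertz change in the rate shifts every delay slightly (sketch; function name assumed):

```python
def delay_in_samples(distance, sample_rate, c=343.0):
    """Propagation delay of a source at `distance` meters, in samples."""
    return distance / c * sample_rate

# With an integer-only sampling rate, these two cases would collapse.
print(delay_in_samples(3.43, 16000.0))  # ~160.0 samples
print(delay_in_samples(3.43, 16000.5))  # slightly more than 160 samples
```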

sw005320 commented 2 years ago

Very cool!

I'm curious what the main target of this development would be. Is it to provide as precise an RIR as possible? Is it intended as a data augmentation method? Is it used to make the generation of room impulse responses part of a computational graph? Or does it try to cover everything?

Maybe, there would be other targets.

mthrok commented 2 years ago

Thanks @nateanl. The description looks good overall.

nateanl commented 2 years ago

Thanks all for the discussions.

nateanl commented 2 years ago

To support differentiability, the output shape needs to be stable. However, room, mic_array, source, and max_order all can affect the length of the RIR signal.

To solve this issue, I think it's good to add an output_length argument to the simulate_rir_ism method. If the actual RIR signal is longer than output_length, the tail of the signal is truncated; if it is shorter, zero values are padded. The functionality is similar to max_order, which decides how many image sources are included in the computation and hence the final length of the signal.
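The truncate-or-pad behavior described above can be sketched as follows (plain lists stand in for Tensors; the helper name is assumed):

```python
def fix_output_length(rir, output_length):
    """Truncate or zero-pad a 1-D RIR to exactly `output_length` samples,
    giving the stable output shape that differentiability requires."""
    if len(rir) >= output_length:
        return rir[:output_length]          # cut the tail
    return rir + [0.0] * (output_length - len(rir))  # pad with zeros

print(len(fix_output_length([0.1, 0.5, 0.2], 5)))  # 5 (zero-padded)
print(len(fix_output_length([0.1, 0.5, 0.2], 2)))  # 2 (truncated)
```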

The question for the output_length argument is:

DanTremonti commented 1 year ago

Hi, will this RIR simulation support adding an audio signal to the source? Something like,

room.add_source([1.,1.], signal=signal)

For reference: pyroomacoustics documentation