nateanl opened 2 years ago
cc @fakufaku @mravanelli @sw005320
@nateanl thanks for starting the discussion.
Here are some initial thoughts.
@nateanl proposes two options. Is it acceptable to have both options? This is what we try to do in pyroomacoustics: if the argument is a string, we pull the material from the database; if not, we follow something similar to option 2.
Another way to have both options would be to provide a helper function that takes as argument the material name (as option 1) and returns the appropriate tensor to pass to the simulation function following option 2.
I will try to ask around what people use in pyroomacoustics, but I am not fully aware of how it is being used :) It seems to be used way beyond just ML augmentation applications, by some acousticians and musicians.
I think pyroomacoustics tries to be too flexible wrt the number of bands and the choice of center frequencies. They have to be provided for every new material defined. This is a bit too heavy. In hindsight, I think having a fixed set of octave bands (e.g., starting at 125 Hz) is sufficient. In that case, the number of bands is a function of the input sampling frequency, and so are the center frequencies. It would be reasonable to require that the user provide the coefficients for the fixed bands and frequencies and perform interpolation/extrapolation offline.
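Just to illustrate that idea (not an existing torchaudio or pyroomacoustics API): with a fixed octave-band grid starting at 125 Hz, the band count follows directly from the sampling rate, e.g.:

```python
def octave_band_centers(sample_rate: float, base: float = 125.0) -> list[float]:
    """Octave-band center frequencies starting at `base` Hz, up to Nyquist."""
    centers = []
    f = base
    while f < sample_rate / 2:
        centers.append(f)
        f *= 2.0
    return centers

# octave_band_centers(16000) -> [125.0, 250.0, 500.0, 1000.0, 2000.0, 4000.0]
```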
These parameters are necessary for:
- computing the speed of sound
- computing the air absorption coefficients

Another option is to allow the user to manually set the speed of sound. Both have pros and cons.

Pros/cons of using temperature/humidity:
+ intuitive
+ no risk of picking a speed of sound / air absorption coefficient corresponding to different environmental conditions
- in some setups, we may know the speed of sound but not temp/humidity, and then we need to work backward to find which temp/hum will give the desired speed of sound

The disadvantage of providing only the speed of sound is that we need to work backward to find what the temp/hum should be and choose the air absorption coefficients accordingly.
One solution would be to have a helper function that takes a speed of sound as input and returns a plausible pair of temp/hum for this speed of sound.
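A minimal sketch of such a helper (the function name is hypothetical), assuming the common linear approximation c ≈ 331.4 + 0.6·T + 0.0124·H and a fixed default humidity:

```python
def temp_hum_from_speed(c: float, humidity: float = 50.0) -> tuple[float, float]:
    """Return a plausible (temperature [degC], humidity [%]) pair for a given
    speed of sound c [m/s], inverting c ~= 331.4 + 0.6 * T + 0.0124 * H
    at a fixed relative humidity."""
    temperature = (c - 331.4 - 0.0124 * humidity) / 0.6
    return temperature, humidity

# temp_hum_from_speed(343.0) -> (~18.3 degC, 50.0 %)
```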
> `sample_rate` is an integer to decide the sample rate of simulated RIRs.
I would suggest that the sampling rate be a float, or at least that both types be supported. Supporting a fractional sampling rate allows simulating minute variations of the sampling rate, which is useful in some niche applications like asynchronous array processing where sampling rates may vary by a fraction of a Hertz.
I don't see any reason to use a purely integer sampling rate (except consistency with the torchaudio API?), so the general case should be preferred.
Very cool!
I'm curious what would be the main target of this development. Is it to provide the precise RIR as much as possible? Is it intended as a data augmentation method? Is it used to make the generation process of room impulse response a part of a computational graph? Or does it try to cover everything?
Maybe, there would be other targets.
Thanks @nateanl. The description looks good overall.
RE: API Design
RE: Option 1 & 2
I see no reason why we have to have only one API for this. It sounds like we can have a core API, which is fully customizable (requires expertise and trial and error), and then a somewhat easier-to-use wrapper function. I think this is what @fakufaku is suggesting as well. However, if taking the route of a helper function, I am not sure where to put it in torchaudio. Perhaps the `torchaudio.utils` module.
RE: Sample rate
Sample rate does not need to be constrained to an integer type, given that the underlying algorithm can handle a non-integral type. I am okay with a fractional sample rate, but we need to think about the approach. Should we do it for this function or make it globally available? I am working on an FFmpeg-based media encoder, and at some places a fractional frame rate (or an approximation) is required for the NTSC 30000 / 1001 frame rate.
RE: Union types
(nit) If the argument is not used for Tensor-like ops, a tuple might be a better fit, because that way the dispatcher can check the number of elements for you.
Thanks all for the discussions.
RE: Option1 & Option 2
Regarding the options for materials, it seems better to use `float` or `Tensor` in the core API, representing the absorption coefficients. The center frequencies can be more restricted by setting default values, e.g., `[125., 250., 500., 1000., 2000., 4000., 8000.]`. Then we can provide a helper function to look up coefficients for a given material name.
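Such a lookup helper could look roughly like this (the function name and the material table below are purely illustrative, not an existing torchaudio API or real pyroomacoustics data):

```python
import torch

# Illustrative per-octave-band absorption values for two made-up materials.
_MATERIALS = {
    "hard_surface": [0.02, 0.02, 0.03, 0.03, 0.04, 0.05, 0.05],
    "carpet_heavy": [0.08, 0.24, 0.57, 0.69, 0.71, 0.73, 0.73],
}

def absorption_from_material(name: str) -> torch.Tensor:
    """Look up per-band absorption coefficients for a named material."""
    return torch.tensor(_MATERIALS[name], dtype=torch.float32)

# absorption_from_material("carpet_heavy").shape -> torch.Size([7]), matching
# the default center frequencies [125, 250, 500, 1000, 2000, 4000, 8000] Hz.
```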
Similarly for sound speed versus temperature and humidity: we can use the sound speed in the core API, and provide a helper function to compute the sound speed given temperature and humidity.
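A minimal sketch of such a helper (name is hypothetical), assuming the common linear approximation c ≈ 331.4 + 0.6·T + 0.0124·H:

```python
def speed_of_sound(temperature: float = 20.0, humidity: float = 50.0) -> float:
    """Approximate speed of sound [m/s] from temperature [degC] and relative humidity [%]."""
    return 331.4 + 0.6 * temperature + 0.0124 * humidity

# speed_of_sound(20.0, 50.0) -> ~344.0 m/s
```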
RE: Is it to provide the precise RIR as much as possible? Is it intended as a data augmentation method?
We try to simulate the RIR as precisely as possible; however, as a starting point, we limit the room to a shoebox-like room to simplify the simulation. Later, if there is a requirement for more realistic rooms, we can add a new method that builds the room from "corners" and uses DFS to find image sources.
RE: Is it used to make the generation process of room impulse response a part of a computational graph? & Does this support GPU? Yes, the method should be differentiable, and GPU-compatible.
RE: description about the returned Tensor
I made changes in the above post and also list it here.
The returned Tensor is a 2D Tensor with dimensions `(channel, max_rir_length)`. `channel` is the number of microphones in the array. Given `max_order`, we compute the maximum distance `d_max` of all qualified image sources to the microphone array, then `max_rir_length` is computed by `d_max / C * sample_rate + filter_len`, where `C` is the sound speed and `filter_len` is the filter length in impulse response simulation.
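As a sketch of that length computation (the rounding details and the example values are assumptions):

```python
import math

def max_rir_length(d_max: float, sample_rate: float, filter_len: int, c: float = 343.0) -> int:
    """RIR length in samples: delay of the farthest image source plus the
    length of the fractional-delay filter written for each impulse."""
    return int(math.ceil(d_max / c * sample_rate)) + filter_len

# e.g. max_rir_length(d_max=20.0, sample_rate=16000, filter_len=81) -> 933 + 81 = 1014
```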
To support differentiability, the output shape needs to be stable. However, `room`, `mic_array`, `source`, and `max_order` can all affect the length of the RIR signal.
To solve this issue, I think it's good to add an `output_length` argument to the `simulate_rir_ism` method. If the actual RIR signal is longer than `output_length`, the tail of the signal is cut; zero values are padded if the actual RIR signal is shorter. The functionality is very similar to `max_order`, which decides how many image sources are included in the computation and thus the final length of the signal.
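A minimal sketch of that trimming/zero-padding behavior (the function name is hypothetical):

```python
import torch

def fix_rir_length(rir: torch.Tensor, output_length: int) -> torch.Tensor:
    """Trim or zero-pad a (channel, length) RIR tensor to exactly output_length samples."""
    channel, length = rir.shape
    if length >= output_length:
        return rir[:, :output_length]
    pad = torch.zeros(channel, output_length - length, dtype=rir.dtype, device=rir.device)
    return torch.cat([rir, pad], dim=1)
```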
The questions for the `output_length` argument are:
- Should `max_order` be ignored when `output_length` is given? i.e., should the two arguments `max_order` and `output_length` be mutually exclusive, or should `output_length` simply do trimming or zero-padding?
- What should be the default value of `output_length`?

Hi, will this RIR simulation support adding an audio signal to the source? Something like,
```python
room.add_source([1., 1.], signal=signal)
```
For reference: pyroomacoustics documentation
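Even if the simulator itself only returns RIRs, attaching a source signal can be emulated by convolving the dry signal with the simulated RIR; a rough torch-only sketch (the helper name and the simulator call are placeholders, not existing torchaudio functions):

```python
import torch

def apply_rir(signal: torch.Tensor, rir: torch.Tensor) -> torch.Tensor:
    """Convolve a mono signal of shape (N,) with an RIR of shape (channel, L),
    returning the reverberated multi-channel signal of shape (channel, N + L - 1)."""
    channel, rir_len = rir.shape
    kernel = rir.flip(-1).unsqueeze(1)                          # (channel, 1, L)
    padded = torch.nn.functional.pad(signal, (rir_len - 1, rir_len - 1))
    return torch.nn.functional.conv1d(padded.view(1, 1, -1), kernel).squeeze(0)

# usage sketch:
# rir = simulate_rir_ism(room, source, mic_array, ...)   # (channel, max_rir_length)
# wet = apply_rir(dry_signal, rir)                        # (channel, N + L - 1)
```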
For release 2.0, we plan to add support for multi-channel room impulse response simulation methods under `torchaudio.functional`. The implementation is based on pyroomacoustics, which supports both the "image source" method and the "image source + ray tracing" (hybrid) method. We will support both modes in two separate methods.

Diagram
Here is the diagram of how the code works:
Both methods compute image sources as the first step. The difference is that for the pure image source method, only `absorption_coefficient` is used to estimate the attenuations for each order, while for the hybrid method, both `absorption_coefficient` and `scattering` are used, if `scattering` is provided by users. Then the image source locations are used to estimate impulse responses (IRs) in the `_build_rir` method. The hybrid method applies ray tracing to estimate IRs for late reverberation.

Besides the above two methods, we plan to support an Array Transfer Function (ATF) based simulation method; here is the diagram:
The first few steps are the same as `simulate_rir_ism` or `simulate_rir_hybrid`, depending on the mode it selects.

API Design
The API of `simulate_rir_ism` will be like:
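A possible shape of that signature, inferred from the argument descriptions below (names, defaults, and ordering are only a sketch, not necessarily the final API):

```python
import torch
from typing import Optional, Union

def simulate_rir_ism(
    room: torch.Tensor,                        # (D,) room size, D = 2 or 3
    source: torch.Tensor,                      # (D,) source coordinates
    mic_array: torch.Tensor,                   # (channel, D) microphone coordinates
    sample_rate: float = 16000.0,
    max_order: int = 3,
    absorption: Union[float, torch.Tensor] = 0.0,
    center_frequency: Optional[torch.Tensor] = None,
    temperature: Optional[float] = None,
    humidity: Optional[float] = None,
    output_length: Optional[int] = None,
) -> torch.Tensor:                             # (channel, max_rir_length)
    ...
```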
where

- `room` is a 1D Tensor with `D` values that represents the room size, where `D` depends on whether the room is a 2D or 3D room.
- `mic_array` is a 2D Tensor with dimensions `(channel, D)`, representing the coordinates of the microphones in the array.
- `source` is a 1D Tensor with `D` values that represents the coordinates of the sound source.
- `sample_rate` is an integer to decide the sample rate of the simulated RIRs.
- `max_order` is the maximum order of wall reflections, to save computation in the image source method.
- `temperature` and `humidity` are parameters to compute the sound speed; by default the sound speed is 343 m/s.

The returned Tensor is a 2D Tensor with dimensions `(channel, max_rir_length)`. `channel` is the number of microphones in the array. Given `max_order`, we compute the maximum distance `d_max` of all qualified image sources to the microphone array, then `max_rir_length` is computed by `d_max / C * sample_rate + filter_len`, where `C` is the sound speed and `filter_len` is the filter length in impulse response simulation.

`material` is the trickiest argument. In pyroomacoustics, it can accept a single floating-point value, assuming it is the same for all 6 walls (4 walls + ceiling + floor), or a dictionary of materials where each wall has a different absorption coefficient. In the most extreme case, it is a dictionary of 6 materials, where each material has a list of absorption coefficients, one per center frequency; in that case, we should also provide the list of center frequencies to compute the attenuations.

Based on the above use cases, there are two possible APIs for the materials:
Option 1
Give limited `str` choices for the wall, ceiling, and floor. The input arguments will be `wall_material`, `ceiling_material`, and `floor_material`, respectively. The options can be found in https://github.com/LCAV/pyroomacoustics/blob/master/pyroomacoustics/data/materials.json, which records the coefficients of real materials. The shortcoming of this option is that it can't be differentiable if users want to estimate the absorption coefficients via a neural network.

Option 2
Use `absorption` and `center_frequency` as the input arguments; the type will be `Union[float, Tensor]`.

- In the `float` case, it assumes the coefficient is the same for all walls.
- In the `Tensor` case, there are two possible use cases:
  - shape `(4,)` (2D room) or `(6,)` (3D room), meaning each wall has its own coefficient;
  - shape `(num_bands, 4)` or `(num_bands, 6)`, where `num_bands` refers to the number of center frequencies. `center_frequency` should also be provided in this case.

The shortcoming of this option is that it can accept unrealistic materials that don't exist (the best we can do is make sure the coefficients are smaller than 1). The advantage is that the module can be differentiable, i.e., passing the room size and source location along with the coefficients as input, and generating the RIRs as the output.
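To make the Tensor shapes in Option 2 concrete, an illustrative sketch for a 3D room (the coefficient values are arbitrary):

```python
import torch

# (6,): one absorption coefficient per wall of a 3D room (4 walls + ceiling + floor).
absorption_per_wall = torch.tensor([0.2, 0.2, 0.3, 0.3, 0.5, 0.1])

# (num_bands, 6): one coefficient per wall and per center frequency;
# center_frequency must be provided alongside it.
center_frequency = torch.tensor([125., 250., 500., 1000., 2000., 4000., 8000.])
absorption_per_band = torch.rand(center_frequency.numel(), 6)  # values in [0, 1)
```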
We would like to hear users' feedback to decide how to proceed with the API design.