nateanl opened 2 years ago
cc @fakufaku @mravanelli @sw005320
@nateanl thanks for starting the discussion.
Here are some initial thoughts.
@nateanl proposes two options. Is it acceptable to have both options? This is what we try to do in pyroomacoustics: if the argument is a string, we pull the material from the database; if not, we follow something similar to option 2.
Another way to have both options would be to provide a helper function that takes as argument the material name (as option 1) and returns the appropriate tensor to pass to the simulation function following option 2.
I will try to ask around what people use in pyroomacoustics, but I am not fully aware of how it is being used :) It seems to be used way beyond just ML augmentation applications, by some acousticians and musicians.
I think pyroomacoustics tries to be too flexible wrt the number of bands and the choice of center frequencies. They have to be provided for every new material defined. This is a bit too heavy. In hindsight, I think having a fixed set of octave bands (e.g., starting at 125 Hz) is sufficient. In that case, the number of bands is a function of the input sampling frequency, and so are the center frequencies. It would be reasonable to require that the user provide the coefficients for the fixed bands and frequencies and perform interpolation/extrapolation offline.
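Just to illustrate that idea (not an existing torchaudio or pyroomacoustics API): with a fixed octave-band grid starting at 125 Hz, the band count follows directly from the sampling rate, e.g.:

```python
def octave_band_centers(sample_rate: float, base: float = 125.0) -> list[float]:
    """Octave-band center frequencies starting at `base` Hz, up to Nyquist."""
    centers = []
    f = base
    while f < sample_rate / 2:
        centers.append(f)
        f *= 2.0
    return centers

# octave_band_centers(16000) -> [125.0, 250.0, 500.0, 1000.0, 2000.0, 4000.0]
```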
These parameters are necessary for:
- computing the speed of sound
- computing the air absorption coefficients

Another option is to allow the user to manually set the speed of sound. Both have pros and cons.

Pros/cons of using temperature/humidity:
+ intuitive
+ no risk of picking a speed of sound / air absorption coefficient corresponding to different environmental conditions
- in some setups, we may know the speed of sound but not temp/humidity, and then we need to work backward to find which temp/hum will give the desired speed of sound

The disadvantage of providing only the speed of sound is that we need to work backward to find what the temp/hum should be and choose the air absorption coefficients accordingly.
One solution would be to have a helper function that takes a speed of sound as input and returns a plausible pair of temp/hum for this speed of sound.
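A minimal sketch of such a helper (the function name is hypothetical), assuming the common linear approximation c ≈ 331.4 + 0.6·T + 0.0124·H and a fixed default humidity:

```python
def temp_hum_from_speed(c: float, humidity: float = 50.0) -> tuple[float, float]:
    """Return a plausible (temperature [degC], humidity [%]) pair for a given
    speed of sound c [m/s], inverting c ~= 331.4 + 0.6 * T + 0.0124 * H
    at a fixed relative humidity."""
    temperature = (c - 331.4 - 0.0124 * humidity) / 0.6
    return temperature, humidity

# temp_hum_from_speed(343.0) -> (~18.3 degC, 50.0 %)
```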
> `sample_rate` is an integer to decide the sample rate of simulated RIRs.
I would suggest that the sampling rate be a float, or at least that both types be supported. Supporting a fractional sampling rate allows simulating minute variations of the sampling rate, which is useful in some niche applications like asynchronous array processing where sampling rates may vary by a fraction of a Hertz.
I don't see any reason to use a purely integer sampling rate (except consistency with the torchaudio API?), so the general case should be preferred.
Very cool!
I'm curious what would be the main target of this development. Is it to provide the precise RIR as much as possible? Is it intended as a data augmentation method? Is it used to make the generation process of room impulse response a part of a computational graph? Or does it try to cover everything?
Maybe, there would be other targets.
Thanks @nateanl. The description looks good overall.
RE: API Design
RE: Option 1 & 2
I see no reason why we have to have only one API for this. It sounds like we can have a core API, which is fully customizable (requires expertise and trial and error), and then a somewhat easier-to-use wrapper function. I think this is what @fakufaku is suggesting as well. However, if taking the route of a helper function, I am not sure where to put it in torchaudio. Perhaps the `torchaudio.utils` module.
RE: Sample rate
Sample rate does not need to be constrained to an integer type, given that the underlying algorithm can handle a non-integral type. I am okay with a fractional sample rate, but we need to think about the approach. Should we do it for this function or make it globally available? I am working on an FFmpeg-based media encoder, and at some places a fractional frame rate (or an approximation) is required for the NTSC 30000 / 1001 frame rate.
RE: Union types
(nit) If the argument is not used for Tensor-like ops, a tuple might be a better fit, because that way the dispatcher can check the number of elements for you.
Thanks all for the discussions.
RE: Option1 & Option 2
Regarding the options for materials, it seems better to use `float` or `Tensor` in the core API, representing the absorption coefficients. The center frequencies can be more restricted by setting default values, e.g., `[125., 250., 500., 1000., 2000., 4000., 8000.]`. Then we can provide a helper function to look up coefficients for a given material name.
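Such a lookup helper could look roughly like this (the function name and the material table below are purely illustrative, not an existing torchaudio API or real pyroomacoustics data):

```python
import torch

# Illustrative per-octave-band absorption values for two made-up materials.
_MATERIALS = {
    "hard_surface": [0.02, 0.02, 0.03, 0.03, 0.04, 0.05, 0.05],
    "carpet_heavy": [0.08, 0.24, 0.57, 0.69, 0.71, 0.73, 0.73],
}

def absorption_from_material(name: str) -> torch.Tensor:
    """Look up per-band absorption coefficients for a named material."""
    return torch.tensor(_MATERIALS[name], dtype=torch.float32)

# absorption_from_material("carpet_heavy").shape -> torch.Size([7]), matching
# the default center frequencies [125, 250, 500, 1000, 2000, 4000, 8000] Hz.
```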
Similarly for sound speed versus temperature and humidity: we can use the sound speed in the core API, and provide a helper function to compute the sound speed given temperature and humidity.
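A minimal sketch of such a helper (name is hypothetical), assuming the common linear approximation c ≈ 331.4 + 0.6·T + 0.0124·H:

```python
def speed_of_sound(temperature: float = 20.0, humidity: float = 50.0) -> float:
    """Approximate speed of sound [m/s] from temperature [degC] and relative humidity [%]."""
    return 331.4 + 0.6 * temperature + 0.0124 * humidity

# speed_of_sound(20.0, 50.0) -> ~344.0 m/s
```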
RE: Is it to provide the precise RIR as much as possible? Is it intended as a data augmentation method?
We try to simulate the RIR as precisely as possible; however, as a starting point, we limit the room to a shoebox-like room to simplify the simulation. Later, if there is a requirement for more realistic rooms, we can add a new method that builds the room from "corners" and uses DFS to find image sources.
RE: Is it used to make the generation process of room impulse response a part of a computational graph? & Does this support GPU? Yes, the method should be differentiable, and GPU-compatible.
RE: description about the returned Tensor
I made changes in the above post and also list it here.
The returned Tensor is a 2D Tensor with dimensions `(channel, max_rir_length)`. `channel` is the number of microphones in the array. Given `max_order`, we compute the maximum distance `d_max` of all qualified image sources to the microphone array, then `max_rir_length` is computed by `d_max / C * sample_rate + filter_len`, where `C` is the sound speed and `filter_len` is the filter length in impulse response simulation.
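As a sketch of that length computation (the rounding details and the example values are assumptions):

```python
import math

def max_rir_length(d_max: float, sample_rate: float, filter_len: int, c: float = 343.0) -> int:
    """RIR length in samples: delay of the farthest image source plus the
    length of the fractional-delay filter written for each impulse."""
    return int(math.ceil(d_max / c * sample_rate)) + filter_len

# e.g. max_rir_length(d_max=20.0, sample_rate=16000, filter_len=81) -> 933 + 81 = 1014
```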
To support differentiability, the output shape needs to be stable. However, `room`, `mic_array`, `source`, and `max_order` can all affect the length of the RIR signal.
To solve this issue, I think it's good to add an `output_length` argument to the `simulate_rir_ism` method. If the actual RIR signal is longer than `output_length`, the tail of the signal is cut; zero values are padded if the actual RIR signal is shorter. The functionality is very similar to `max_order`, which decides how many image sources are included in the computation and thus the final length of the signal.
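A minimal sketch of that trimming/zero-padding behavior (the function name is hypothetical):

```python
import torch

def fix_rir_length(rir: torch.Tensor, output_length: int) -> torch.Tensor:
    """Trim or zero-pad a (channel, length) RIR tensor to exactly output_length samples."""
    channel, length = rir.shape
    if length >= output_length:
        return rir[:, :output_length]
    pad = torch.zeros(channel, output_length - length, dtype=rir.dtype, device=rir.device)
    return torch.cat([rir, pad], dim=1)
```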
The questions for the `output_length` argument are:
- Should `max_order` be ignored when `output_length` is given? i.e., should the two arguments `max_order` and `output_length` be mutually exclusive, or should `output_length` simply do trimming or zero-padding?
- What should be the default value of `output_length`?

Hi, will this RIR simulation support adding an audio signal to the source? Something like,
```python
room.add_source([1., 1.], signal=signal)
```
For reference: pyroomacoustics documentation
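Even if the simulator itself only returns RIRs, attaching a source signal can be emulated by convolving the dry signal with the simulated RIR; a rough torch-only sketch (the helper name and the simulator call are placeholders, not existing torchaudio functions):

```python
import torch

def apply_rir(signal: torch.Tensor, rir: torch.Tensor) -> torch.Tensor:
    """Convolve a mono signal of shape (N,) with an RIR of shape (channel, L),
    returning the reverberated multi-channel signal of shape (channel, N + L - 1)."""
    channel, rir_len = rir.shape
    kernel = rir.flip(-1).unsqueeze(1)                          # (channel, 1, L)
    padded = torch.nn.functional.pad(signal, (rir_len - 1, rir_len - 1))
    return torch.nn.functional.conv1d(padded.view(1, 1, -1), kernel).squeeze(0)

# usage sketch:
# rir = simulate_rir_ism(room, source, mic_array, ...)   # (channel, max_rir_length)
# wet = apply_rir(dry_signal, rir)                        # (channel, N + L - 1)
```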
For release 2.0, we plan to add support for multi-channel room impulse response simulation methods under `torchaudio.functional`. The implementation is based on pyroomacoustics, which supports both the "image source" method and the "image source + ray tracing" (hybrid) method. We will support both modes in two separate methods.

Diagram
Here is the diagram of how the code works:
Both methods compute image sources as the first step. The difference is that for the pure image source method, only `absorption_coefficient` is used to estimate the attenuations for each order, while for the hybrid method, both `absorption_coefficient` and `scattering` are used, if `scattering` is provided by users. Then the image source locations are used to estimate impulse responses (IRs) in the `_build_rir` method. The hybrid method applies ray tracing to estimate IRs for late reverberation.

Besides the above two methods, we plan to support an Array Transfer Function (ATF) based simulation method; here is the diagram:
The first few steps are the same as `simulate_rir_ism` or `simulate_rir_hybrid`, depending on the mode it selects.

API Design
The API of `simulate_rir_ism` will be like:
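A possible shape of that signature, inferred from the argument descriptions below (names, defaults, and ordering are only a sketch, not necessarily the final API):

```python
import torch
from typing import Optional, Union

def simulate_rir_ism(
    room: torch.Tensor,                        # (D,) room size, D = 2 or 3
    source: torch.Tensor,                      # (D,) source coordinates
    mic_array: torch.Tensor,                   # (channel, D) microphone coordinates
    sample_rate: float = 16000.0,
    max_order: int = 3,
    absorption: Union[float, torch.Tensor] = 0.0,
    center_frequency: Optional[torch.Tensor] = None,
    temperature: Optional[float] = None,
    humidity: Optional[float] = None,
    output_length: Optional[int] = None,
) -> torch.Tensor:                             # (channel, max_rir_length)
    ...
```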
where

- `room` is a 1D Tensor with `D` values that represents the room size, where `D` depends on whether the room is a 2D or 3D room.
- `mic_array` is a 2D Tensor with dimensions `(channel, D)`, representing the coordinates of the microphones in the array.
- `source` is a 1D Tensor with `D` values that represents the coordinates of the sound source.
- `sample_rate` is an integer to decide the sample rate of the simulated RIRs.
- `max_order` is the maximum order of wall reflections, to save computation in the image source method.
- `temperature` and `humidity` are parameters to compute the sound speed; by default the sound speed is 343 m/s.

The returned Tensor is a 2D Tensor with dimensions `(channel, max_rir_length)`. `channel` is the number of microphones in the array. Given `max_order`, we compute the maximum distance `d_max` of all qualified image sources to the microphone array, then `max_rir_length` is computed by `d_max / C * sample_rate + filter_len`, where `C` is the sound speed and `filter_len` is the filter length in impulse response simulation.

`material` is the trickiest argument. In pyroomacoustics, it can accept a single floating-point value, assuming it is the same for all 6 walls (4 walls + ceiling + floor), or a dictionary of materials where each wall has a different absorption coefficient. In the most extreme case, it is a dictionary of 6 materials, where each material has a list of absorption coefficients, one per center frequency; in that case, we should also provide the list of center frequencies to compute the attenuations.

Based on the above use cases, there are two possible APIs for the materials:
Option 1
Give limited `str` choices for the wall, ceiling, and floor. The input arguments will be `wall_material`, `ceiling_material`, and `floor_material`, respectively. The options can be found in https://github.com/LCAV/pyroomacoustics/blob/master/pyroomacoustics/data/materials.json, which records the coefficients of real materials. The shortcoming of this option is that it can't be differentiable if users want to estimate the absorption coefficients via a neural network.

Option 2
Use `absorption` and `center_frequency` as the input arguments; the type will be `Union[float, Tensor]`.

- In the `float` case, it assumes the coefficient is the same for all walls.
- In the `Tensor` case, there are two possible use cases:
  - shape `(4,)` (2D room) or `(6,)` (3D room), meaning each wall has its own coefficient;
  - shape `(num_bands, 4)` or `(num_bands, 6)`, where `num_bands` refers to the number of center frequencies. `center_frequency` should also be provided in this case.

The shortcoming of this option is that it can accept unrealistic materials that don't exist (the best we can do is make sure the coefficients are smaller than 1). The advantage is that the module can be differentiable, i.e., passing the room size and source location along with the coefficients as input, and generating the RIRs as the output.
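To make the Tensor shapes in Option 2 concrete, an illustrative sketch for a 3D room (the coefficient values are arbitrary):

```python
import torch

# (6,): one absorption coefficient per wall of a 3D room (4 walls + ceiling + floor).
absorption_per_wall = torch.tensor([0.2, 0.2, 0.3, 0.3, 0.5, 0.1])

# (num_bands, 6): one coefficient per wall and per center frequency;
# center_frequency must be provided alongside it.
center_frequency = torch.tensor([125., 250., 500., 1000., 2000., 4000., 8000.])
absorption_per_band = torch.rand(center_frequency.numel(), 6)  # values in [0, 1)
```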
We would like to hear users' feedback to decide how to proceed with the API design.