Open ksanjeevan opened 5 years ago
I would be in favor of more verbose names and am not a big fan of the current librosa argument names.
I would be in favor of more verbose names and am not a big fan of the current librosa argument names.
Great. Also note the first goal in the README: "Clear, readible names for class/functions/arguments"
Let's collect some suggestions then. Collecting things from my own code and making them more uniform or more consistent with torch.stft
:
samples
: a tensor of audio samples
sample_rate
: the rate of audio samples (samples per second)
spect
: a tensor of spectrogram or mel/CQT spectrogram frames
frame_len
frame_length
: the length of a spectrogram frame in terms of audio samples
fps
frame_rate
: the rate of spectrogram frames (frames per second)
hop_size
hop_length
: the number of samples between the starts of consecutive frames (i.e., frame_rate == sample_rate / hop_length
)
num_bins
: the number of bins in a spectrogram
num_bands
: the number of bands in a mel/CQT spectrogram
min_freq
: the lowest frequency of the lowest band in a mel/CQT spectrogram
max_freq
: the highest frequency of the highest band in a mel/CQT spectrogram
hop_length
was taken from torch.stft
. frame_length
was for consistency with frame_rate
, which I find a more important concept than the hop_length
, since it's independent of the sample rate. torch.stft
calls it win_length
. If we go with win_length
, we'll have win_rate
. Not so nice.
torch.stft
has n_fft
. I would call this fft_length
, and make frame_length
the primary parameter to set, with fft_length
defaulting to match it, rather than the other way round. I'd then define num_bins = fft_length // 2 + 1
.
(If we have sample_rate
for samples
, then frame_rate
should be for frames
, not for spect
, but that's maybe too generic then...)
I'm not sure if num_bins
and num_bands
should be separate, since they're both describing part of the shape of a spectrogram. But if both of them are called num_bands
, we may run into problems naming the arguments for frequency-domain filterbanks?
Oh, another thing: We may want to differentiate between lengths in terms of discrete samples/frames, and lengths in terms of seconds. I would suggest size
for the former, and length
for the latter, but then we'd move away from torch.stft
. Is this worth it? Since we'll provide our own wrapper for torch.stft
in order to support higher-dimensional tensors, we'd be free to change argument names.
Discuss! :)
Some comments on a few of them:
samples
: this could be confused with number_of_samples
. A better name might be signal
, but it might depend on the actual class, method where you using it.
sample_rate
: I never can decide between sampling rate
and sample rate
, but I guess the seconds wins as it is shorter?
samples
: this could be confused withnumber_of_samples
I think the latter should be num_samples
with analogy to num_bins
, num_bands
. If we rename samples
to signal
(which I'd also be okay with, although I find samples
more explicit), then we still shouldn't shorten num_samples
to just samples
because of the same ambiguity.
I never can decide between
sampling rate
andsample rate
, but I guess the seconds wins as it is shorter?
Yes, interesting. So that's something that sr
avoids :) Another argument for sample_rate
would be the consistency with samples
and num_samples
, and with frame_rate
(not framing_rate
).
Totally agree, I would not propose to shorten num_samples
to samples
. The proposal was more to not use samples
at all as new and non-familiar users might confuse it.
I see the point that sample_rate
fits better with frame_rate
, I was just wondering if "sample rate" or "sampling rate" is better English ;)
The proposal was more to not use
samples
at all as new and non-familiar users might confuse it.
Ok. Another confusion might be with training samples / training examples. I'd also be okay with signal
and signal_size
, signal_length
. Hmm, signal
sounds like something continuous and eternal, but here we deal with discretized, finite excerpts. For this reason, I still like samples
and num_samples
, but then what's the name for the length of samples
in terms of seconds?
Oh, these name discussions :) They can take just as long as defining the function signatures, but I think they are very important to design a good API.
Another confusion might be with training samples / training examples.
I don't really want to go deeply into your name discussion here... but I would be very much in favor of not using samples
here at all, as well.
So often it happened that I supervised people in ML+Audio and they always were confused by what we audio-heads are calling samples
.
If we need vocabulary for sequential samples or frames I would just leave this to timesteps
closely following the RNN syntax.
Also, additionally avoiding frames
is :+1: because even in DSP people can't agree on the definitions (spectrogram frames vs. samples*channels in the codec world).
timesteps
Does it mean we're gonna use something like timestep_rate
for sample rate? My opinion is to keep using sample
and sample_rate
, feeling like giving up those names are a bit of stretch. LET'S BE AUDIOHEADS.
size vs len(gth)
How about n_
or num_
for discrete (instead of size
) and use len(gth)
for things in [second]? size
feels like something potentially multi-dimensional. And -- stuff like n_fft
, frame_len
would be changed accordingly, right?
One more good thing about this is, on n_fft
issue, we can keep using n_fft
so that it's matched to torch.stft
, which is great. num_fft
would be also good, at least that's pretty similar to n_fft
, no one would get confused.
I don't really want to go deeply into your name discussion here...
Oh, please do! Names are important and hard to change.
but I would be very much in favor of not using
samples
here at all, as well.
signal
it is, then? Another advantage of signal
is that you can still form a plural of it, this doesn't work with samples
.
timesteps
Does it mean we're gonna use something like
timestep_rate
for sample rate?
As I understood, this only applies when we need a generic term that encompasses both. For functions that clearly only take raw audio samples, or clearly only take spectrograms, we would use the more specific terms for clarity. I'm not sure if timesteps
is a good general term, to me it sounds like it would only refer to time stamps, not the associated data. And instead of timestep_rate
, I'd lean towards only using rate
when we need a generalization of sample_rate
and frame_rate
.
Also, additionally avoiding
frames
is +1 because even in DSP people can't agree on the definitions
So spect
is good, but what about frame_rate
?
size
feels like something potentially multi-dimensional.
Mhm, true, torch.Tensor.size()
... but that also allows you to give the size of a single dimension. I think the dimension would be clear from the context. Do you have a counter-example?
How about
n_
ornum_
for discrete (instead ofsize
) and uselen(gth)
for things in [second]?
But num_
is a prefix, and length
doesn't work as a prefix. We could use _num
as a suffix.
But even then, length
and size
are exchangeable in a way that length
and num
aren't:
frame_length
- frame_size
- frame_num
?
signal_length
- signal_size
- signal_num
?
spect_length
- spect_size
- spect_num
?
But I see that spect_size
could refer to the two-dimensional shape. More ideas?
And -- stuff like
n_fft
,frame_len
would be changed accordingly, right?
Yes, that was the idea, we'd find a consistent scheme for naming things and use it throughout.
One more good thing about this is, on
n_fft
issue, we can keep usingn_fft
so that it's matched totorch.stft
, which is great.
I don't think we need to match torch.stft
. Users of torchaudio
will not need to use torch.stft
directly, and may not even know about it. Also, the argument names of torch.stft
are the way they are just because they were changed to follow librosa; before pytorch/pytorch#9308 it was called fft_size
.
New idea:
_len
or _length
for the length in samples, i.e., what len
would return (if we had the time in the first dimension, which we can't due to the BCHW layout in PyTorch)._dur
or _duration
for the length in secondsI like _len
for its correspondence to len()
, but I don't really like _dur
because it's slightly cryptic. On the other hand, I assume many people think _duration
is too long, even in times of autocompletion?
Complete next proposal:
signal
: a tensor of audio samples
sample_rate
: the rate of audio samples (samples per second)
frame_len
: the length of a spectrogram frame in terms of audio samples
frame_dur
: the length of a spectrogram frame in terms of seconds (can be float
or fractions.Fraction
)
fft_len
: the size of the FFT in terms of audio samples; using zero-padding if longer than frame_len
fft_dur
: the size of the FFT in terms of seconds
spect
: a tensor of spectrogram or mel/CQT spectrogram frames
frame_rate
: the rate of spectrogram frames (frames per second)
hop_len
: the number of samples between the starts of consecutive frames (i.e., frame_rate == sample_rate / hop_len
)
hop_dur
: the number of seconds between the starts of consecutive frames (i.e., frame_rate == 1 / hop_dur
)
num_bins
: the number of bins in a spectrogram
num_bands
: the number of bands in a mel/CQT spectrogram
min_freq
: the lowest frequency of the lowest band in a mel/CQT spectrogram
max_freq
: the highest frequency of the highest band in a mel/CQT spectrogram
sample_rate: the rate of audio samples (samples per second)
since sample
is only used here. Can we just do rate
?
otherwise: 👍
I would go for length
and duration
as dur
is not obvious.
I would not go for rate
as it is not obvious which rate is meant.
I would go for
length
andduration
asdur
is not obvious.
I like long names anyway. Others?
Can we just do
rate
?
Not easily, we need to discuss this.
rate
is ambiguous. If we compute a mel filterbank from min_freq
and max_freq
given in Hz, we need to know the sample rate (or the Nyquist frequency), but not the frame rate, although it's about computing a mel spectrogram.If we want to avoid sample_rate
and frame_rate
(you argued against using the term "frames" before), we would need something like input_rate
and output_rate
in functions that change rates. But what do we do for the mel filterbank?
We can also go for signal_rate
and spect_rate
. Sounds a bit weird, but would be consistent with the argument names.
Also note that if we just use rate
, we will use it throughout, when loading audio, when resampling it, whenever. Could be good, or could be unclear.
Agreed overall.
frame_duration
orframe_dur
but when do we use it?
Agreed overall.
But there were still some open questions.
frame_duration
orframe_dur
but when do we use it?
When we define a model that should be applicable to both 22 kHz and 44 kHz files, for example. We can either force users to do the conversion, or provide an interface that allows either lengths or durations.
When we define a model that should be applicable to both 22 kHz and 44 kHz files,
Wouldn't it be quite a high-level model than we'd add as a pytorch/audio?
On rate
vs sample_rate
: Seems like sample_rate
is getting the most votes.
So far we've been already adopting what @f0k summarized up there. One thing, that I once opened an separated issue about #46, is names for spectrograms.
spect
is not bad, but I'd like to suggest that we should specify little more. My suggestions are based on this idea:
mag
, phase
> real
--> because it provides more relevant info for the usage.waveforms
> x
--> more clarityThe variable names would be
mag_specgrams
for a batch of magnitude STFT, CQT, melspectrogramphase_specgrams
for a batch of phase of STFT, or all the other phasescomplex_specgrams
for a batch of complex STFT, or whatever else in the future.waveforms
for a batch of audio signals. I think when we're saying signal, it's almost always waveform. Nevertheless, signal
sounds a little vague. So why not making it sure.mag_cqts
would be preferred. This change would be applied for both argument names and variable names so that both APIs and codes would be easier to read, less error-prone, more intuitive.
One more thing.
num_bins
: the number of bins in a spectrogramnum_bands
: the number of bands in a mel/CQT spectrogram
I'd vote for num_freqs
for the number of (linear) frequency bins, num_mels
for the number of mel bins. In my recent PR (#58) I adopted it without checking this issue, but when reading the code I was quite confused because both "freq" and "mel" can be combined with "bins" and "bands".
num_freqs
:+1:
complex_specgrams
I still don't like this as mentioned earlier... i just had a new idea: what about tf
...
mag_tf
phase_tf
complex_tf
that could easily also fit for both STFT and MEL
umm, the word itself is not bad. but i'm little nervous about inventing a new one, because the naming was about things being readable without searching for the meaning at all.
~also it looks like tensorflow :P~
that said, yeah, your reason is totally valid. hm..
@faroit could you please refer to your earlier post as I was not able to find it in this thread.
I think spectrogram refers more to the visual representation of the actual time-frequency representation. Is this the reason you would prefer the term tf
?
I'm not completely convinced by tf
as
Is there a consensus on using specgram
and waveform
? I liked the shorter spect
and signal
better, but I can't argue against the longer names (except that they're longer).
I find tf
too cryptic, to me this is on the line of the variables in the title that we wanted to get away from.
When we define a model that should be applicable to both 22 kHz and 44 kHz files,
Wouldn't it be quite a high-level model than we'd add as a pytorch/audio?
Not sure. This is something that could be handled by the mel spectrogram computation, and if we claim this to be the responsibility of torchaudio, it should provide a way to do so.
I'd vote for
num_freqs
for the number of (linear) frequency bins,num_mels
for the number of mel bins.
I find num_freqs
a bit unintuitive, it sounds as if counting the frequencies, not the bins. But I agree that bins
might be unclear to somebody not used to work with spectrograms.
num_mels
is okay, but what would you replace num_bands
with? num_mels
is too specific as a replacement, it does not work for other filterbanks than the mel filterbank. num_filters
?
Can we try returning to doing complete proposals (even realizing that I'm the only one doing so)?
waveform
: a tensor of audio samples
waveforms
: a batch of the former
sample_rate
: the rate of audio samples (samples per second)
specgram
: a tensor of spectrogram or mel/CQT spectrogram frames
specgrams
: a batch of the former
mag_specgram
: specific spectrogram that only holds magnitudes
phase_specgram
: specific spectrogram that only holds phases
complex_specgram
: specific spectrogram that holds complex values
mel_specgram
: a mel spectrogram
mag_mel_specgram
: a mel spectrogram with magnitudes only
frame_length
: the length of a spectrogram frame in terms of audio samples
frame_duration
: the length of a spectrogram frame in terms of seconds (can be float
or fractions.Fraction
)
fft_length
: the size of the FFT in terms of audio samples; using zero-padding if longer than frame_length
fft_duration
: the size of the FFT in terms of seconds
frame_rate
: the rate of spectrogram frames (frames per second)
hop_length
: the number of samples between the starts of consecutive frames (i.e., frame_rate == sample_rate / hop_length
)
hop_duration
: the number of seconds between the starts of consecutive frames (i.e., frame_rate == 1 / hop_duration
)
num_bins
: the number of bins in a linear spectrogram (i.e., num_bins == fft_length // 2 + 1
)
num_bands
: the number of bands in a mel/CQT spectrogram
min_freq
: the lowest frequency of the lowest band in a mel/CQT spectrogram
max_freq
: the highest frequency of the highest band in a mel/CQT spectrogram
I just realized that I find num_freqs
a bit confusing in conjunction with min_freq
and max_freq
. So I went back to num_bins
. Better ideas?
Ok, we're getting there!
consensus on
specgrams
andwaveforms
I feel like there is a weak consensus. There's a sort of consensus, a bit weak that it wasn't finalized yet. ~Like Brexit~. If there's no other popular candidate, I think we're good to go.
mag_mel_specgram
: wouldn't it be (always) a magnitude one? hypothetically, it looks alright though. 👍
duration
I missed mentioning that it's great to have a dedicated word for things in seconds! 👍
num_freqs
vsnum_bins
Yeah, neither is perfect. I googled num_freqs
and their cousins, and I'm little more confident now that it's more specific/precise than _bins
. With a plural form, hopefully it's not too confusing with min_freq
and max_freq
?
num_mels
is too specific as a replacement,
Yes, so num_mels
is for the case when we're sure it's only about mel filterbanks. For more general filterbanks, I'm fine with any of num_bands
, num_filters
, num_filterbanks
.
Last question - n_
vs num_
. For a few reasons, I think it's not a bad idea to go back to n_
. 1) things are already quite verbose, so if we can make names slightly shorter why not. 2) with specifying other concepts using _duration
and _length
, it became clearer that n_
indicates number_of
. 3) the current pytorch/audio is using n_
, hence it's a minimal change.
mag_specgram:
specific spectrogram that only holds magnitudes
just wondered how we name outputs with power > 1
(such as the output of complex_norm(.., power=2)
:
psd_specgram
or power_specgram
?
I would vote for the latter
power_specgram
sounds good to me, too.
num_bins: the number of bins in a linear spectrogram (i.e., num_bins == fft_length // 2 + 1)
did we actually miss num_frames
for the time dimension in the listing above or have decide on something different?
Oh, i think we missed it. num_frames
seems fine for me.
okay, updated...
waveform
: a tensor of audio samples
waveforms
: a batch of the former
sample_rate
: the rate of audio samples (samples per second)
specgram
: a tensor of spectrogram or mel/CQT spectrogram frames
specgrams
: a batch of the former
mag_specgram
: specific spectrogram that only holds magnitudes
power_spectrogram
: specific spectrogram that holds energy such as mag_specgram**2
phase_specgram
: specific spectrogram that only holds phases
complex_specgram
: specific spectrogram that holds complex values
mel_specgram
: a mel spectrogram
mag_mel_specgram
: a mel spectrogram with magnitudes only
frame_length
: the length of a spectrogram frame in terms of audio samples
frame_duration
: the length of a spectrogram frame in terms of seconds (can be float
or fractions.Fraction
)
fft_length
: the size of the FFT in terms of audio samples; using zero-padding if longer than frame_length
fft_duration
: the size of the FFT in terms of seconds
frame_rate
: the rate of spectrogram frames (frames per second)
hop_length
: the number of samples between the starts of consecutive frames (i.e., frame_rate == sample_rate / hop_length
)
hop_duration
: the number of seconds between the starts of consecutive frames (i.e., frame_rate == 1 / hop_duration
)
num_bins
: the number of bins in a linear spectrogram (i.e., num_bins == fft_length // 2 + 1
)
num_frames
: the number of timesteps/frames in spectrogram
num_bands
: the number of bands in a mel/CQT spectrogram
min_freq
: the lowest frequency of the lowest band in a mel/CQT spectrogram
max_freq
: the highest frequency of the highest band in a mel/CQT spectrogram
as you all seem to like samples and frames, how do we name the number of samples in a batch? ;-)
num_freqs
vsnum_bins
We go for num_bins
? I still think it's fine to use num_freqs
and slightly clearer than num_bins
. bins
sounds like... histogram :P
mag_mel_specgram
: wouldn't it be (always) a magnitude one?
No, not necessarily. It depends on how you compute it. But most functions won't care, and will just call it mel_specgram
or specgram
.
For a few reasons, I think it's not a bad idea to go back to
n_
.
Fine with me. I'm biased because I use num_
in my own code, but I assume n_
won't be confusing. Other opinions?
Yeah, neither is perfect. I googled
num_freqs
and their cousins, and I'm little more confident now that it's more specific/precise than_bins
.
Nice idea. Ok, I'm fine with _freqs
!
as you all seem to like samples and frames, how do we name the number of samples in a batch?
batchsize
as usual :) You're right, this doesn't fit the remaining names, but that term is very common. I think it won't be a problem.
So:
waveform
: a tensor of audio samples
waveforms
: a batch of the former
sample_rate
: the rate of audio samples (samples per second)
batchsize
: the number of examples in a batch
specgram
: a tensor of spectrogram or mel/CQT spectrogram frames
specgrams
: a batch of the former
mag_specgram
: specific spectrogram that only holds magnitudes
power_spectrogram
: specific spectrogram that holds energy such as mag_specgram**2
phase_specgram
: specific spectrogram that only holds phases
complex_specgram
: specific spectrogram that holds complex values
mel_specgram
: a mel spectrogram
mag_mel_specgram
: specific mel spectrogram that only holds magnitudes
frame_length
: the length of a spectrogram frame in terms of audio samples
frame_duration
: the length of a spectrogram frame in terms of seconds (can be float
or fractions.Fraction
)
fft_length
: the size of the FFT in terms of audio samples; using zero-padding if longer than frame_length
fft_duration
: the size of the FFT in terms of seconds
frame_rate
: the rate of spectrogram frames (frames per second)
hop_length
: the number of samples between the starts of consecutive frames (i.e., frame_rate == sample_rate / hop_length
)
hop_duration
: the number of seconds between the starts of consecutive frames (i.e., frame_rate == 1 / hop_duration
)
n_freqs
: the number of bins in a linear spectrogram (i.e., n_freqs == fft_length // 2 + 1
)
n_frames
: the number of timesteps/frames in spectrogram
n_bands
: the number of bands in a mel/CQT spectrogram
min_freq
: the lowest frequency of the lowest band in a mel/CQT spectrogram
max_freq
: the highest frequency of the highest band in a mel/CQT spectrogram
Is num_freqs
really specific to linear-frequency spectrograms, and num_bands
to nonlinear-frequency spectrograms? Sometimes we may not want to distinguish the two, e.g., a HPSS or a crossfade shouldn't care. Off the top of my head, the only time we need to distinguish them will be when creating filter banks. In that case, we could call it n_freqs_in
and n_freqs_out
, or n_freqs
and n_filters
. Looking at the list, we may want to reduce the number of terms to learn. What do you think?
Originally posted by @f0k in https://github.com/keunwoochoi/torchaudio-contrib/pull/34