keunwoochoi / torchaudio-contrib

A test bed for updates and new features | pytorch/audio
169 stars 22 forks source link

Parameter names: sr, n_mels, n_stft, f_max, f_min #36

Open ksanjeevan opened 5 years ago

ksanjeevan commented 5 years ago

I always find shortened parameter names a bit cryptic (sr, n_mels, n_stft, f_max, f_min). Do we want to follow librosa here? If we do, we should do it correctly; if we don't, we can choose more verbose names (sample_rate, fft_bins, mel_bands, min_freq, max_freq, although mel_bands/n_mels is specific to the type of filterbank, and should maybe be called num_bands/n_bands to be the same for every filter bank subclass).

I also find it convenient sometimes to make the sample rate optional, and expect min_freq and max_freq to be given relative to the Nyquist rate if the sample rate is not given; like scipy.signal does it: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.firwin.html#scipy.signal.firwin

Originally posted by @f0k in https://github.com/keunwoochoi/torchaudio-contrib/pull/34

hagenw commented 5 years ago

I would be in favor of more verbose names and am not a big fan of the current librosa argument names.

f0k commented 5 years ago

I would be in favor of more verbose names and am not a big fan of the current librosa argument names.

Great. Also note the first goal in the README: "Clear, readible names for class/functions/arguments"

Let's collect some suggestions then. Collecting things from my own code and making them more uniform or more consistent with torch.stft: samples: a tensor of audio samples sample_rate: the rate of audio samples (samples per second) spect: a tensor of spectrogram or mel/CQT spectrogram frames frame_len frame_length: the length of a spectrogram frame in terms of audio samples fps frame_rate: the rate of spectrogram frames (frames per second) hop_size hop_length: the number of samples between the starts of consecutive frames (i.e., frame_rate == sample_rate / hop_length) num_bins: the number of bins in a spectrogram num_bands: the number of bands in a mel/CQT spectrogram min_freq: the lowest frequency of the lowest band in a mel/CQT spectrogram max_freq: the highest frequency of the highest band in a mel/CQT spectrogram

hop_length was taken from torch.stft. frame_length was for consistency with frame_rate, which I find a more important concept than the hop_length, since it's independent of the sample rate. torch.stft calls it win_length. If we go with win_length, we'll have win_rate. Not so nice.

torch.stft has n_fft. I would call this fft_length, and make frame_length the primary parameter to set, with fft_length defaulting to match it, rather than the other way round. I'd then define num_bins = fft_length // 2 + 1.

(If we have sample_rate for samples, then frame_rate should be for frames, not for spect, but that's maybe too generic then...)

I'm not sure if num_bins and num_bands should be separate, since they're both describing part of the shape of a spectrogram. But if both of them are called num_bands, we may run into problems naming the arguments for frequency-domain filterbanks?

Oh, another thing: We may want to differentiate between lengths in terms of discrete samples/frames, and lengths in terms of seconds. I would suggest size for the former, and length for the latter, but then we'd move away from torch.stft. Is this worth it? Since we'll provide our own wrapper for torch.stft in order to support higher-dimensional tensors, we'd be free to change argument names.

Discuss! :)

hagenw commented 5 years ago

Some comments on a few of them: samples: this could be confused with number_of_samples. A better name might be signal, but it might depend on the actual class, method where you using it. sample_rate: I never can decide between sampling rate and sample rate, but I guess the seconds wins as it is shorter?

f0k commented 5 years ago

samples: this could be confused with number_of_samples

I think the latter should be num_samples with analogy to num_bins, num_bands. If we rename samples to signal (which I'd also be okay with, although I find samples more explicit), then we still shouldn't shorten num_samples to just samples because of the same ambiguity.

I never can decide between sampling rate and sample rate, but I guess the seconds wins as it is shorter?

Yes, interesting. So that's something that sr avoids :) Another argument for sample_rate would be the consistency with samples and num_samples, and with frame_rate (not framing_rate).

hagenw commented 5 years ago

Totally agree, I would not propose to shorten num_samples to samples. The proposal was more to not use samples at all as new and non-familiar users might confuse it.

I see the point that sample_rate fits better with frame_rate, I was just wondering if "sample rate" or "sampling rate" is better English ;)

f0k commented 5 years ago

The proposal was more to not use samples at all as new and non-familiar users might confuse it.

Ok. Another confusion might be with training samples / training examples. I'd also be okay with signal and signal_size, signal_length. Hmm, signal sounds like something continuous and eternal, but here we deal with discretized, finite excerpts. For this reason, I still like samples and num_samples, but then what's the name for the length of samples in terms of seconds?

Oh, these name discussions :) They can take just as long as defining the function signatures, but I think they are very important to design a good API.

faroit commented 5 years ago

Another confusion might be with training samples / training examples.

I don't really want to go deeply into your name discussion here... but I would be very much in favor of not using samples here at all, as well.

So often it happened that I supervised people in ML+Audio and they always were confused by what we audio-heads are calling samples.

If we need vocabulary for sequential samples or frames I would just leave this to timesteps closely following the RNN syntax.

Also, additionally avoiding frames is :+1: because even in DSP people can't agree on the definitions (spectrogram frames vs. samples*channels in the codec world).

keunwoochoi commented 5 years ago

timesteps

Does it mean we're gonna use something like timestep_rate for sample rate? My opinion is to keep using sample and sample_rate, feeling like giving up those names are a bit of stretch. LET'S BE AUDIOHEADS.

size vs len(gth)

How about n_ or num_ for discrete (instead of size) and use len(gth) for things in [second]? size feels like something potentially multi-dimensional. And -- stuff like n_fft, frame_len would be changed accordingly, right? One more good thing about this is, on n_fft issue, we can keep using n_fft so that it's matched to torch.stft, which is great. num_fft would be also good, at least that's pretty similar to n_fft, no one would get confused.

f0k commented 5 years ago

I don't really want to go deeply into your name discussion here...

Oh, please do! Names are important and hard to change.

but I would be very much in favor of not using samples here at all, as well.

signal it is, then? Another advantage of signal is that you can still form a plural of it, this doesn't work with samples.

timesteps

Does it mean we're gonna use something like timestep_rate for sample rate?

As I understood, this only applies when we need a generic term that encompasses both. For functions that clearly only take raw audio samples, or clearly only take spectrograms, we would use the more specific terms for clarity. I'm not sure if timesteps is a good general term, to me it sounds like it would only refer to time stamps, not the associated data. And instead of timestep_rate, I'd lean towards only using rate when we need a generalization of sample_rate and frame_rate.

Also, additionally avoiding frames is +1 because even in DSP people can't agree on the definitions

So spect is good, but what about frame_rate?

size feels like something potentially multi-dimensional.

Mhm, true, torch.Tensor.size()... but that also allows you to give the size of a single dimension. I think the dimension would be clear from the context. Do you have a counter-example?

How about n_ or num_ for discrete (instead of size) and use len(gth) for things in [second]?

But num_ is a prefix, and length doesn't work as a prefix. We could use _num as a suffix. But even then, length and size are exchangeable in a way that length and num aren't: frame_length - frame_size - frame_num? signal_length - signal_size - signal_num? spect_length - spect_size - spect_num? But I see that spect_size could refer to the two-dimensional shape. More ideas?

And -- stuff like n_fft, frame_len would be changed accordingly, right?

Yes, that was the idea, we'd find a consistent scheme for naming things and use it throughout.

One more good thing about this is, on n_fft issue, we can keep using n_fft so that it's matched to torch.stft, which is great.

I don't think we need to match torch.stft. Users of torchaudio will not need to use torch.stft directly, and may not even know about it. Also, the argument names of torch.stft are the way they are just because they were changed to follow librosa; before pytorch/pytorch#9308 it was called fft_size.

f0k commented 5 years ago

New idea:

I like _len for its correspondence to len(), but I don't really like _dur because it's slightly cryptic. On the other hand, I assume many people think _duration is too long, even in times of autocompletion?

Complete next proposal: signal: a tensor of audio samples sample_rate: the rate of audio samples (samples per second) frame_len: the length of a spectrogram frame in terms of audio samples frame_dur: the length of a spectrogram frame in terms of seconds (can be float or fractions.Fraction) fft_len: the size of the FFT in terms of audio samples; using zero-padding if longer than frame_len fft_dur: the size of the FFT in terms of seconds spect: a tensor of spectrogram or mel/CQT spectrogram frames frame_rate: the rate of spectrogram frames (frames per second) hop_len: the number of samples between the starts of consecutive frames (i.e., frame_rate == sample_rate / hop_len) hop_dur: the number of seconds between the starts of consecutive frames (i.e., frame_rate == 1 / hop_dur) num_bins: the number of bins in a spectrogram num_bands: the number of bands in a mel/CQT spectrogram min_freq: the lowest frequency of the lowest band in a mel/CQT spectrogram max_freq: the highest frequency of the highest band in a mel/CQT spectrogram

faroit commented 5 years ago

sample_rate: the rate of audio samples (samples per second)

since sample is only used here. Can we just do rate?

otherwise: 👍

hagenw commented 5 years ago

I would go for length and duration as dur is not obvious. I would not go for rate as it is not obvious which rate is meant.

f0k commented 5 years ago

I would go for length and duration as dur is not obvious.

I like long names anyway. Others?

Can we just do rate?

Not easily, we need to discuss this.

  1. There are functions that need both. If we compute a spectrogram, we may need to give both the input sample rate and the output frame rate. (I imagine that many arguments can be used interchangeably, whatever is more convenient for the use case, so we may need both rates to convert durations to lengths.)
  2. There are functions where rate is ambiguous. If we compute a mel filterbank from min_freq and max_freq given in Hz, we need to know the sample rate (or the Nyquist frequency), but not the frame rate, although it's about computing a mel spectrogram.

If we want to avoid sample_rate and frame_rate (you argued against using the term "frames" before), we would need something like input_rate and output_rate in functions that change rates. But what do we do for the mel filterbank? We can also go for signal_rate and spect_rate. Sounds a bit weird, but would be consistent with the argument names. Also note that if we just use rate, we will use it throughout, when loading audio, when resampling it, whenever. Could be good, or could be unclear.

keunwoochoi commented 5 years ago

Agreed overall.

frame_duration or frame_dur

but when do we use it?

f0k commented 5 years ago

Agreed overall.

But there were still some open questions.

frame_duration or frame_dur

but when do we use it?

When we define a model that should be applicable to both 22 kHz and 44 kHz files, for example. We can either force users to do the conversion, or provide an interface that allows either lengths or durations.

keunwoochoi commented 5 years ago

When we define a model that should be applicable to both 22 kHz and 44 kHz files,

Wouldn't it be quite a high-level model than we'd add as a pytorch/audio?

keunwoochoi commented 5 years ago

On rate vs sample_rate: Seems like sample_rate is getting the most votes.

keunwoochoi commented 5 years ago

So far we've been already adopting what @f0k summarized up there. One thing, that I once opened an separated issue about #46, is names for spectrograms.

spect is not bad, but I'd like to suggest that we should specify little more. My suggestions are based on this idea:

The variable names would be

This change would be applied for both argument names and variable names so that both APIs and codes would be easier to read, less error-prone, more intuitive.

keunwoochoi commented 5 years ago

One more thing.

num_bins: the number of bins in a spectrogram num_bands: the number of bands in a mel/CQT spectrogram

I'd vote for num_freqs for the number of (linear) frequency bins, num_mels for the number of mel bins. In my recent PR (#58) I adopted it without checking this issue, but when reading the code I was quite confused because both "freq" and "mel" can be combined with "bins" and "bands".

faroit commented 5 years ago

num_freqs

:+1:

complex_specgrams

I still don't like this as mentioned earlier... i just had a new idea: what about tf...

mag_tf phase_tf complex_tf

that could easily also fit for both STFT and MEL

keunwoochoi commented 5 years ago

umm, the word itself is not bad. but i'm little nervous about inventing a new one, because the naming was about things being readable without searching for the meaning at all.

~also it looks like tensorflow :P~

that said, yeah, your reason is totally valid. hm..

hagenw commented 5 years ago

@faroit could you please refer to your earlier post as I was not able to find it in this thread.

I think spectrogram refers more to the visual representation of the actual time-frequency representation. Is this the reason you would prefer the term tf?

I'm not completely convinced by tf as

f0k commented 5 years ago

Is there a consensus on using specgram and waveform? I liked the shorter spect and signal better, but I can't argue against the longer names (except that they're longer).

I find tf too cryptic, to me this is on the line of the variables in the title that we wanted to get away from.

When we define a model that should be applicable to both 22 kHz and 44 kHz files,

Wouldn't it be quite a high-level model than we'd add as a pytorch/audio?

Not sure. This is something that could be handled by the mel spectrogram computation, and if we claim this to be the responsibility of torchaudio, it should provide a way to do so.

I'd vote for num_freqs for the number of (linear) frequency bins, num_mels for the number of mel bins.

I find num_freqs a bit unintuitive, it sounds as if counting the frequencies, not the bins. But I agree that bins might be unclear to somebody not used to work with spectrograms. num_mels is okay, but what would you replace num_bands with? num_mels is too specific as a replacement, it does not work for other filterbanks than the mel filterbank. num_filters?

Can we try returning to doing complete proposals (even realizing that I'm the only one doing so)? waveform: a tensor of audio samples waveforms: a batch of the former sample_rate: the rate of audio samples (samples per second) specgram: a tensor of spectrogram or mel/CQT spectrogram frames specgrams: a batch of the former mag_specgram: specific spectrogram that only holds magnitudes phase_specgram: specific spectrogram that only holds phases complex_specgram: specific spectrogram that holds complex values mel_specgram: a mel spectrogram mag_mel_specgram: a mel spectrogram with magnitudes only frame_length: the length of a spectrogram frame in terms of audio samples frame_duration: the length of a spectrogram frame in terms of seconds (can be float or fractions.Fraction) fft_length: the size of the FFT in terms of audio samples; using zero-padding if longer than frame_length fft_duration: the size of the FFT in terms of seconds frame_rate: the rate of spectrogram frames (frames per second) hop_length: the number of samples between the starts of consecutive frames (i.e., frame_rate == sample_rate / hop_length) hop_duration: the number of seconds between the starts of consecutive frames (i.e., frame_rate == 1 / hop_duration) num_bins: the number of bins in a linear spectrogram (i.e., num_bins == fft_length // 2 + 1) num_bands: the number of bands in a mel/CQT spectrogram min_freq: the lowest frequency of the lowest band in a mel/CQT spectrogram max_freq: the highest frequency of the highest band in a mel/CQT spectrogram

I just realized that I find num_freqs a bit confusing in conjunction with min_freq and max_freq. So I went back to num_bins. Better ideas?

keunwoochoi commented 5 years ago

Ok, we're getting there!

consensus on specgrams and waveforms

I feel like there is a weak consensus. There's a sort of consensus, a bit weak that it wasn't finalized yet. ~Like Brexit~. If there's no other popular candidate, I think we're good to go.

mag_mel_specgram

: wouldn't it be (always) a magnitude one? hypothetically, it looks alright though. 👍

duration

I missed mentioning that it's great to have a dedicated word for things in seconds! 👍

num_freqs vs num_bins

Yeah, neither is perfect. I googled num_freqs and their cousins, and I'm little more confident now that it's more specific/precise than _bins. With a plural form, hopefully it's not too confusing with min_freq and max_freq?

num_mels is too specific as a replacement,

Yes, so num_mels is for the case when we're sure it's only about mel filterbanks. For more general filterbanks, I'm fine with any of num_bands, num_filters, num_filterbanks.


Last question - n_ vs num_. For a few reasons, I think it's not a bad idea to go back to n_. 1) things are already quite verbose, so if we can make names slightly shorter why not. 2) with specifying other concepts using _duration and _length, it became clearer that n_ indicates number_of. 3) the current pytorch/audio is using n_, hence it's a minimal change.

faroit commented 5 years ago

mag_specgram: specific spectrogram that only holds magnitudes

just wondered how we name outputs with power > 1 (such as the output of complex_norm(.., power=2):

psd_specgram or power_specgram?

I would vote for the latter

keunwoochoi commented 5 years ago

power_specgram sounds good to me, too.

faroit commented 5 years ago

num_bins: the number of bins in a linear spectrogram (i.e., num_bins == fft_length // 2 + 1)

did we actually miss num_frames for the time dimension in the listing above or have decide on something different?

keunwoochoi commented 5 years ago

Oh, i think we missed it. num_frames seems fine for me.

faroit commented 5 years ago

okay, updated...

waveform: a tensor of audio samples waveforms: a batch of the former sample_rate: the rate of audio samples (samples per second) specgram: a tensor of spectrogram or mel/CQT spectrogram frames specgrams: a batch of the former mag_specgram: specific spectrogram that only holds magnitudes power_spectrogram: specific spectrogram that holds energy such as mag_specgram**2 phase_specgram: specific spectrogram that only holds phases complex_specgram: specific spectrogram that holds complex values mel_specgram: a mel spectrogram mag_mel_specgram: a mel spectrogram with magnitudes only frame_length: the length of a spectrogram frame in terms of audio samples frame_duration: the length of a spectrogram frame in terms of seconds (can be float or fractions.Fraction) fft_length: the size of the FFT in terms of audio samples; using zero-padding if longer than frame_length fft_duration: the size of the FFT in terms of seconds frame_rate: the rate of spectrogram frames (frames per second) hop_length: the number of samples between the starts of consecutive frames (i.e., frame_rate == sample_rate / hop_length) hop_duration: the number of seconds between the starts of consecutive frames (i.e., frame_rate == 1 / hop_duration) num_bins: the number of bins in a linear spectrogram (i.e., num_bins == fft_length // 2 + 1) num_frames: the number of timesteps/frames in spectrogram num_bands: the number of bands in a mel/CQT spectrogram min_freq: the lowest frequency of the lowest band in a mel/CQT spectrogram max_freq: the highest frequency of the highest band in a mel/CQT spectrogram

as you all seem to like samples and frames, how do we name the number of samples in a batch? ;-)

keunwoochoi commented 5 years ago

num_freqs vs num_bins

We go for num_bins? I still think it's fine to use num_freqs and slightly clearer than num_bins. bins sounds like... histogram :P

f0k commented 5 years ago

mag_mel_specgram

: wouldn't it be (always) a magnitude one?

No, not necessarily. It depends on how you compute it. But most functions won't care, and will just call it mel_specgram or specgram.

For a few reasons, I think it's not a bad idea to go back to n_.

Fine with me. I'm biased because I use num_ in my own code, but I assume n_ won't be confusing. Other opinions?

Yeah, neither is perfect. I googled num_freqs and their cousins, and I'm little more confident now that it's more specific/precise than _bins.

Nice idea. Ok, I'm fine with _freqs!

as you all seem to like samples and frames, how do we name the number of samples in a batch?

batchsize as usual :) You're right, this doesn't fit the remaining names, but that term is very common. I think it won't be a problem.

So:

waveform: a tensor of audio samples waveforms: a batch of the former sample_rate: the rate of audio samples (samples per second) batchsize: the number of examples in a batch specgram: a tensor of spectrogram or mel/CQT spectrogram frames specgrams: a batch of the former mag_specgram: specific spectrogram that only holds magnitudes power_spectrogram: specific spectrogram that holds energy such as mag_specgram**2 phase_specgram: specific spectrogram that only holds phases complex_specgram: specific spectrogram that holds complex values mel_specgram: a mel spectrogram mag_mel_specgram: specific mel spectrogram that only holds magnitudes frame_length: the length of a spectrogram frame in terms of audio samples frame_duration: the length of a spectrogram frame in terms of seconds (can be float or fractions.Fraction) fft_length: the size of the FFT in terms of audio samples; using zero-padding if longer than frame_length fft_duration: the size of the FFT in terms of seconds frame_rate: the rate of spectrogram frames (frames per second) hop_length: the number of samples between the starts of consecutive frames (i.e., frame_rate == sample_rate / hop_length) hop_duration: the number of seconds between the starts of consecutive frames (i.e., frame_rate == 1 / hop_duration) n_freqs: the number of bins in a linear spectrogram (i.e., n_freqs == fft_length // 2 + 1) n_frames: the number of timesteps/frames in spectrogram n_bands: the number of bands in a mel/CQT spectrogram min_freq: the lowest frequency of the lowest band in a mel/CQT spectrogram max_freq: the highest frequency of the highest band in a mel/CQT spectrogram

Is num_freqs really specific to linear-frequency spectrograms, and num_bands to nonlinear-frequency spectrograms? Sometimes we may not want to distinguish the two, e.g., a HPSS or a crossfade shouldn't care. Off the top of my head, the only time we need to distinguish them will be when creating filter banks. In that case, we could call it n_freqs_in and n_freqs_out, or n_freqs and n_filters. Looking at the list, we may want to reduce the number of terms to learn. What do you think?