twosixlabs / armory

ARMORY Adversarial Robustness Evaluation Test Bed

Odd audio acquisition pipeline #591

Closed jesus-villalba closed 4 years ago

jesus-villalba commented 4 years ago

Hi, this is about the audio preprocessing function in the SincNet example. MITRE's slides indicate that adversarial examples would be 5 sec chunks, while LibriSpeech recordings are variable length with duration >= 1.5 sec.

However, in the SincNet example, you read the full utterance and then randomly take a single 375/2 = 187.5 ms frame, discarding the rest. That is less than a syllable (~250 ms).

I assume this odd pipeline is the result of a misunderstanding when reading Mirco's code. He trains his network on a frame-by-frame basis, and the forward function of his model only takes single-frame batches. But when he evaluates, he evaluates all the frames of the recording with a 10 ms shift; see lines 290-309 in https://github.com/hkakitani/SincNet/blob/master/speaker_id.py. He gets frame-level decisions and combines them to obtain an utterance-level decision in line 316: [val,best_class]=torch.max(torch.sum(pout,dim=0),0)
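For reference, a minimal sketch of that evaluation-time aggregation (illustrative only; `frame_model`, `frame_len=3000`, and the 160-sample shift assume 16 kHz audio and a per-frame classifier, but the final combination line mirrors the one quoted above):

```python
import torch

def utterance_decision(frame_model, waveform, frame_len=3000, shift=160):
    # waveform: 1-D tensor holding the full utterance
    # slice the utterance into overlapping frames with a 10 ms (160-sample) shift
    starts = range(0, waveform.shape[0] - frame_len + 1, shift)
    frames = torch.stack([waveform[s:s + frame_len] for s in starts])
    with torch.no_grad():
        pout = frame_model(frames)                # (num_frames, num_classes) posteriors
    # combine frame-level decisions into one utterance-level decision,
    # as in line 316 of Mirco's speaker_id.py
    val, best_class = torch.max(torch.sum(pout, dim=0), 0)
    return best_class
```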

It also puzzles me why MITRE hard-coded 8 kHz and 375 for LibriSpeech, considering that in Mirco's TIMIT example he used a 200 ms window and 16 kHz.

In summary, for evaluating our own models, we need to know whether we are allowed to use the full utterances, whether we should use 5 sec out of the full utterance as we inferred from MITRE's slides, or some other duration. This should be documented somewhere. From my point of view, using 187 ms is not realistic. Thanks.

davidslater commented 4 years ago

@hkakitani I'm including you as you were the main person working on the SincNet model. I don't recall the underlying reason here, but perhaps it was due to train/inference time differences?

hkakitani commented 4 years ago

Currently looking into how SincNet evaluates input and fixing the length of the adversarial examples.

As for the hard-coded sample rate and window length, that is based on Mirco's LibriSpeech example as seen here: https://github.com/mravanelli/SincNet/blob/master/cfg/SincNet_Librispeech.cfg#L11-L12.

davidslater commented 4 years ago

The main concern was not the training-time slicing, but the inference-time slicing (instead of using the entire input at inference time).

hkakitani commented 4 years ago

The difference between training-time slicing and inference-time slicing was my mistake; I misunderstood how SincNet was slicing the input for evaluation. The full audio should be used for inference, but the current version in Armory only uses a small section of the input.

yusong-tan commented 4 years ago

@jesus-villalba We are modifying the preprocessing to return, for each clip, an integer number of non-overlapping 187.5 ms segments, i.e., the whole clip minus some trailing samples. With that change, we also need to update our adversarial datasets so that whole clips are attacked instead of just a single 187.5 ms segment per clip. We will roll out the changes in the next Armory update.

jesus-villalba commented 4 years ago

Does that mean that the PyTorch model will get a matrix of shape (num_frames, 187.5*16)? How will you handle the fact that you have to set a fixed shape when you create the PyTorch wrapper? What I'm doing now is to fix the length to 5 sec; if the utterance is shorter, I tile the audio up to 5 sec, so I set the shape in the wrapper to (5*16000,). In my system, the frame slicing happens inside the forward function.
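For illustration, a rough sketch of the tiling described above (hypothetical helper; assumes 16 kHz audio so 5 s = 80000 samples):

```python
import numpy as np

def tile_to_five_seconds(audio, target_len=5 * 16000):
    # if the utterance is shorter than 5 s, repeat it until it is long enough,
    # then cut to exactly 5 s (80000 samples at 16 kHz)
    if len(audio) < target_len:
        reps = int(np.ceil(target_len / len(audio)))
        audio = np.tile(audio, reps)
    return audio[:target_len]
```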

davidslater commented 4 years ago

It requires different train and inference phase code, including an if/else statement in the forward function, and I think resetting the input shape when set_learning_phase is called (https://github.com/twosixlabs/armory/blob/master/armory/scenarios/audio_classification.py#L67) - setting it up in the classifier, not in the scenario.
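A minimal sketch of what such a forward function could look like (purely illustrative, not Armory's code; `SincNetWrapper`, `frame_model`, and the 3000-sample frame length are assumptions, and `self.training` is the standard PyTorch flag toggled by `model.train()`/`model.eval()`):

```python
import torch.nn as nn

class SincNetWrapper(nn.Module):
    def __init__(self, frame_model, frame_len=3000):
        super().__init__()
        self.frame_model = frame_model
        self.frame_len = frame_len

    def forward(self, x):
        if self.training:
            # training: x is a batch of single frames, shape (batch_size, frame_len)
            return self.frame_model(x)
        # inference: x is a batch of full-length utterances; slice each one into
        # non-overlapping frames, score them, and average the per-frame outputs
        # back to one prediction per utterance
        batch_size = x.shape[0]
        frames = x.unfold(1, self.frame_len, self.frame_len)         # (batch, n_frames, frame_len)
        out = self.frame_model(frames.reshape(-1, self.frame_len))   # (batch * n_frames, n_classes)
        return out.reshape(batch_size, -1, out.shape[-1]).mean(dim=1)
```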

yusong-tan commented 4 years ago

What we are going to do with audio is similar to how we're dealing with variable-length inputs in the video scenario.

The preprocessing will return a list of clips, where each clip will be represented as (variable_num_frames, 3000). The output list will have length = batch_size.

During training, one row of each clip will be randomly sampled in the scenario, so the input to the classifier will be an ndarray of shape (batch_size, 3000).

During inference, predictions will be run over all frames of each clip and averaged. In this phase, the input to the classifier will be of shape (variable_num_frames, 3000).

The PyTorch wrapper will still have an input shape of (3000,).
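Roughly, the scenario-side handling would look like this (an illustrative sketch only, not the actual Armory scenario code; `make_training_batch`, `predict_batch`, and `classifier.predict` are stand-ins for whatever the scenario actually calls):

```python
import numpy as np

# batch: a list of length batch_size, each element an ndarray of shape
# (variable_num_frames, 3000) returned by the preprocessing

def make_training_batch(batch):
    # training: randomly sample one row (frame) per clip -> (batch_size, 3000)
    return np.stack([clip[np.random.randint(len(clip))] for clip in batch])

def predict_batch(classifier, batch):
    # inference: score every frame of each clip, then average its predictions
    preds = []
    for clip in batch:                          # clip: (variable_num_frames, 3000)
        frame_preds = classifier.predict(clip)  # (variable_num_frames, num_classes)
        preds.append(frame_preds.mean(axis=0))
    return np.stack(preds)                      # (batch_size, num_classes)
```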

Does this address the concern?

jesus-villalba commented 4 years ago

So you will average the outputs of the frames in https://github.com/twosixlabs/armory/blob/master/armory/scenarios/audio_classification.py? And the batch_size parameter will only affect the training step, while in the test phase recordings will be evaluated one by one, with batch_size = num_frames? However, if a system already provides a single output per recording, the pipeline won't break, right?

yusong-tan commented 4 years ago

The batch_size parameter will affect both training and inference. On the inference side, we'll iterate over each clip in the batch, and each clip will have its predictions averaged over its own frames - but as far as the classifier is concerned, variable_num_frames replaces batch_size on axis=0.

As for a system that provides a single output per recording, are you referring to outputting a shape of (5*16000,) using your own preprocessing_fn? If so, to work with the updated scenario, it would be best to truncate it and reshape it to (N, 3000).
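For example, a small numpy sketch of that truncate-and-reshape (assuming the 5 s, 16 kHz clip described above; the random array is just a stand-in for real audio):

```python
import numpy as np

frame_len = 3000
audio = np.random.randn(5 * 16000)                     # stand-in for a 5 s clip at 16 kHz
n = len(audio) // frame_len                            # 26 whole frames
frames = audio[:n * frame_len].reshape(n, frame_len)   # (26, 3000)
```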

jesus-villalba commented 4 years ago

Yes, since my system averages the frame-level representations in the middle of the network, rather than at the output.

yusong-tan commented 4 years ago

Let me know if my understanding of your system is correct. Your system doesn't modify the classifier's inputs of shape (N, 3000). During inference, it averages mid-level representations so that the model outputs a single prediction per clip instead of N predictions? If this is correct, then the pipeline should still work, as averaging over a single prediction will just return that prediction.

jesus-villalba commented 4 years ago

That's correct.