taylorlu / Speaker-Diarization

speaker diarization by uis-rnn and speaker embedding by vgg-speaker-recognition
Apache License 2.0

What does the uisrnn pytorch model output exactly and which variable holds that output? #43

Open Harry-Garrison opened 3 years ago

Harry-Garrison commented 3 years ago

Excuse my ignorance, but I am trying to wrap my head around the inner workings of the uisrnn model and I am stuck. More specifically, I would like to know what the model outputs when it receives the VGG speaker embedding features. I struggle with this because of the way the model is structured: it all seems to be one continuous process, and I cannot tell where the PyTorch model's job starts and where it finishes. I looked into the uisrnn script and tried to trace the order in which the functions are executed, but without success. To my understanding, the model outputs a sequence of "states" which are then processed and scored by a beam search algorithm; the scores are then fed back into the model and the process repeats until some stopping condition is reached (though I am not sure which).
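To make the question concrete, here is a tiny, self-contained toy that captures my current mental model of the loop. Everything in it (`toy_predict_mean`, `toy_beam_search`, the squared-distance scoring, the new-cluster penalty) is made up for illustration; it is not the uisrnn code, just the pattern I think I am seeing:

```python
# Toy sketch of my mental model; NOT the uisrnn code.
import numpy as np

def toy_predict_mean(history):
    """Stand-in for the RNN: predict a cluster's next observation as the
    running average of the observations already assigned to it."""
    return np.mean(history, axis=0)

def toy_beam_search(observations, num_beams=4, new_cluster_penalty=2.0):
    # Each beam is (log_score, labels, clusters), with clusters: id -> list of observations.
    beams = [(0.0, [], {})]
    for obs in observations:
        candidates = []
        for score, labels, clusters in beams:
            # Option 1: assign obs to each existing cluster, scored by how close
            # it is to that cluster's predicted mean.
            for cid, history in clusters.items():
                mean = toy_predict_mean(history)
                new_clusters = {k: list(v) for k, v in clusters.items()}
                new_clusters[cid].append(obs)
                candidates.append(
                    (score - float(np.sum((obs - mean) ** 2)), labels + [cid], new_clusters))
            # Option 2: open a new cluster, with a fixed penalty.
            cid = len(clusters)
            new_clusters = {k: list(v) for k, v in clusters.items()}
            new_clusters[cid] = [obs]
            candidates.append((score - new_cluster_penalty, labels + [cid], new_clusters))
        # Keep only the best-scoring beams.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams[0][1]  # label sequence of the best beam

# Two well-separated "speakers" in 2-D, just to show the shape of the result.
segments = [np.array([0.0, 0.0]), np.array([0.1, 0.0]),
            np.array([5.0, 5.0]), np.array([0.0, 0.1])]
print(toy_beam_search(segments))  # -> [0, 0, 1, 0]
```

If the real inference loop has roughly this shape, then my question reduces to: which part is played by the PyTorch model (presumably the equivalent of `toy_predict_mean` here), and which variable carries its output back into the scoring?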

Figuring out what the model does with the VGG speaker embeddings it receives is challenging, to say the least. The problem is that I do not know where to start. Which parts of the inference process depend solely on the PyTorch model, and which parts of the code handle the rest (beam states, scoring, etc.)? Which part of the uisrnn script is responsible for processing the VGG embeddings, and which variable holds the results?

So far I have figured out the following:

This code creates the uisrnn model object in memory and loads the saved weights:

```python
# Parse the default arguments, set the observation dimension to match the
# 512-dim VGG speaker embeddings, build the model and restore the weights.
model_args, _, inference_args = uisrnn.parse_arguments()
model_args.observation_dim = 512
uisrnnModel = uisrnn.UISRNN(model_args)
uisrnnModel.load(SAVED_MODEL_NAME)
```

This snippet runs inference on the features (embeddings):

```python
predicted_label = uisrnnModel.predict([feats], inference_args)
```
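For what it is worth, my working assumption about the return value is the following; the shape and label format here are my guess, not something I have verified against the code:

```python
# Assumption (unverified): predict() returns integer cluster IDs, one per input
# embedding, so the only externally visible "output" is a label sequence such as
# [0, 0, 1, 1, 0, 2, ...]; the means and hidden states stay internal to predict().
print(type(predicted_label), len(predicted_label))
print(predicted_label)  # expecting integer speaker labels here
```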

Now comes the hard part:

In the uisrnn script there is the following function:

```python
def predict_single(self, test_sequence, args):
    ...
```

From that point on I have no idea what is going on. What does the model output after it has received the features, and at which point in the code do we get the result of that computation? Is it the mean and hidden variables returned here:


```python
class CoreRNN(nn.Module):
  """The core Recurrent Neural Network used by UIS-RNN."""

  def __init__(self, input_dim, hidden_size, depth, observation_dim, dropout=0):
    super(CoreRNN, self).__init__()
    ...

  def forward(self, input_seq, hidden=None):
    ...
    return mean, hidden
```

or is it something else? Most importantly, is the model fed only the features, or the beam states too? I am quite confused.
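One thing I am planning to try in order to answer this myself is to attach a forward hook to the inner PyTorch module and log what actually goes in and out during predict(). Note that the attribute name `rnn_model` is just my guess at where the CoreRNN instance lives on the UISRNN object, so please correct me if that is wrong:

```python
import torch

def log_core_rnn_io(module, inputs, outputs):
    # inputs is whatever forward() received; outputs should be (mean, hidden)
    # if my reading of CoreRNN above is right.
    print('inputs :', [tuple(x.shape) for x in inputs if isinstance(x, torch.Tensor)])
    mean, hidden = outputs
    print('outputs: mean', tuple(mean.shape),
          'hidden', tuple(hidden.shape) if isinstance(hidden, torch.Tensor) else type(hidden))

# Assumption: the UISRNN object keeps its CoreRNN as `rnn_model` (unverified).
if hasattr(uisrnnModel, 'rnn_model') and isinstance(uisrnnModel.rnn_model, torch.nn.Module):
    handle = uisrnnModel.rnn_model.register_forward_hook(log_core_rnn_io)
    predicted_label = uisrnnModel.predict([feats], inference_args)
    handle.remove()
```

If that hook only ever sees embedding-shaped tensors plus a hidden state, that would already answer the "features only vs. beam states" part of my question.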

Understanding how exactly this code works could help with a variety of tasks, such as improving the code or converting the PyTorch model to another format in a more modular fashion. Any help is greatly appreciated.