Closed · MuruganR96 closed this issue 5 years ago
You are using the APIs in the wrong way.
I have updated the README.md with more detailed instructions.
For the integration test, label_to_center should be a dict from string to 1-d vectors, not to numbers.
Also, you are not supposed to directly apply UIS-RNN to audio. Instead you should apply it on speaker discriminative embeddings like d-vectors.
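A minimal sketch of what "dict from string to 1-d vectors" means (the labels and vector values here are illustrative, not taken from the library's test data):

```python
import numpy as np

# Hypothetical example: each speaker label (a string) maps to a 1-d
# center vector, NOT to a scalar number.
label_to_center = {
    "A": np.array([0.0, 0.0, 1.0]),
    "B": np.array([0.0, 1.0, 0.0]),
    "C": np.array([1.0, 0.0, 0.0]),
}

# Incorrect usage (mapping to plain numbers), which leads to
# dimension-mismatch errors during concatenation:
# label_to_center = {"A": 0, "B": 1, "C": 2}

for center in label_to_center.values():
    assert center.ndim == 1  # each value is a 1-d vector
```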
@wq2012 thank you. How do I pass the d-vector embeddings? I referred to this paper and GitHub repo:
https://arxiv.org/pdf/1710.10467.pdf https://github.com/HarryVolek/PyTorch_Speaker_Verification
but I was confused.
Here train_sequence should be a 2-dim numpy array of type float, for the concatenated observation sequences.
For speaker diarization, this could be the d-vector embeddings.
For example, if you have M training utterances, and each utterance is a sequence of L embeddings.
Each embedding is a vector of D numbers.
Then the shape of train_sequence is N * D, where N = M * L.
train_sequence: 2-dim numpy array of real numbers, size: N * D
- the training observation sequence.
N - summation of lengths of all utterances
D - observation dimension
We concatenate all training utterances into a single sequence.
Note that the order of entries within an utterance is preserved,
and all utterances are simply concatenated together.
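The shape arithmetic above can be sketched with stand-in data (the sizes M, L, D below are assumptions for illustration; real d-vectors would come from a speaker encoder):

```python
import numpy as np

# Illustrative sizes: M utterances, each a sequence of L embeddings,
# each embedding a vector of D numbers.
M, L, D = 3, 5, 256
utterances = [np.random.rand(L, D) for _ in range(M)]  # stand-in d-vectors

# Concatenate along the time axis; the order within each utterance is
# preserved, and utterances are simply stacked one after another.
train_sequence = np.concatenate(utterances, axis=0)

assert train_sequence.shape == (M * L, D)  # N * D, where N = M * L
```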
Regarding this statement: where do I do this, i.e. train UIS-RNN with a d-vector train_sequence? Using PyTorch_Speaker_Verification I created d-vector embeddings for the TIMIT dataset, but I don't know how to feed these d-vector embeddings into UIS-RNN.
We concatenate all training utterances into a single sequence.
I was confused by this line. Respected sir, what do you mean? How can I concatenate all training utterances into a single sequence?
I am not sure whether my understanding is fully correct; I am a beginner with this concept, sir.
Can you help me, sir? Thank you in advance for your response.
Concatenation means:
If:
train_sequence_1 = [E1, E2]
train_sequence_2 = [E3, E4, E5]
train_cluster_id_1 = ['1', '2']
train_cluster_id_2 = ['3', '4', '5']
Then:
train_sequence = [E1, E2, E3, E4, E5] # concatenated
train_cluster_id = ['1', '2', '3', '4', '5'] # concatenated
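The concatenation above can be reproduced with numpy (D is an assumed embedding dimension; E1..E5 are stand-in vectors):

```python
import numpy as np

D = 4  # hypothetical embedding dimension
E1, E2, E3, E4, E5 = (np.random.rand(D) for _ in range(5))

train_sequence_1 = np.stack([E1, E2])       # shape (2, D)
train_sequence_2 = np.stack([E3, E4, E5])   # shape (3, D)
train_cluster_id_1 = ['1', '2']
train_cluster_id_2 = ['3', '4', '5']

# Concatenate sequences along the time axis, and the label lists in the
# same order, so entry i of train_cluster_id labels row i of train_sequence.
train_sequence = np.concatenate([train_sequence_1, train_sequence_2], axis=0)
train_cluster_id = train_cluster_id_1 + train_cluster_id_2

assert train_sequence.shape == (5, D)
assert train_cluster_id == ['1', '2', '3', '4', '5']
```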
The reason that we concatenate is that we will be resampling training data and block-wise shuffling training data as a data augmentation process.
But yes, I admit this API is a little weird. We will change it in the future, as a long term plan.
About d-vector embeddings: we are not responsible for any third-party implementations.
Thank you so much, sir.
About d-vector embeddings: we are not responsible for any third-party implementations.
Then how can I generate d-vector embeddings, sir? Can you give me a hint on how to construct them? Is the repo above useful for this or not, sir?
I think I am now very clear about the UIS-RNN API and the architecture as well,
but I cannot take the next step, because of constructing and initializing the d-vector embeddings.
If you are willing to help, please give your suggestions, sir.
Thank you very much for your response, sir.
Glad that the UIS-RNN API is clear to you.
You can use any third-party implementation of d-vector embeddings, or similar techniques like x-vectors from JHU. But we are not responsible for the quality of them. You need to directly ask the authors of those libraries on how to use them.
Some of the libraries are only able to produce per-utterance d-vector embeddings, while for UIS-RNN, we require continuous d-vector embeddings (as sequences). We have no guarantee which third-party library supports this. You need to do your own research here.
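The "continuous embeddings as sequences" requirement can be sketched as follows. This is only an illustration of the windowing idea, under the assumption that you have some speaker encoder; `toy_encoder` below is a fake stand-in, not a real d-vector model, and `continuous_embeddings` is a hypothetical helper, not part of the UIS-RNN library:

```python
import numpy as np

def continuous_embeddings(audio, window_size, hop, encoder):
    """Embed each sliding window of the waveform, producing a sequence of
    embeddings (one per window) rather than a single per-utterance one."""
    windows = [audio[start:start + window_size]
               for start in range(0, len(audio) - window_size + 1, hop)]
    return np.stack([encoder(w) for w in windows])

# Toy stand-in encoder (mean and std of the window). A real system would
# call a trained speaker encoder (e.g. a d-vector or x-vector model) here.
toy_encoder = lambda w: np.array([w.mean(), w.std()])

seq = continuous_embeddings(np.arange(16000, dtype=float), 400, 160, toy_encoder)
assert seq.ndim == 2  # shape (num_windows, D): a sequence, ready to concatenate
```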
This GitHub repo is for the UIS-RNN library only.
@wq2012 I recorded myself and three different speakers: 4-second, mono-channel, 16 kHz WAV files. Note: are there any restrictions on audio duration, format, or size? I passed my audio file array as train_data, as well as test_data,
ValueError: all the input array dimensions except for the concatenation axis must match exactly
result = np.concatenate((result, label_to_center[id]))
and then it shows this error:
ValueError: not enough values to unpack (expected 2, got 1)
My audio sequence shape: np.shape(np.array(a[1], dtype=float)) --> (63488,); np.array(a[1], dtype=float) --> [ 0. 0. -2. ... 700. 687. 679.]
UIS-RNN ./data/training_data.npz sequence shape: np.shape(sequence[sampled_idx_sets[j], :]) --> (39, 256)
If the numpy shape were the only issue, it would be resolved automatically by utils.py, but execution never reaches the utils.resize_sequence function.
The issue with my audio sequence:
train_sequence.shape --> (63906800,)
train_total_length, observation_dim = train_sequence.shape
ValueError: not enough values to unpack (expected 2, got 1)
How do I resolve this issue, @wq2012 sir? Thanks in advance.
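A minimal reproduction of the error above: it occurs because a raw 1-d waveform is passed where UIS-RNN expects a 2-d (N, D) array of embeddings, so the tuple unpack of `.shape` fails (the shapes below mirror the ones reported in this thread; the fix is to run the audio through a speaker encoder first, not to reshape the waveform):

```python
import numpy as np

raw_audio = np.zeros(63488)            # 1-d waveform: shape (63488,)
try:
    n, d = raw_audio.shape             # shape tuple has only one element
except ValueError as e:
    print(e)                           # not enough values to unpack (expected 2, got 1)

embeddings = np.zeros((39, 256))       # 2-d (N, D) d-vector sequence
n, d = embeddings.shape                # unpacks cleanly: N = 39, D = 256
assert (n, d) == (39, 256)
```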