Closed hcfeng201 closed 5 years ago
backward is the step that computes the gradient, so it's supposed to be the most expensive step.
The cost of backward is not tied to the size of the dataset, but to the size of the input.
You can try to break the dataset into subsets and call the fit function multiple times, as suggested in README.md.
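The suggestion above can be sketched roughly as follows. This is only a minimal sketch, not code from the library: `fit_in_chunks` and `chunk_size` are hypothetical names, and it assumes that `model.fit(train_sequences, train_cluster_ids, args)` accepts a list of sequences together with a matching list of cluster ID lists, and that it can be called repeatedly on the same model.

```python
def fit_in_chunks(model, sequences, cluster_ids, training_args, chunk_size):
    """Call model.fit once per chunk of whole sequences (hypothetical helper).

    sequences:   list of per-utterance observation sequences
    cluster_ids: list of per-utterance label lists, same order as sequences
    Returns the number of fit calls that were made.
    """
    if len(sequences) != len(cluster_ids):
        raise ValueError("sequences and cluster_ids must have the same length")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    num_calls = 0
    for start in range(0, len(sequences), chunk_size):
        stop = start + chunk_size
        # Assumption: fit() takes (sequences, cluster_ids, args) and keeps
        # training the same model when called again.
        model.fit(sequences[start:stop], cluster_ids[start:stop], training_args)
        num_calls += 1
    return num_calls
```

Each call then sees a smaller input, which is the quantity the backward step depends on according to the reply above.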
Sir, I split the dataset into 2 subsets. Each subset contains 41000 elements, which is smaller than the test train_sequence you provided (47350). But it still takes 8 seconds per iteration, whereas with the data you provided it takes less than a second.
Is 41000 the number of time steps, or the feature dimension? If the former, what is your feature dimension?
Also, did you normalize the features before you feed them into uisrnn? If not, what's the range of your features?
41000 is the feature dimension of a 2-dim numpy array of shape (41000, 256), like your data of shape (47350, 256). By normalization, do you mean "The embedding vector (d-vector) is defined as the L2 normalization of the network output"? I extracted the d-vectors with "PyTorch_Speaker_Verification", so I think the normalization is already done.
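One way to check this rather than assume it is sketched below. It is a minimal sketch under stated assumptions: `summarize_embeddings` and `l2_normalize` are hypothetical helper names, and the array is assumed to be laid out as (time steps, feature dimension), so that for a shape like (41000, 256) the first axis counts observations and 256 is the feature dimension.

```python
import numpy as np


def summarize_embeddings(embeddings, atol=1e-3):
    """Report shape, value range, and whether every row has unit L2 norm."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    if embeddings.ndim != 2:
        raise ValueError("expected a 2-dim array of shape (time_steps, feature_dim)")
    # One L2 norm per row, i.e. per observation.
    norms = np.linalg.norm(embeddings, axis=1)
    return {
        "time_steps": embeddings.shape[0],
        "feature_dim": embeddings.shape[1],
        "min": float(embeddings.min()),
        "max": float(embeddings.max()),
        "l2_normalized": bool(np.allclose(norms, 1.0, atol=atol)),
    }


def l2_normalize(embeddings, eps=1e-12):
    """Scale each row to unit L2 norm; eps guards against all-zero rows."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)
```

Running `summarize_embeddings(np.load('train_sequence.npy'))` (the file name is a placeholder) on both the provided data and the self-extracted data would show whether their ranges and norms actually match.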
You will need to discuss this with the author of PyTorch_Speaker_Verification.
We are not responsible for the correctness or any issue of third-party libraries.
@hcfeng201 you should change the embedding creation demo:

    for file in os.listdir(folder):
        if file[-4:] == '.wav':
            print(folder + '/' + file)
            times, segs = VAD_chunk(2, folder + '/' + file)
            print("times" * 10, times)
            print("segs" * 10)
            if segs == []:
                print('No voice activity detected')
                continue
            concat_seg = concat_segs(times, segs)
            STFT_frames = get_STFTs(concat_seg)
            STFT_frames = np.stack(STFT_frames, axis=2)
            STFT_frames = torch.tensor(np.transpose(STFT_frames, axes=(2, 1, 0)))
            embeddings = embedder_net(STFT_frames)
            # print(embeddings)
            aligned_embeddings = align_embeddings(embeddings.detach().numpy())
            train_sequence.append(aligned_embeddings)
            # One cluster ID per embedding in this file
            for embedding in aligned_embeddings:
                train_cluster_id.append(str(label))
    label += 1
    # Concatenate all embeddings and save them as the test set
    test_sequence = np.concatenate(train_sequence, axis=0)
    test_cluster_id = np.asarray(train_cluster_id)
    np.save('test_sequence', test_sequence)
    np.save('test_cluster_id', test_cluster_id)
    print("%" * 100)
    print(test_sequence.shape, type(test_sequence))
and change the uis-rnn test demo:

    test_sequence = np.load('./data/test_sequence.npy')
    test_cluster_id = np.load('./data/test_cluster_id.npy')
    model = uisrnn.UISRNN(model_args)
    model.load(SAVED_MODEL_NAME)
    print("%" * 100)
    print(test_sequence.shape, type(test_sequence))
    print(test_cluster_id, type(test_cluster_id))
    predicted_label = model.predict(test_sequence, inference_args)
    predicted_labels.append(predicted_label)
    accuracy = uisrnn.compute_sequence_match_accuracy(list(test_cluster_id), predicted_label)
    test_record.append((accuracy, len(test_cluster_id)))
    print('Ground truth labels:')
    print(test_cluster_id)
    print('Predicted labels:')
    print(predicted_label)
    print('-' * 80)
    output_string = uisrnn.output_result(model_args, training_args, test_record)
    print('Finished diarization experiment')
    print(output_string)
Hi, I've tried to use LibriSpeech to train the model, and I found that the "backward" step (loss.backward()) takes the longest time in each iteration (almost 95% of the time). And the larger the dataset, the more time it consumes. Is that normal? Why is backward associated with the data? Thank you in advance.
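A measurement like the 95% figure above could be reproduced with a small timer along these lines. This is a minimal sketch using only the standard library: `StepTimer` is a hypothetical helper, not part of uis-rnn or PyTorch, and the training-loop usage in the trailing comments uses a placeholder `compute_loss`.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepTimer:
    """Accumulate wall-clock time per named step (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def section(self, name):
        """Time the enclosed block and add it to the total for `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def fractions(self):
        """Return each step's share of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


# Intended use inside a training loop (not executed here):
#   timer = StepTimer()
#   with timer.section("forward"):
#       loss = compute_loss(batch)   # compute_loss is a placeholder
#   with timer.section("backward"):
#       loss.backward()
#   print(timer.fractions())
```

When training on a GPU, CUDA kernels run asynchronously, so calling `torch.cuda.synchronize()` before leaving each section is needed for the per-step numbers to be meaningful.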