hcfeng201 commented 5 years ago

Hi, I've tried to use LibriSpeech to train the model, and I found that the "backward" step (loss.backward()) took the longest time in each iteration (almost 95% of the time). The larger the dataset, the more time it consumes. Is that normal? Why does the cost of backward depend on the data? Thank you in advance.
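A figure like "95% of the time" is easiest to reproduce with per-phase timers around each part of the iteration. The following is only a minimal sketch using the standard library; `PhaseTimer` and the phase names are hypothetical and not part of uis-rnn.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase of a training iteration."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        # Measure the wall-clock time spent inside the `with` block.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def fractions(self):
        """Return each phase's share of the total measured time."""
        total = sum(self.totals.values())
        if total == 0.0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


# Hypothetical usage inside a training loop:
#   timer = PhaseTimer()
#   with timer.phase("forward"):
#       loss = compute_loss(batch)      # placeholder for the forward pass
#   with timer.phase("backward"):
#       loss.backward()
#   print(timer.fractions())
```

On a GPU, PyTorch launches kernels asynchronously, so the wall-clock time may be attributed to the wrong phase unless `torch.cuda.synchronize()` is called before each timer is read.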

wq2012 commented 5 years ago

backward is the step to compute gradient so it's supposed to be the most expensive step.

The backward is not associated with the dataset, but associated with the size of input.

You can try to break the dataset into subsets and call the fit function multiple times as suggested in README.md.
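One possible way to split a long labeled sequence into subsets is sketched below. It cuts only where the cluster id changes, so no single-speaker segment is broken in the middle. The helper name `split_at_speaker_changes` and the `max_len` parameter are hypothetical, and the commented `fit` call assumes the `fit(train_sequence, train_cluster_id, training_args)` form described in README.md.

```python
def split_at_speaker_changes(cluster_ids, max_len):
    """Return (start, end) index pairs that cover the whole sequence.

    Chunks are cut only where the cluster id changes. A chunk is closed as
    soon as adding the next single-speaker segment would exceed max_len.
    A single segment longer than max_len is kept whole.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    n = len(cluster_ids)
    if n == 0:
        return []
    # Indices where a new single-speaker segment starts, plus the end index.
    changes = [i for i in range(1, n) if cluster_ids[i] != cluster_ids[i - 1]]
    boundaries = [0] + changes + [n]
    chunks = []
    start = 0
    for seg_start, seg_end in zip(boundaries[:-1], boundaries[1:]):
        if seg_end - start > max_len and seg_start > start:
            chunks.append((start, seg_start))
            start = seg_start
    chunks.append((start, n))
    return chunks


# Hypothetical usage with uis-rnn (names taken from this thread):
#   for start, end in split_at_speaker_changes(train_cluster_id, 10000):
#       model.fit(train_sequence[start:end],
#                 train_cluster_id[start:end],
#                 training_args)
```

Smaller chunks reduce the size of the input passed to each `fit` call; whether that is enough to bring the per-iteration time down depends on the data.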

hcfeng201 commented 5 years ago

> backward is the step to compute gradient so it's supposed to be the most expensive step.
>
> The backward is not associated with the dataset, but associated with the size of input.
>
> You can try to break the dataset into subsets and call the fit function multiple times as suggested in README.md.

Sir, I broke the dataset into 2 subsets. Each subset includes 41000 elements, which is smaller than the train_sequence you provided (47350). But it still costs 8 seconds per iteration, whereas with the data you provided it took less than a second.

wq2012 commented 5 years ago

Is 41000 the number of time steps, or the feature dimension? If former, what is your feature dimension?

Also, did you normalize the features before you feed them into uisrnn? If not, what's the range of your features?
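Both questions can be answered by inspecting the array directly. The sketch below is only an illustration: `describe_features` and `l2_normalize` are hypothetical helpers, written in pure Python so they work on nested lists as well as on 2-D numpy arrays. With numpy, `np.linalg.norm(x, axis=1)` gives the per-row norms much faster.

```python
import math


def describe_features(rows):
    """Summarize a 2-D feature matrix given as a sequence of equal-length rows."""
    rows = [[float(v) for v in row] for row in rows]
    if not rows:
        raise ValueError("feature matrix is empty")
    if len({len(row) for row in rows}) != 1:
        raise ValueError("rows have different lengths")
    norms = [math.sqrt(sum(v * v for v in row)) for row in rows]
    values = [v for row in rows for v in row]
    return {
        "time_steps": len(rows),        # first axis, e.g. 41000
        "feature_dim": len(rows[0]),    # second axis, e.g. 256
        "min_value": min(values),
        "max_value": max(values),
        "min_l2_norm": min(norms),
        "max_l2_norm": max(norms),
    }


def l2_normalize(rows, eps=1e-12):
    """Scale every row to unit L2 norm (all-zero rows are left as zeros)."""
    normalized = []
    for row in rows:
        row = [float(v) for v in row]
        norm = math.sqrt(sum(v * v for v in row))
        normalized.append([v / max(norm, eps) for v in row])
    return normalized
```

If the features are already L2-normalized d-vectors, `min_l2_norm` and `max_l2_norm` should both be very close to 1.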

hcfeng201 commented 5 years ago

> Is 41000 the number of time steps, or the feature dimension? If former, what is your feature dimension?
>
> Also, did you normalize the features before you feed them into uisrnn? If not, what's the range of your features?

41000 is the first dimension of a 2-D numpy array of shape (41000, 256), like your array of shape (47350, 256). By normalization, do you mean "The embedding vector (d-vector) is defined as the L2 normalization of the network output"? I extracted the d-vectors with "PyTorch_Speaker_Verification", so I think the normalization is already done.

wq2012 commented 5 years ago

You will need to discuss this with the author of PyTorch_Speaker_Verification.

We are not responsible for the correctness or any issue of third-party libraries.

Aurora11111 commented 5 years ago

@hcfeng201 You should change the embedding creation demo:

    for file in os.listdir(folder):
        if file[-4:] == '.wav':
            subprocess.call(['ffmpeg', '-i', 'file', file[-4:]+'.wav'])

            print(folder + '/' + file)
            # Run voice activity detection and keep only the voiced segments.
            times, segs = VAD_chunk(2, folder + '/' + file)
            print("times" * 10, times)
            print("segs" * 10)

            if segs == []:
                print('No voice activity detected')
                continue
            concat_seg = concat_segs(times, segs)
            STFT_frames = get_STFTs(concat_seg)
            STFT_frames = np.stack(STFT_frames, axis=2)
            STFT_frames = torch.tensor(np.transpose(STFT_frames, axes=(2, 1, 0)))
            embeddings = embedder_net(STFT_frames)
            # print(embeddings)
            aligned_embeddings = align_embeddings(embeddings.detach().numpy())
            train_sequence.append(aligned_embeddings)
            # One cluster id per embedding; each file gets its own label.
            for embedding in aligned_embeddings:
                train_cluster_id.append(str(label))
            label += 1

    # Build the final arrays once, after all files have been processed.
    test_sequence = np.concatenate(train_sequence, axis=0)
    test_cluster_id = np.asarray(train_cluster_id)

    np.save('test_sequence', test_sequence)
    np.save('test_cluster_id', test_cluster_id)
    print("%" * 100)
    print(test_sequence.shape, type(test_sequence))

and change the uis-rnn test demo:

    test_sequence = np.load('./data/test_sequence.npy')
    test_cluster_id = np.load('./data/test_cluster_id.npy')

    model = uisrnn.UISRNN(model_args)
    model.load(SAVED_MODEL_NAME)

    # testing
    print("%" * 100)
    print(test_sequence.shape, type(test_sequence))
    print(test_cluster_id, type(test_cluster_id))

    # Assumption: the single loaded sequence is treated as a one-element test set.
    test_sequences = [test_sequence]
    test_cluster_ids = [test_cluster_id]
    predicted_labels = []
    test_record = []

    for (test_sequence, test_cluster_id) in zip(test_sequences, test_cluster_ids):
        predicted_label = model.predict(test_sequence, inference_args)
        predicted_labels.append(predicted_label)
        accuracy = uisrnn.compute_sequence_match_accuracy(
            list(test_cluster_id), predicted_label)
        test_record.append((accuracy, len(test_cluster_id)))
        print('Ground truth labels:')
        print(test_cluster_id)
        print('Predicted labels:')
        print(predicted_label)
        print('-' * 80)

    output_string = uisrnn.output_result(model_args, training_args, test_record)
    print('Finished diarization experiment')
    print(output_string)

google / uis-rnn

Question on time cost during each iteration #27
