google / uis-rnn

This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.
https://arxiv.org/abs/1810.04719
Apache License 2.0

run_test.sh problem #14

Closed. 77281900000 closed this issue 5 years ago.

77281900000 commented 5 years ago

Hi, I ran demo.py twice. The first time it worked well and its accuracy was 1.0. But when I deleted the model and tried again, it ran but the result was only about 0.8. I'm sure I didn't change the program. I even deleted the whole project and cloned it again with git; the result was still about 0.8. Then I ran run_test.sh and got an error:

======================================================================
FAIL: test_four_clusters (__main__.TestIntegration)
Four clusters on vertices of a square.

Traceback (most recent call last):
  File "tests/integration_test.py", line 99, in test_four_clusters
    self.assertEqual(1.0, accuracy)
AssertionError: 1.0 != 0.9


Ran 1 test in 17.543s

FAILED (failures=1)

Something strange must be happening. Could anyone tell me what could lead to this? Thanks.

wq2012 commented 5 years ago

Thanks for reporting. We will look into it.

wq2012 commented 5 years ago

Hi @gaodihe , could you let us know your versions of python, numpy, and pytorch? That could help us identify the problem.
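For reference, a quick way to collect these versions is a snippet like the one below; it only relies on the standard `sys.version`, `numpy.__version__`, and `torch.__version__` attributes and is not specific to this repo:

```python
# Print the interpreter and package versions relevant to this issue.
import sys
import numpy
import torch

print("Python :", sys.version.split()[0])
print("numpy  :", numpy.__version__)
print("PyTorch:", torch.__version__)
```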

wq2012 commented 5 years ago

Tried a few times. Cannot replicate the problem yet :(

77281900000 commented 5 years ago

Hi, I'm using Python 3.5.2, numpy 1.15.1, and PyTorch 0.4.0.

77281900000 commented 5 years ago

Actually, I cannot replicate it either. After trying a few times, the program ran correctly with accuracy 1.0. That problem seems to have disappeared, but the run_test.sh problem still exists. And run_test.sh runs correctly in another file path (I think there is no difference except some data).

wq2012 commented 5 years ago

@gaodihe Thanks for your information!

I just created a new issue: https://github.com/google/uis-rnn/issues/16

Once this is done, I may need your help to re-run the tests with a high verbosity value and share the logs with us.

Currently we don't have sufficient information to debug this.

wq2012 commented 5 years ago

@gaodihe Actually, even before I resolve that bug, could you share with me the full STDOUT information of your failing test?

77281900000 commented 5 years ago

....

Ran 4 tests in 0.001s

OK
F
FAIL: test_four_clusters (__main__.TestIntegration)
Four clusters on vertices of a square.

Traceback (most recent call last):
  File "tests/integration_test.py", line 99, in test_four_clusters
    self.assertEqual(1.0, accuracy)
AssertionError: 1.0 != 0.9

Ran 1 test in 15.784s

FAILED (failures=1)
gaodihe@gaodihe-All-Series:~/PycharmProjects/uis-rnn$ ./run_tests.sh >1.txt
....

Ran 4 tests in 0.001s

OK
^CTraceback (most recent call last):
  File "tests/integration_test.py", line 115, in <module>
    unittest.main()
  File "/usr/lib/python3.5/unittest/main.py", line 94, in __init__
    self.runTests()
  File "/usr/lib/python3.5/unittest/main.py", line 255, in runTests
    self.result = testRunner.run(self.test)
  File "/usr/lib/python3.5/unittest/runner.py", line 176, in run
    test(result)
  File "/usr/lib/python3.5/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python3.5/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/lib/python3.5/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python3.5/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/lib/python3.5/unittest/case.py", line 648, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python3.5/unittest/case.py", line 600, in run
    testMethod()
  File "tests/integration_test.py", line 89, in test_four_clusters
    model.fit(train_sequence, train_cluster_id, training_args)
  File "/home/gaodihe/PycharmProjects/uis-rnn/model/uisrnn.py", line 250, in fit
    mean, _ = self.rnn_model(packed_train_sequence, hidden)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gaodihe/PycharmProjects/uis-rnn/model/uisrnn.py", line 45, in forward
    output_seq, hidden = self.gru(input_seq, hidden)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/rnn.py", line 192, in forward
    output, hidden = func(input, self.all_weights, hx, batch_sizes)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/_functions/rnn.py", line 323, in forward
    return func(input, *fargs, **fkwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/_functions/rnn.py", line 287, in forward
    dropout_ts)
KeyboardInterrupt
gaodihe@gaodihe-All-Series:~/PycharmProjects/uis-rnn$ ./run_tests.sh
~/PycharmProjects/uis-rnn ~/PycharmProjects/uis-rnn
Running tests in tests/utils_test.py
....

Ran 4 tests in 0.001s

OK
Running tests in tests/integration_test.py
Iter: 0      Training Loss: 4.1982
    Negative Log Likelihood: 6.4964    Sigma2 Prior: -2.2984    Regularization: 0.0002
Iter: 10     Training Loss: -0.1935
    Negative Log Likelihood: 1.5341    Sigma2 Prior: -1.7278    Regularization: 0.0002
Iter: 20     Training Loss: -0.8121
    Negative Log Likelihood: 0.7201    Sigma2 Prior: -1.5324    Regularization: 0.0002
Iter: 30     Training Loss: -1.0152
    Negative Log Likelihood: 0.4823    Sigma2 Prior: -1.4977    Regularization: 0.0002
Iter: 40     Training Loss: -1.0503
    Negative Log Likelihood: 0.4961    Sigma2 Prior: -1.5466    Regularization: 0.0002
Changing learning rate to: 0.005
Iter: 50     Training Loss: -1.3244
    Negative Log Likelihood: 0.3421    Sigma2 Prior: -1.6667    Regularization: 0.0002
Iter: 60     Training Loss: -1.5849
    Negative Log Likelihood: 0.3386    Sigma2 Prior: -1.9238    Regularization: 0.0002
Iter: 70     Training Loss: -2.2978
    Negative Log Likelihood: 1.0629    Sigma2 Prior: -3.3610    Regularization: 0.0002
Iter: 80     Training Loss: -2.6359
    Negative Log Likelihood: 0.4483    Sigma2 Prior: -3.0844    Regularization: 0.0002
Iter: 90     Training Loss: -2.3275
    Negative Log Likelihood: 0.1928    Sigma2 Prior: -2.5205    Regularization: 0.0002
Changing learning rate to: 0.0025
Iter: 100    Training Loss: -2.8464
    Negative Log Likelihood: 0.1297    Sigma2 Prior: -2.9763    Regularization: 0.0003
Iter: 110    Training Loss: -2.5952
    Negative Log Likelihood: 0.0849    Sigma2 Prior: -2.6804    Regularization: 0.0003
Iter: 120    Training Loss: -2.6835
    Negative Log Likelihood: 0.0827    Sigma2 Prior: -2.7664    Regularization: 0.0003
Iter: 130    Training Loss: -3.3645
    Negative Log Likelihood: 0.9887    Sigma2 Prior: -4.3535    Regularization: 0.0003
Iter: 140    Training Loss: -3.6595
    Negative Log Likelihood: 0.1904    Sigma2 Prior: -3.8502    Regularization: 0.0003
Changing learning rate to: 0.00125
Iter: 150    Training Loss: -3.9500
    Negative Log Likelihood: 0.1933    Sigma2 Prior: -4.1435    Regularization: 0.0003
Iter: 160    Training Loss: -3.5048
    Negative Log Likelihood: 0.1096    Sigma2 Prior: -3.6147    Regularization: 0.0003
Iter: 170    Training Loss: -3.2753
    Negative Log Likelihood: 0.4618    Sigma2 Prior: -3.7374    Regularization: 0.0003
Iter: 180    Training Loss: -3.5798
    Negative Log Likelihood: 0.4441    Sigma2 Prior: -4.0242    Regularization: 0.0003
Iter: 190    Training Loss: -3.4913
    Negative Log Likelihood: 0.4049    Sigma2 Prior: -3.8965    Regularization: 0.0003
Done training with 200 iterations
F

FAIL: test_four_clusters (__main__.TestIntegration)
Four clusters on vertices of a square.

Traceback (most recent call last):
  File "tests/integration_test.py", line 99, in test_four_clusters
    self.assertEqual(1.0, accuracy)
AssertionError: 1.0 != 0.9


Ran 1 test in 16.022s

wq2012 commented 5 years ago

My initial guess is that the network simply didn't converge to a good point at the end of training.

0.9 is still a high accuracy, though we were expecting 1.0.

A few things to try to validate this:

In general this issue could be avoided by training multiple networks in parallel and picking the best one.
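A rough sketch of that multi-run idea, assuming the Python API used elsewhere in this thread (a `UISRNN` class with `fit()` and `predict()`, argument objects as in `tests/integration_test.py`, and a sequence-match accuracy helper in the evals module); exact module paths and names may differ between versions of this repo, and the train/test arrays are placeholders supplied by the caller:

```python
# Sketch: train several models with different random seeds and keep the one
# that scores best on held-out data. Assumes uisrnn.UISRNN, fit(), predict(),
# and a compute_sequence_match_accuracy() helper as used by the integration
# test; the train/test arrays are placeholders provided by the caller.
import numpy as np
import torch
import uisrnn


def train_best_of(num_runs, model_args, training_args, inference_args,
                  train_sequence, train_cluster_id,
                  test_sequence, test_cluster_id):
  best_accuracy, best_model = -1.0, None
  for seed in range(num_runs):
    np.random.seed(seed)
    torch.manual_seed(seed)
    model = uisrnn.UISRNN(model_args)
    model.fit(train_sequence, train_cluster_id, training_args)
    predicted_cluster_id = model.predict(test_sequence, inference_args)
    # Helper path may differ by version (e.g. uisrnn.compute_sequence_match_accuracy).
    accuracy = uisrnn.evals.compute_sequence_match_accuracy(
        list(test_cluster_id), predicted_cluster_id)
    if accuracy > best_accuracy:
      best_accuracy, best_model = accuracy, model
  return best_model, best_accuracy
```

Whether the seeds above control all the randomness inside `fit()` depends on the version; the point is only to run several independent trainings and keep the best-scoring model.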

But there is also room for us to improve the training process and the default arguments to make training more robust and efficient.

@AnzCol Please take a look at this to see if you have any thoughts.

77281900000 commented 5 years ago

Yes, I have tried your suggestions. Both of them solved the problem. Thanks for your help.

wq2012 commented 5 years ago

@gaodihe Thanks for trying it. It's very helpful!

It basically validated that the failure is due to unsuccessful training.

In practice we usually use many more training steps than the unit/integration tests do. The purpose of the tests is to validate code correctness, and we often run them after a small code change, so we prefer fewer steps to make them fast rather than stable.
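As a concrete illustration of that trade-off, the test log above stops after 200 iterations, while a real run would typically use many more. A minimal sketch, assuming the `parse_arguments()` helper, the `train_iteration` training argument, and the `observation_dim` model argument exposed by this package (names may differ between versions), with toy placeholder data standing in for real d-vector sequences:

```python
# Sketch: raise the iteration count for a more robust (but slower) training
# run than the integration test uses. parse_arguments() and train_iteration
# are assumed from this package's argument helpers; the toy data below is a
# placeholder for real d-vector sequences and speaker labels.
import numpy as np
import uisrnn

model_args, training_args, inference_args = uisrnn.parse_arguments()
training_args.train_iteration = 2000  # hypothetical value; the test stops at ~200

# Placeholder training data: random feature vectors with two made-up cluster ids.
train_sequence = np.random.rand(100, model_args.observation_dim)
train_cluster_id = np.array(['0_spk0'] * 50 + ['0_spk1'] * 50)

model = uisrnn.UISRNN(model_args)
model.fit(train_sequence, train_cluster_id, training_args)
```

Raising `train_iteration` like this is exactly the kind of slower-but-more-stable run the tests deliberately avoid.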