jtrmal opened this issue 1 year ago
And sorry for being terse
Could you change
assert torch.allclose(m, expected.to(device))
to
assert torch.allclose(m, expected.to(device)), (m - expected.to(device)).abs().max()
so that it prints out some information on assertion failure.
If the value is very small, e.g., 0.001, we can ignore it.
BTW, torch.testing.assert_close prints out better diagnostic info (how many elements mismatch, by how much, etc.).
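For reference, a minimal standalone sketch of the two styles (toy values made up here just to show the failure messages, not the test's actual numbers):

import torch

m = torch.tensor([0.0])            # stand-in for the computed loss
expected = torch.tensor([1.1028])  # stand-in for the reference value

# plain allclose: attach the max difference yourself, otherwise the
# AssertionError carries no information at all
assert torch.allclose(m, expected), (m - expected).abs().max()

# assert_close raises its own AssertionError and reports how many elements
# mismatch and the greatest absolute/relative difference
torch.testing.assert_close(m, expected)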
fangjun's code gave me this:
======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 845, in test_rnnt_loss_empty_reference
assert torch.allclose(m, expected.to(device)), (
AssertionError: tensor(1.1028, device='cuda:0')
I also tried Piotr's suggestion, but that didn't provide any info; it only threw an AssertionError:
======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 848, in test_rnnt_loss_empty_reference
assert torch.testing.assert_close(m, expected.to(device))
AssertionError
----------------------------------------------------------------------
The code was
assert torch.testing.assert_close(m, expected.to(device))
I tried this code:
assert torch.allclose(m, expected.to(device)), (
m,
expected,
m - expected.to(device),
)
and the output was
======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 845, in test_rnnt_loss_empty_reference
assert torch.allclose(m, expected.to(device)), (
AssertionError: (tensor([0.], device='cuda:0'), tensor([1.1028]), tensor([-1.1028], device='cuda:0'))
----------------------------------------------------------------------
However, if I do something like this
850 assert torch.testing.assert_close(
851 m,
852 expected.to(device),
853 check_layout=False,
854 check_device=False,
855 check_dtype=False,
856 ), (m, expected.to(device), (m - expected.to(device)))
I get this output
======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 850, in test_rnnt_loss_empty_reference
assert torch.testing.assert_close(
AssertionError: (tensor([1.1028]), tensor([1.1028]), tensor([0.]))
----------------------------------------------------------------------
Ran 8 tests in 2.332s
I'm so confused
Does it look like some timing/kernel sync issue?
export K2_DISABLE_CHECKS=0
export K2_SYNC_KERNELS=1
export CUDA_LAUNCH_BLOCKING=1
didn't change the behavior, though
it did succeed on CPU, I think:
$ CUDA_VISIBLE_DEVICES= ctest --rerun-failed --output-on-failure
Test project /home/jtrmal/projects/k2/build_debug
Start 97: rnnt_loss_test_py
1/1 Test #97: rnnt_loss_test_py ................ Passed 1.52 sec
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 1.53 sec
sorry for spamming :/
Please change
assert torch.testing.assert_close(
to
torch.testing.assert_close(
The actual assertion happens inside that function; then you'll see more info.
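In other words (a small sketch with made-up values, separate from the test code): torch.testing.assert_close returns None, so putting it behind assert always fails, and the message tuple you attach is what gets printed instead of its own report.

import torch

m = torch.tensor([1.0])
expected = torch.tensor([1.0])

# misleading: assert_close returns None even when the tensors match,
# so this outer assert always fires and only shows the message tuple
# assert torch.testing.assert_close(m, expected), (m, expected, m - expected)

# correct: call it directly; if the tensors differ it raises its own
# AssertionError with the full mismatch report
torch.testing.assert_close(m, expected)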
ah!
======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 850, in test_rnnt_loss_empty_reference
torch.testing.assert_close(
File "/home/jtrmal/.local/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1342, in assert_close
assert_equal(
File "/home/jtrmal/.local/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!
Mismatched elements: 1 / 1 (100.0%)
Greatest absolute difference: 1.1028387546539307 at index (0,) (up to 1e-05 allowed)
Greatest relative difference: 1.0 at index (0,) (up to 1.3e-06 allowed)
----------------------------------------------------------------------
Ran 8 tests in 4.031s
FAILED (failures=1)
Oops, not helpful with a single-element tensor 🙈
Tested with cuDNN 8.3, 8.6, and 8.8 and could reproduce on all three.
We only added the ability to have an empty reference fairly recently, so it's possible it was never properly tested then. Looking at that code, "expected" only seems to be written to if the device is CPU. [EDIT: I see now that this is how it is supposed to work; it is in a loop over device.]
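If I'm reading that right, the test pattern being described is roughly this (a hypothetical sketch; compute_loss and devices are placeholders, not the actual test code):

import torch

def check_on_all_devices(devices, compute_loss):
    # "expected" is only written on the CPU pass; every other device
    # in the loop is then compared against that CPU reference
    expected = None
    for device in devices:
        m = compute_loss(device)
        if device.type == "cpu":
            expected = m
        else:
            assert torch.allclose(m.cpu(), expected), (m.cpu() - expected).abs().max()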
If you have time, one thing you could help to debug with is this: in mutual_information.py line 393, after the following line
# note, tot_probs is without grad.
tot_probs = _k2.mutual_information_forward(px_tot, py_tot, boundary, p)
print out the value of p (p will get set by this function call). This may generate a lot of output before it crashes; redirect it to a file if you want.
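Concretely, the kind of print I mean (a sketch; the exact line number may differ in your checkout, and the formatting of the print is up to you):

# in mutual_information.py, right after the line quoted above:
# note, tot_probs is without grad.
tot_probs = _k2.mutual_information_forward(px_tot, py_tot, boundary, p)
print("p =", p)  # p is filled in by the call above; redirect stdout to a file if it gets too noisy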
It's a 3090. Will get in touch with Desh to see if he can dig deeper than I can.
On Fri, Feb 17, 2023 at 3:15 AM Daniel Povey wrote:
What is your hardware?
@jtrmal I just tried running on CLSP grid with GPU and it passes:
10:31 $ pytest k2/python/tests/rnnt_loss_test.py
====================================================================== test session starts =======================================================================
platform linux -- Python 3.8.12, pytest-5.4.3, py-1.11.0, pluggy-0.13.1
rootdir: /export/c07/draj/mini_scale_2022/k2
plugins: typeguard-2.13.3, anyio-3.5.0, hypothesis-5.41.2
collected 8 items
k2/python/tests/rnnt_loss_test.py ........ [100%]
======================================================================== warnings summary ========================================================================
/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/stepwise.py:108
/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/stepwise.py:108: PytestCacheWarning: cache could not write path /export/c07/draj/mini_scale_2022/k2/.pytest_cache/v/cache/stepwise
self.config.cache.set("cache/stepwise", [])
/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/cacheprovider.py:366
/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/cacheprovider.py:366: PytestCacheWarning: cache could not write path /export/c07/draj/mini_scale_2022/k2/.pytest_cache/v/cache/nodeids
config.cache.set("cache/nodeids", self.cached_nodeids)
/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/cacheprovider.py:326
/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/cacheprovider.py:326: PytestCacheWarning: cache could not write path /export/c07/draj/mini_scale_2022/k2/.pytest_cache/v/cache/lastfailed
config.cache.set("cache/lastfailed", self.lastfailed)
-- Docs: https://docs.pytest.org/en/latest/warnings.html
================================================================= 8 passed, 3 warnings in 41.40s =================================================================
(You can ignore the warnings --- the c07 node is read-only today due to some issues.)
OK, might be something specific to Yenda's setup or where he is running it. @jtrmal can you please add that print statement that I mentioned above? (I edited it, you may not see it from email)
I'm attaching logs from both cpu and gpu runs, obtained as
CUDA_VISIBLE_DEVICES= ctest --rerun-failed --verbose > cpu.log
CUDA_VISIBLE_DEVICES=0 ctest --rerun-failed --verbose > gpu.log
CPU run succeeded, GPU failed
Also, I had to modify the mutual_information.py file at line 160; the modification at line 397 didn't give any output (joint_mutual_information_recursion probably isn't being called).
CUDA 11.7, cuDNN 8.7.0.84. Happens with both Release and Debug builds; k2 from git master. gcc: gcc (Debian 10.2.1-6) 10.2.1 20210110
Any ideas?