k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
https://k2-fsa.github.io/k2
Apache License 2.0

Error in rnnt_loss_test_py #1160

Open jtrmal opened 1 year ago

jtrmal commented 1 year ago
$ CUDA_VISIBLE_DEVICES=0 ctest --rerun-failed --output-on-failure
Test project /home/jtrmal/projects/k2/build_debug
    Start 97: rnnt_loss_test_py
1/1 Test #97: rnnt_loss_test_py ................***Failed    3.42 sec
..F.Pruned with new ranges 2 : tensor([770.3518, 452.2017, 759.8420, 562.8477, 664.1088])
Pruned with old ranges 2 : tensor([770.2620, 451.0323, 758.0836, 564.5976, 664.1088])
Pruned with new ranges 7 : tensor([698.3528, 427.7394, 720.2049, 538.8741, 664.1088])
Pruned with old ranges 7 : tensor([695.5566, 427.7296, 719.7375, 534.9005, 664.1088])
Pruned with new ranges 12 : tensor([688.5497, 427.1318, 716.7771, 527.3295, 664.1088])
Pruned with old ranges 12 : tensor([688.5190, 427.1318, 716.7300, 527.1926, 664.1088])
Pruned with new ranges 17 : tensor([687.4325, 427.1087, 716.2193, 524.6537, 664.1088])
Pruned with old ranges 17 : tensor([687.4208, 427.1087, 716.2195, 524.6722, 664.1088])
Pruned with new ranges 2 : tensor([770.3518, 452.2017, 759.8420, 562.8477, 664.1088], device='cuda:0')
Pruned with old ranges 2 : tensor([770.2620, 451.0323, 758.0836, 564.5977, 664.1088], device='cuda:0')
Pruned with new ranges 7 : tensor([698.3528, 427.7394, 720.2049, 538.8741, 664.1088], device='cuda:0')
Pruned with old ranges 7 : tensor([695.5567, 427.7296, 719.7375, 534.9005, 664.1088], device='cuda:0')
Pruned with new ranges 12 : tensor([688.5497, 427.1318, 716.7771, 527.3295, 664.1088], device='cuda:0')
Pruned with old ranges 12 : tensor([688.5190, 427.1318, 716.7300, 527.1926, 664.1088], device='cuda:0')
Pruned with new ranges 17 : tensor([687.4325, 427.1087, 716.2193, 524.6537, 664.1088], device='cuda:0')
Pruned with old ranges 17 : tensor([687.4208, 427.1087, 716.2195, 524.6722, 664.1088], device='cuda:0')
Unpruned rnnt loss with regular rnnt : tensor([117.7035, 583.1506, 178.6128, 342.4715])
Pruned loss with range 2 : tensor([126.5516, 645.1305, 240.4490, 374.9182], dtype=torch.float64)
Pruned loss with range 7 : tensor([117.7035, 614.2900, 198.5655, 347.0386], dtype=torch.float64)
Pruned loss with range 12 : tensor([117.7035, 601.2673, 184.7332, 342.9748], dtype=torch.float64)
Pruned loss with range 17 : tensor([117.7035, 591.1936, 179.9721, 342.5152], dtype=torch.float64)
Pruned loss with range 22 : tensor([117.7035, 588.4237, 178.7730, 342.4716], dtype=torch.float64)
Pruned loss with range 27 : tensor([117.7035, 586.2511, 178.6456, 342.4716], dtype=torch.float64)
Pruned loss with range 32 : tensor([117.7035, 583.1505, 178.6393, 342.4716], dtype=torch.float64)
Pruned loss with range 37 : tensor([117.7035, 583.1505, 178.6138, 342.4716], dtype=torch.float64)
Pruned loss with range 42 : tensor([117.7035, 583.1505, 178.6129, 342.4716], dtype=torch.float64)
Pruned loss with range 47 : tensor([117.7035, 583.1505, 178.6129, 342.4716], dtype=torch.float64)
Unpruned rnnt loss with regular rnnt : tensor([117.7035, 583.1506, 178.6128, 342.4715], device='cuda:0')
Pruned loss with range 2 : tensor([126.5516, 645.1305, 240.4490, 374.9182], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 7 : tensor([117.7035, 614.2900, 198.5655, 347.0386], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 12 : tensor([117.7035, 601.2673, 184.7332, 342.9748], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 17 : tensor([117.7035, 591.1936, 179.9721, 342.5152], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 22 : tensor([117.7035, 588.4237, 178.7730, 342.4716], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 27 : tensor([117.7035, 586.2511, 178.6456, 342.4716], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 32 : tensor([117.7035, 583.1505, 178.6393, 342.4716], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 37 : tensor([117.7035, 583.1505, 178.6139, 342.4716], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 42 : tensor([117.7035, 583.1505, 178.6129, 342.4716], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 47 : tensor([117.7035, 583.1505, 178.6129, 342.4716], device='cuda:0',
       dtype=torch.float64)
Unpruned rnnt loss with modified rnnt : tensor([105.8454, 520.9167, 110.4065, 302.0487])
Pruned loss with range 2 : tensor([109.4327, 563.8055, 125.9352, 322.0668], dtype=torch.float64)
Pruned loss with range 7 : tensor([105.8454, 537.7171, 111.0337, 303.5203], dtype=torch.float64)
Pruned loss with range 12 : tensor([105.8454, 530.4149, 110.4277, 302.3482], dtype=torch.float64)
Pruned loss with range 17 : tensor([105.8454, 526.3236, 110.4066, 302.0535], dtype=torch.float64)
Pruned loss with range 22 : tensor([105.8454, 524.1243, 110.4065, 302.0488], dtype=torch.float64)
Pruned loss with range 27 : tensor([105.8454, 522.4050, 110.4065, 302.0488], dtype=torch.float64)
Pruned loss with range 32 : tensor([105.8454, 520.9166, 110.4065, 302.0488], dtype=torch.float64)
Pruned loss with range 37 : tensor([105.8454, 520.9166, 110.4065, 302.0488], dtype=torch.float64)
Pruned loss with range 42 : tensor([105.8454, 520.9166, 110.4065, 302.0488], dtype=torch.float64)
Pruned loss with range 47 : tensor([105.8454, 520.9166, 110.4065, 302.0488], dtype=torch.float64)
Unpruned rnnt loss with modified rnnt : tensor([105.8454, 520.9167, 110.4065, 302.0487], device='cuda:0')
Pruned loss with range 2 : tensor([109.4327, 563.8055, 125.9352, 322.0668], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 7 : tensor([105.8454, 537.7171, 111.0337, 303.5203], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 12 : tensor([105.8454, 530.4149, 110.4277, 302.3482], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 17 : tensor([105.8454, 526.3236, 110.4066, 302.0535], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 22 : tensor([105.8454, 524.1243, 110.4065, 302.0488], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 27 : tensor([105.8454, 522.4050, 110.4065, 302.0488], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 32 : tensor([105.8454, 520.9166, 110.4065, 302.0488], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 37 : tensor([105.8454, 520.9166, 110.4065, 302.0488], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 42 : tensor([105.8454, 520.9166, 110.4065, 302.0488], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 47 : tensor([105.8454, 520.9166, 110.4065, 302.0488], device='cuda:0',
       dtype=torch.float64)
Unpruned rnnt loss with constrained rnnt : tensor([118.3153, 590.9176, 210.3912, 346.9110])
Pruned loss with range 2 : tensor([125.0485, 637.4134, 236.0995, 368.7725], dtype=torch.float64)
Pruned loss with range 7 : tensor([118.3153, 610.0108, 211.3728, 348.9163], dtype=torch.float64)
Pruned loss with range 12 : tensor([118.3153, 602.3602, 210.4280, 347.3128], dtype=torch.float64)
Pruned loss with range 17 : tensor([118.3153, 596.3497, 210.3915, 346.9178], dtype=torch.float64)
Pruned loss with range 22 : tensor([118.3153, 594.3053, 210.3912, 346.9110], dtype=torch.float64)
Pruned loss with range 27 : tensor([118.3153, 592.7185, 210.3912, 346.9110], dtype=torch.float64)
Pruned loss with range 32 : tensor([118.3153, 590.9175, 210.3912, 346.9110], dtype=torch.float64)
Pruned loss with range 37 : tensor([118.3153, 590.9175, 210.3912, 346.9110], dtype=torch.float64)
Pruned loss with range 42 : tensor([118.3153, 590.9175, 210.3912, 346.9110], dtype=torch.float64)
Pruned loss with range 47 : tensor([118.3153, 590.9175, 210.3912, 346.9110], dtype=torch.float64)
Unpruned rnnt loss with constrained rnnt : tensor([118.3153, 590.9176, 210.3912, 346.9110], device='cuda:0')
Pruned loss with range 2 : tensor([125.0485, 637.4134, 236.0995, 368.7725], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 7 : tensor([118.3153, 610.0108, 211.3728, 348.9163], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 12 : tensor([118.3153, 602.3602, 210.4280, 347.3128], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 17 : tensor([118.3153, 596.3497, 210.3915, 346.9178], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 22 : tensor([118.3153, 594.3053, 210.3912, 346.9110], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 27 : tensor([118.3153, 592.7185, 210.3912, 346.9110], device='cuda:0',
       dtype=torch.float64)....
======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 844, in test_rnnt_loss_empty_reference
    assert torch.allclose(m, expected.to(device))
AssertionError

----------------------------------------------------------------------
Ran 8 tests in 2.342s

FAILED (failures=1)

Pruned loss with range 32 : tensor([118.3153, 590.9175, 210.3912, 346.9110], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 37 : tensor([118.3153, 590.9175, 210.3912, 346.9110], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 42 : tensor([118.3153, 590.9175, 210.3912, 346.9110], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 47 : tensor([118.3153, 590.9175, 210.3912, 346.9110], device='cuda:0',
       dtype=torch.float64)
B = 2, T = 9, S = 2, C = 10
Unpruned rnnt loss with regular rnnt : tensor([22.1890, 13.5834])
Pruned loss with range 2 : tensor([22.1890, 14.5212], dtype=torch.float64)
Pruned loss with range 3 : tensor([22.1890, 13.5834], dtype=torch.float64)
Unpruned rnnt loss with regular rnnt : tensor([22.1890, 13.5834], device='cuda:0')
Pruned loss with range 2 : tensor([22.1890, 14.5212], device='cuda:0', dtype=torch.float64)
Pruned loss with range 3 : tensor([22.1890, 13.5834], device='cuda:0', dtype=torch.float64)
Unpruned rnnt loss with modified rnnt : tensor([19.7059,  9.4256])
Pruned loss with range 1 : tensor([21.3703, 11.4501], dtype=torch.float64)
Pruned loss with range 2 : tensor([19.7059,  9.7360], dtype=torch.float64)
Pruned loss with range 3 : tensor([19.7059,  9.4256], dtype=torch.float64)
Unpruned rnnt loss with modified rnnt : tensor([19.7059,  9.4256], device='cuda:0')
Pruned loss with range 1 : tensor([21.3703, 11.4501], device='cuda:0', dtype=torch.float64)
Pruned loss with range 2 : tensor([19.7059,  9.7360], device='cuda:0', dtype=torch.float64)
Pruned loss with range 3 : tensor([19.7059,  9.4256], device='cuda:0', dtype=torch.float64)
Unpruned rnnt loss with constrained rnnt : tensor([22.1890, 13.9861])
Pruned loss with range 1 : tensor([inf, inf], dtype=torch.float64)
Pruned loss with range 2 : tensor([22.1890, 14.4814], dtype=torch.float64)
Pruned loss with range 3 : tensor([22.1890, 13.9861], dtype=torch.float64)
Unpruned rnnt loss with constrained rnnt : tensor([22.1890, 13.9861], device='cuda:0')
Pruned loss with range 1 : tensor([inf, inf], device='cuda:0', dtype=torch.float64)
Pruned loss with range 2 : tensor([22.1890, 14.4814], device='cuda:0', dtype=torch.float64)
Pruned loss with range 3 : tensor([22.1890, 13.9861], device='cuda:0', dtype=torch.float64)

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   3.43 sec

The following tests FAILED:
     97 - rnnt_loss_test_py (Failed)
Errors while running CTest

CUDA 11.7, cuDNN 8.7.0.84. Happens with both Release and Debug builds, k2 from git master. gcc: gcc (Debian 10.2.1-6) 10.2.1 20210110

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

Any ideas?

jtrmal commented 1 year ago

And sorry for being terse

csukuangfj commented 1 year ago

Could you change

assert torch.allclose(m, expected.to(device))

to

assert torch.allclose(m, expected.to(device)), (m - expected.to(device)).abs().max()

so that it prints out some information on assertion failure.

If the value is very small, e.g., 0.001, we can ignore it.
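The pattern in the suggestion generalizes: attach the worst-case difference to the assertion message so a bare AssertionError becomes diagnosable. A torch-free sketch of the same idea (the helper names `max_abs_diff` and `assert_allclose` are mine, not from k2):

```python
def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))

def assert_allclose(a, b, atol=1e-5):
    # Mirrors `assert torch.allclose(...)`, but the assertion message
    # carries the worst offender, as in the suggestion above.
    diff = max_abs_diff(a, b)
    assert diff <= atol, f"max abs diff {diff} exceeds atol {atol}"

assert_allclose([1.0, 2.0], [1.0, 2.0])  # passes silently
```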

pzelasko commented 1 year ago

BTW, torch.testing.assert_close prints better diagnostic info (how many elements mismatch, by how much, etc.)

jtrmal commented 1 year ago

fangjun's code gave me this:

======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 845, in test_rnnt_loss_empty_reference
    assert torch.allclose(m, expected.to(device)), (
AssertionError: tensor(1.1028, device='cuda:0')

I also tried Piotr's suggestion, but that didn't provide any info; it only threw AssertionError:

======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 848, in test_rnnt_loss_empty_reference
    assert torch.testing.assert_close(m, expected.to(device))
AssertionError

----------------------------------------------------------------------

The code was

assert torch.testing.assert_close(m, expected.to(device))

jtrmal commented 1 year ago

I tried this code:

                assert torch.allclose(m, expected.to(device)), (
                    m,
                    expected,
                    m - expected.to(device),
                )

and the output was

======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 845, in test_rnnt_loss_empty_reference
    assert torch.allclose(m, expected.to(device)), (
AssertionError: (tensor([0.], device='cuda:0'), tensor([1.1028]), tensor([-1.1028], device='cuda:0'))

----------------------------------------------------------------------
jtrmal commented 1 year ago

However, if I do something like this

850                 assert torch.testing.assert_close(
851                     m,
852                     expected.to(device),
853                     check_layout=False,
854                     check_device=False,
855                     check_dtype=False,
856                 ), (m, expected.to(device), (m - expected.to(device)))

I get this output

======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 850, in test_rnnt_loss_empty_reference
    assert torch.testing.assert_close(
AssertionError: (tensor([1.1028]), tensor([1.1028]), tensor([0.]))

----------------------------------------------------------------------
Ran 8 tests in 2.332s

I'm so confused

jtrmal commented 1 year ago

Does it look like some timing/kernel sync issue?

jtrmal commented 1 year ago
export K2_DISABLE_CHECKS=0
export K2_SYNC_KERNELS=1
export CUDA_LAUNCH_BLOCKING=1

didn't change the behavior, though
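For what it's worth, those variables only take effect if they are set before CUDA initializes; in particular CUDA_LAUNCH_BLOCKING is read at context creation, so from Python it has to be exported before torch/k2 are imported. A sketch of that ordering (the specific values mirror the ones used above; whether K2_DISABLE_CHECKS=0 means "checks on" is my assumption from the naming):

```python
import os

# Must be in the environment before the first CUDA call, so set them
# before importing torch or k2 (or export them in the shell, as above).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # serialize kernel launches
os.environ["K2_SYNC_KERNELS"] = "1"        # k2: sync after each kernel
os.environ["K2_DISABLE_CHECKS"] = "0"      # k2: presumably keeps internal checks on

# import torch
# import k2
# ... run the failing test only after the variables are in place
```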

jtrmal commented 1 year ago

it did succeed on CPU, I think:

$ CUDA_VISIBLE_DEVICES= ctest --rerun-failed --output-on-failure
Test project /home/jtrmal/projects/k2/build_debug
    Start 97: rnnt_loss_test_py
1/1 Test #97: rnnt_loss_test_py ................   Passed    1.52 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) =   1.53 sec
jtrmal commented 1 year ago

sorry for spamming :/

pzelasko commented 1 year ago

Please change assert torch.testing.assert_close( to just torch.testing.assert_close( -- the actual assertion is inside, so then you'll see more info
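The trap here is that torch.testing.assert_close returns None on success and raises on mismatch, so wrapping it in an outer assert inverts the logic: the wrapper trips on the falsy None even when the tensors match, which is why the earlier run showed two identical tensors next to an AssertionError. A torch-free sketch of the failure mode (the stand-in `check_close` is mine):

```python
def check_close(a, b, atol=1e-5):
    """Stand-in for torch.testing.assert_close: raises on mismatch,
    returns None (which is falsy!) on success."""
    worst = max(abs(x - y) for x, y in zip(a, b))
    if worst > atol:
        raise AssertionError(f"not close: max abs diff {worst}")
    return None

m, expected = [1.1028], [1.1028]

# Wrong: check_close succeeds, returns None, and the outer assert
# fails on the falsy None -- the bug in the earlier attempt.
try:
    assert check_close(m, expected)
except AssertionError:
    pass  # always lands here, even though the values match

# Right: call it bare; it only raises when the values actually differ.
check_close(m, expected)
```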

jtrmal commented 1 year ago

ah!

jtrmal commented 1 year ago
======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 850, in test_rnnt_loss_empty_reference
    torch.testing.assert_close(
  File "/home/jtrmal/.local/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1342, in assert_close
    assert_equal(
  File "/home/jtrmal/.local/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 1 (100.0%)
Greatest absolute difference: 1.1028387546539307 at index (0,) (up to 1e-05 allowed)
Greatest relative difference: 1.0 at index (0,) (up to 1.3e-06 allowed)

----------------------------------------------------------------------
Ran 8 tests in 4.031s

FAILED (failures=1)
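For reference, the tolerances quoted in that output are torch's float32 defaults (atol 1e-5, rtol 1.3e-6), and the criterion assert_close and allclose apply elementwise is |actual - expected| <= atol + rtol * |expected|. A minimal sketch of that check against the failing element from the log:

```python
def is_close(actual, expected, rtol=1.3e-6, atol=1e-5):
    """The elementwise closeness criterion torch.allclose/assert_close use:
    |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)

# The failing element from the log: actual 0.0 vs expected 1.1028.
# The gap of 1.1028 dwarfs the allowed ~1.14e-5, hence the 100% mismatch.
print(is_close(0.0, 1.1028))     # False
print(is_close(1.1028, 1.1028))  # True
```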
pzelasko commented 1 year ago

Oops, not helpful with a single-element tensor 🙈

jtrmal commented 1 year ago

I tested cuDNN 8.3, 8.6, and 8.8 and could reproduce the failure on all three.

danpovey commented 1 year ago

We only added the ability to have an empty reference fairly recently, so it's possible it was never properly tested. Looking at that code, "expected" only seems to be written to if the device is CPU. [EDIT: I see now that this is how it is supposed to work; it is in a loop over devices.]

danpovey commented 1 year ago

If you have time, one thing you could help to debug with is this: in mutual_information.py line 393, after the following line

    # note, tot_probs is without grad.                                                                                                                                                                                     
    tot_probs = _k2.mutual_information_forward(px_tot, py_tot, boundary, p)

print out the value of p (p will get set by this function call). This may generate a lot of output before it crashes; direct to a file if you want.

jtrmal commented 1 year ago

It's a 3090. Will get in touch with Desh to see if he can dig deeper than I can.
y.

On Fri, Feb 17, 2023 at 3:15 AM Daniel Povey @.***> wrote:

What is your hardware?


desh2608 commented 1 year ago

@jtrmal I just tried running on CLSP grid with GPU and it passes:

10:31 $ pytest k2/python/tests/rnnt_loss_test.py
====================================================================== test session starts =======================================================================
platform linux -- Python 3.8.12, pytest-5.4.3, py-1.11.0, pluggy-0.13.1
rootdir: /export/c07/draj/mini_scale_2022/k2
plugins: typeguard-2.13.3, anyio-3.5.0, hypothesis-5.41.2
collected 8 items                                                                                                                                                

k2/python/tests/rnnt_loss_test.py ........                                                                                                                 [100%]

======================================================================== warnings summary ========================================================================
/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/stepwise.py:108
  /home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/stepwise.py:108: PytestCacheWarning: cache could not write path /export/c07/draj/mini_scale_2022/k2/.pytest_cache/v/cache/stepwise
    self.config.cache.set("cache/stepwise", [])

/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/cacheprovider.py:366
  /home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/cacheprovider.py:366: PytestCacheWarning: cache could not write path /export/c07/draj/mini_scale_2022/k2/.pytest_cache/v/cache/nodeids
    config.cache.set("cache/nodeids", self.cached_nodeids)

/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/cacheprovider.py:326
  /home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/cacheprovider.py:326: PytestCacheWarning: cache could not write path /export/c07/draj/mini_scale_2022/k2/.pytest_cache/v/cache/lastfailed
    config.cache.set("cache/lastfailed", self.lastfailed)

-- Docs: https://docs.pytest.org/en/latest/warnings.html
================================================================= 8 passed, 3 warnings in 41.40s =================================================================

(You can ignore the warnings --- the c07 node is read-only today due to some issues.)

danpovey commented 1 year ago

OK, might be something specific to Yenda's setup or where he is running it. @jtrmal can you please add that print statement that I mentioned above? (I edited it, you may not see it from email)

jtrmal commented 1 year ago

Attachments: cpu.log, gpu.log

I'm attaching logs from both cpu and gpu runs, obtained as

CUDA_VISIBLE_DEVICES=  ctest --rerun-failed --verbose > cpu.log
CUDA_VISIBLE_DEVICES=0 ctest --rerun-failed --verbose > gpu.log

The CPU run succeeded, the GPU run failed. Also, I had to modify the mutual_information.py file on line 160 instead -- the modification at line 397 didn't give any output (joint_mutual_information_recursion probably isn't called).