ledatelescope / bifrost

A stream processing framework for high-throughput applications.

Fail tests test_fft.TestFFT on Ubuntu 20.04 #180

Closed CedricDViou closed 2 years ago

CedricDViou commented 2 years ago

Hello, I'm starting to use bifrost and I'm happily working through the tutorials. Out of curiosity, and to check my install, I ran make test. Most tests passed, but the run ended with FAILED (failures=3, skipped=4):

FAIL: test_c2r_1D (test_fft.TestFFT)
Traceback (most recent call last):
  File "/home/cedric/tmp/bifrost/test/test_fft.py", line 206, in test_c2r_1D
    self.run_test_c2r(self.shape1D, [0])
  File "/home/cedric/tmp/bifrost/test/test_fft.py", line 148, in run_test_c2r
    self.run_test_c2r_impl(shape, axes)
  File "/home/cedric/tmp/bifrost/test/test_fft.py", line 141, in run_test_c2r_impl
    compare(odata.copy('system'), known_result)
  File "/home/cedric/tmp/bifrost/test/test_fft.py", line 51, in compare
    np.testing.assert_allclose(result, gold, rtol=RTOL, atol=MTOL * absmean)
  File "/home/cedric/anaconda3/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1528, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/cedric/anaconda3/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 842, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.1, atol=0.00462357
Mismatched elements: 1 / 16777216 (5.96e-06%)
Max absolute difference: 0.01639435
Max relative difference: 1.93911746

FAIL: test_c2r_2D (test_fft.TestFFT)
AssertionError:
Not equal to tolerance rtol=0.1, atol=0.00231149
Mismatched elements: 4186048 / 4194304 (99.8%)
Max absolute difference: 492620.22392237
Max relative difference: 39550830.05759069

FAIL: test_c2r_3D (test_fft.TestFFT)
AssertionError:
Not equal to tolerance rtol=0.1, atol=0.00163087
Mismatched elements: 2080441 / 2097152 (99.2%)
Max absolute difference: 88917.37869481
Max relative difference: 6823521.1182115
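
For readers unfamiliar with the check that fails here: the traceback shows the test calling np.testing.assert_allclose with rtol=RTOL and atol=MTOL * absmean. The sketch below reproduces that element-wise criterion in plain NumPy so the "Mismatched elements" counts are easier to interpret. It is not the bifrost test code: the value of MTOL and the way absmean is computed are not shown in the report and are assumptions, and mismatch_count is a hypothetical helper.

```python
import numpy as np

RTOL = 1e-1  # matches rtol=0.1 printed in the failure messages
MTOL = 1e-6  # assumed value for illustration; not shown in the report


def mismatch_count(result, gold, rtol=RTOL, mtol=MTOL):
    """Count elements that would fail assert_allclose(result, gold, rtol, atol).

    Hypothetical helper. The absolute tolerance is scaled by the mean
    magnitude of the reference data (assumed definition of absmean).
    """
    result = np.asarray(result, dtype=np.float64)
    gold = np.asarray(gold, dtype=np.float64)
    absmean = np.abs(gold).mean()
    atol = mtol * absmean
    # assert_allclose passes an element when |result - gold| <= atol + rtol * |gold|
    bad = np.abs(result - gold) > atol + rtol * np.abs(gold)
    return int(bad.sum())


rng = np.random.default_rng(0)
gold = rng.normal(size=1024)

within_tol = gold * 1.01                 # 1% relative error everywhere
one_outlier = gold.copy()
one_outlier[0] += 1.0 + abs(gold[0])     # a single clearly wrong element

print(mismatch_count(gold, gold))
print(mismatch_count(within_tol, gold))
print(mismatch_count(one_outlier, gold))
```

Under this criterion a single bad element out of millions (as in the 1D case) is enough to fail the test, whereas the 2D and 3D cases, with over 99% of elements mismatched, point to output that is wrong as a whole.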

This was tested on Ubuntu 20.04.4 LTS with Python 3.8.8.

Tell me if I can help. Regards, Cedric

league commented 2 years ago

Hi Cedric, thanks for the report. We have sometimes seen failures such as these that are dependent on GPU card and architecture settings. Can you provide details about your GPU hardware (output of nvidia-smi for example) and the output of ./configure (or the config.log file)?
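
A minimal sketch of collecting the requested GPU details from Python, assuming nvidia-smi is on the PATH. The --query-gpu and --format=csv,noheader options are standard nvidia-smi flags; the helper names query_gpus and parse_gpu_csv are hypothetical, and the sample line is hand-written from values quoted in this thread rather than captured output.

```python
import subprocess

# Fields to request from nvidia-smi (all are standard --query-gpu properties).
QUERY_FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_csv(text):
    """Parse 'csv,noheader' output into one dict per GPU (hypothetical helper)."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def query_gpus():
    """Run nvidia-smi and return parsed GPU info, or [] if it is unavailable."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(out)


# Hand-written sample line for illustration, not captured output.
sample = "Quadro T2000, 515.43.04, 4096 MiB\n"
print(parse_gpu_csv(sample))
```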

CedricDViou commented 2 years ago

Thanks for the quick feedback.

| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro T2000        On   | 00000000:01:00.0 Off |                  N/A |
| N/A   40C    P0    17W /  N/A |     10MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Attached: configure_stdout.txt, config.log

I hope this helps.

league commented 2 years ago

Is it just these 3 failures? They all use complex-to-real transforms (which I failed to notice when looking at it on my phone this morning), and there is a known issue with C2R in cuFFT on certain cards and/or certain CUDA versions, so I guess we're hitting it here. I'll gather together some possibly-related info we've stumbled across… I hope we can find a workaround for this one, but Jayce will know more.
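
For context on what a complex-to-real (C2R) transform computes, here is a CPU-only sketch using NumPy. It is not the bifrost or cuFFT code path and says nothing about the cause of the failures above; it only shows the expected round trip that a C2R result can be compared against. The function name c2r_reference is hypothetical.

```python
import numpy as np


def c2r_reference(half_spectrum, n):
    """CPU reference for a 1D complex-to-real inverse FFT (hypothetical helper).

    half_spectrum holds the n // 2 + 1 non-redundant frequency bins of a real
    signal of length n, as produced by a real-to-complex forward transform.
    The result uses NumPy's normalization (the inverse divides by n).
    """
    return np.fft.irfft(half_spectrum, n=n)


rng = np.random.default_rng(0)
n = 16
signal = rng.normal(size=n)

# Real-to-complex forward transform keeps only n // 2 + 1 bins, because the
# spectrum of a real signal is Hermitian-symmetric.
half_spectrum = np.fft.rfft(signal)

# Complex-to-real inverse transform recovers the original real signal.
roundtrip = c2r_reference(half_spectrum, n)

print(half_spectrum.shape)
print(np.allclose(roundtrip, signal))
```

Note that cuFFT transforms are unnormalized, so a forward transform followed by an inverse on the GPU returns the input scaled by n; any comparison against a NumPy reference has to account for that factor.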

CedricDViou commented 2 years ago

Yes, just these 3 failures. I guess my install is mostly fine then and that I can play with the tutorials. Thanks for your feedback.