ihaque / memtestG80

CUDA-based memory tester for NVIDIA GPUs

Random blocks test fails on over 16400 MiB memory #7

Open jsc2 opened 5 years ago

jsc2 commented 5 years ago

I have just tried testing on two new Quadro P6000 cards. Both return the same errors when testing more than 16400 MiB of memory.

Below are the results for 16400 MiB (passing), 16401 MiB (failing), and 20000 MiB (failing):

./memtestG80 16400 1

 -------------------------------------------------------------
 |                      MemtestG80 v1.00                     |
 |                                                           |
 | Usage: memtestG80 [flags] [MB GPU RAM to test] [# iters]  |
 |                                                           |
 | Defaults: GPU 0, 128MB RAM, 50 test iterations            |
 | Amount of tested RAM will be rounded up to nearest 2MB    |
 -------------------------------------------------------------

  Available flags:
    --gpu N ,-g N : run test on the Nth (from 0) CUDA GPU
    --license ,-l : show license terms for this build

Running 1 iterations of tests over 16400 MB of GPU memory on card 0: Quadro P6000

Running memory bandwidth test over 20 iterations of 8200 MB transfers... Estimated bandwidth 328000000.00 MB/s

Test iteration 1 (GPU 0, 16400 MiB): 0 errors so far
  Moving Inversions (ones and zeros): 0 errors (256 ms)
  Memtest86 Walking 8-bit: 0 errors (2049 ms)
  True Walking zeros (8-bit): 0 errors (1011 ms)
  True Walking ones (8-bit): 0 errors (1012 ms)
  Moving Inversions (random): 0 errors (258 ms)
  Memtest86 Walking zeros (32-bit): 0 errors (4050 ms)
  Memtest86 Walking ones (32-bit): 0 errors (4051 ms)
  Random blocks: 0 errors (456 ms)
  Memtest86 Modulo-20: 0 errors (23933 ms)
  Logic (one iteration): 0 errors (129 ms)
  Logic (4 iterations): 0 errors (130 ms)
  Logic (shared memory, one iteration): 0 errors (129 ms)
  Logic (shared-memory, 4 iterations): 0 errors (130 ms)

Final error count after 1 iterations over 16400 MiB of GPU memory: 0 errors

./memtestG80 16401 1

 -------------------------------------------------------------
 |                      MemtestG80 v1.00                     |
 |                                                           |
 | Usage: memtestG80 [flags] [MB GPU RAM to test] [# iters]  |
 |                                                           |
 | Defaults: GPU 0, 128MB RAM, 50 test iterations            |
 | Amount of tested RAM will be rounded up to nearest 2MB    |
 -------------------------------------------------------------

  Available flags:
    --gpu N ,-g N : run test on the Nth (from 0) CUDA GPU
    --license ,-l : show license terms for this build

Running 1 iterations of tests over 16402 MB of GPU memory on card 0: Quadro P6000

Running memory bandwidth test over 20 iterations of 8201 MB transfers... Estimated bandwidth 328040000.00 MB/s

Test iteration 1 (GPU 0, 16402 MiB): 0 errors so far
  Moving Inversions (ones and zeros): 0 errors (257 ms)
  Memtest86 Walking 8-bit: 0 errors (2052 ms)
  True Walking zeros (8-bit): 0 errors (1010 ms)
  True Walking ones (8-bit): 0 errors (1014 ms)
  Moving Inversions (random): 0 errors (257 ms)
  Memtest86 Walking zeros (32-bit): 0 errors (4050 ms)
  Memtest86 Walking ones (32-bit): 0 errors (4051 ms)
  Random blocks: 67198032 errors (457 ms)
  Memtest86 Modulo-20: 0 errors (23952 ms)
  Logic (one iteration): 0 errors (128 ms)
  Logic (4 iterations): 0 errors (130 ms)
  Logic (shared memory, one iteration): 0 errors (129 ms)
  Logic (shared-memory, 4 iterations): 0 errors (130 ms)

Final error count after 1 iterations over 16402 MiB of GPU memory: 67198032 errors

./memtestG80 20000 1

 -------------------------------------------------------------
 |                      MemtestG80 v1.00                     |
 |                                                           |
 | Usage: memtestG80 [flags] [MB GPU RAM to test] [# iters]  |
 |                                                           |
 | Defaults: GPU 0, 128MB RAM, 50 test iterations            |
 | Amount of tested RAM will be rounded up to nearest 2MB    |
 -------------------------------------------------------------

  Available flags:
    --gpu N ,-g N : run test on the Nth (from 0) CUDA GPU
    --license ,-l : show license terms for this build

Running 1 iterations of tests over 20000 MB of GPU memory on card 0: Quadro P6000

Running memory bandwidth test over 20 iterations of 10000 MB transfers... Estimated bandwidth 2030456.85 MB/s

Test iteration 1 (GPU 0, 20000 MiB): 0 errors so far
  Moving Inversions (ones and zeros): 0 errors (313 ms)
  Memtest86 Walking 8-bit: 0 errors (2499 ms)
  True Walking zeros (8-bit): 0 errors (1232 ms)
  True Walking ones (8-bit): 0 errors (1234 ms)
  Moving Inversions (random): 0 errors (314 ms)
  Memtest86 Walking zeros (32-bit): 0 errors (4932 ms)
  Memtest86 Walking ones (32-bit): 0 errors (4933 ms)
  Random blocks: 2270811672 errors (557 ms)
  Memtest86 Modulo-20: 0 errors (29190 ms)
  Logic (one iteration): 0 errors (157 ms)
  Logic (4 iterations): 0 errors (158 ms)
  Logic (shared memory, one iteration): 0 errors (157 ms)
  Logic (shared-memory, 4 iterations): 0 errors (157 ms)

Final error count after 1 iterations over 20000 MiB of GPU memory: 2270811672 errors

The number of errors is the same for each card. All other tests pass, which makes me think this is a bug and not a failure of the cards.

This is a great tool and has helped me find GPUs with problems. Thank you!

ihaque commented 5 years ago

My guess is that this is related to #3. When I originally wrote this tool, no GPUs had even 4GB of memory, so there may be 32-bitness issues sitting around.

Unfortunately I haven't been actively working on this tool for over 5 years now (and I don't have a GPU with so much RAM), so I won't be able to fix this myself. If someone is interested in submitting a pull request with a fix I'd be happy to merge it, though.
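For anyone picking this up: 2^32 four-byte words is exactly 16384 MiB, just below the reported pass/fail boundary, so a wrapped 32-bit word count or index is a plausible culprit (the boundary doesn't line up perfectly, so this is a guess, not a diagnosis). The sketch below is a hypothetical illustration of that failure mode, not code from the memtestG80 source; `writePattern32`/`writePattern64` are made-up names.

```cuda
#include <cstdio>

// Hypothetical sketch of a "32-bitness" bug -- NOT actual memtestG80 code.
// A 16402 MiB region holds 4,299,685,888 four-byte words, which exceeds
// 2^32 = 4,294,967,296, so any 32-bit word index or count silently wraps.

__global__ void writePattern32(unsigned int *base, unsigned int nWords,
                               unsigned int pattern) {
    // 32-bit grid-stride loop: correct below 2^32 words (16384 MiB),
    // wraps and leaves part of the region untouched above it.
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < nWords; i += gridDim.x * blockDim.x)
        base[i] = pattern;
}

__global__ void writePattern64(unsigned int *base, size_t nWords,
                               unsigned int pattern) {
    // 64-bit grid-stride loop: indexes the whole region at any size.
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < nWords; i += (size_t)gridDim.x * blockDim.x)
        base[i] = pattern;
}

int main() {
    size_t bytes = 16402ull << 20;                 // 16402 MiB under test
    size_t words = bytes / sizeof(unsigned int);
    unsigned int truncated = (unsigned int)words;  // host-side 32-bit count
    printf("words needed: %zu, after 32-bit truncation: %u\n",
           words, truncated);                      // 4299685888 vs 4718592
    return 0;
}
```

Auditing the random-blocks kernel and its host-side size bookkeeping for unsigned int counts like this would be a natural first step for a PR.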

ihaque commented 5 years ago

It appears there may be a bug with random blocks that is separate from the memory size issue:

https://forums.geforce.com/default/topic/1080529/rtx-strix-2080-errors-with-memtestg80-help/

My guess would be some kind of synchronization/warp size issue on newer GPUs, though it's also possible that the code is making assumptions about pointer size that are no longer true.
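One concrete possibility on the synchronization front (an assumption on my part; I haven't traced this through the memtestG80 kernels): Volta and newer GPUs use independent thread scheduling, so code written with the old implicit warp-synchronous idiom can race. Since CUDA 9, __syncwarp() is the prescribed fix. A minimal before/after sketch with hypothetical names:

```cuda
// Hypothetical sketch -- NOT actual memtestG80 code. Assumes a shared-memory
// array of at least 64 elements and a block of at least 64 threads.

// Pre-Volta idiom: threads in a warp were assumed to execute in lockstep,
// so no barrier was used for strides below the warp size (32). Under
// Volta+ independent thread scheduling this can read stale values.
__device__ void reduceWarpOld(volatile unsigned int *sdata, int tid) {
    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }
}

// Volta-safe version: __syncwarp() re-converges the warp between steps.
__device__ void reduceWarpVolta(volatile unsigned int *sdata, int tid) {
    if (tid < 32) {
        sdata[tid] += sdata[tid + 32]; __syncwarp();
        sdata[tid] += sdata[tid + 16]; __syncwarp();
        sdata[tid] += sdata[tid + 8];  __syncwarp();
        sdata[tid] += sdata[tid + 4];  __syncwarp();
        sdata[tid] += sdata[tid + 2];  __syncwarp();
        sdata[tid] += sdata[tid + 1];  __syncwarp();
    }
}
```

If the random-blocks test uses any such implicit warp-synchronous pattern, that would explain errors appearing only on Turing-class cards like the RTX 2080 in the linked thread.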