tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.
https://burn.dev
Apache License 2.0

Run MNIST example with WGPU generates invalid model file #1171

Open J-F-Liu opened 8 months ago

J-F-Liu commented 8 months ago

Pull the latest code and run cargo run --example mnist --release --features wgpu

  1. The training accuracy drops around iteration 90 (screenshot: Train MNIST 2024-01-24 174400).
  2. Examining the generated model.bin file shows mostly dummy data (screenshot attached; one way to eyeball the raw file is sketched right after this list).
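A minimal sketch of that kind of inspection (my own illustration, not how the screenshot above was produced; the artifact path is an assumption about where the MNIST example writes its model):

```rust
// Hypothetical inspection helper (not part of the report): hex-dump the first
// bytes of the saved record so long runs of identical bytes stand out.
// The path below is an assumption about the MNIST example's artifact directory.
use std::fs;

fn main() -> std::io::Result<()> {
    let bytes = fs::read("/tmp/burn-example-mnist/model.bin")?;
    println!("file size: {} bytes", bytes.len());
    for (i, chunk) in bytes.chunks(16).take(16).enumerate() {
        let hex: Vec<String> = chunk.iter().map(|b| format!("{b:02x}")).collect();
        println!("{:08x}  {}", i * 16, hex.join(" "));
    }
    Ok(())
}
```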

My CPU is an AMD Ryzen 7 6800H with Radeon Graphics; this bug also existed before the upgrade to wgpu 0.19.

nathanielsimard commented 8 months ago

Hmm, I can't reproduce the problem on my nvidia card. Could you run the wgpu test suite on your GPU to see if an operation fails? You can simply run cargo test in the burn-wgpu directory.

J-F-Liu commented 8 months ago

Yes, here is the result:

failures:

---- fusion::base::tests::maxmin::tests::test_mean_dim_2d stdout ----
thread 'fusion::base::tests::maxmin::tests::test_mean_dim_2d' panicked at burn-wgpu\src\fusion\base.rs:187:5:
assertion `left == right` failed
  left: Data { value: [1.0, 4.0], shape: Shape { dims: [2, 1] } }
 right: Data { value: [0.99999994, 3.9999998], shape: Shape { dims: [2, 1] } }

---- kernel::matmul::tiling2d::unpadded::tests::test_matmul_irregular_shape stdout ----
thread 'kernel::matmul::tiling2d::unpadded::tests::test_matmul_irregular_shape' panicked at burn-wgpu\src\kernel\matmul\utils.rs:65:33:
Tensors are not approx eq:
  => Position 22372: 4.402349472045898 != 4.3502044677734375 | difference 0.05214500427246094 > tolerance 0.0010000000000000002
  => Position 22373: -0.9585940837860107 != -1.0070807933807373 | difference 0.04848670959472656 > tolerance 0.0010000000000000002
  => Position 22374: -8.618410110473633 != -9.033252716064453 | difference 0.4148426055908203 > tolerance 0.0010000000000000002
  => Position 22375: 4.302424907684326 != 4.226462364196777 | difference 0.07596254348754883 > tolerance 0.0010000000000000002
  => Position 22376: 5.406569004058838 != 5.009387016296387 | difference 0.39718198776245117 > tolerance 0.0010000000000000002
11085 more errors...

---- kernel::prng::normal::tests::empirical_mean_close_to_expectation stdout ----
thread 'kernel::prng::normal::tests::empirical_mean_close_to_expectation' panicked at burn-wgpu\src\kernel\prng\normal.rs:93:24:
Tensors are not approx eq:
  => Position 0: 8.946138381958008 != 10 | difference 1.0538616180419922 > tolerance 0.1

---- kernel::reduce::reduction_shared_memory::tests::reduction_sum_dim_shared_memory_small stdout ----
thread 'kernel::reduce::reduction_shared_memory::tests::reduction_sum_dim_shared_memory_small' panicked at burn-wgpu\src\kernel\reduce\reduction_shared_memory.rs:136:29:
Tensors are not approx eq:
  => Position 0: 351.03289794921875 != 288.3531799316406 | difference 62.679718017578125 > tolerance 0.0010000000000000002

---- kernel::reduce::reduction_shared_memory::tests::reduction_sum_dim_shared_memory_large stdout ----
thread 'kernel::reduce::reduction_shared_memory::tests::reduction_sum_dim_shared_memory_large' panicked at burn-wgpu\src\kernel\reduce\reduction_shared_memory.rs:177:29:
Tensors are not approx eq:
  => Position 684: 22.973115921020508 != 17.27593421936035 | difference 5.697181701660156 > tolerance 0.0010000000000000002
  => Position 685: 25.75684928894043 != 17.05587387084961 | difference 8.70097541809082 > tolerance 0.0010000000000000002
  => Position 686: 24.88041114807129 != 21.817140579223633 | difference 3.0632705688476563 > tolerance 0.0010000000000000002
  => Position 687: 25.581012725830078 != 21.639711380004883 | difference 3.9413013458251953 > tolerance 0.0010000000000000002
  => Position 688: 24.266672134399414 != 23.075439453125 | difference 1.191232681274414 > tolerance 0.0010000000000000002
20 more errors...

---- kernel::reduce::reduction::tests::reduction_sum_should_work_with_multiple_invocations stdout ----
thread 'kernel::reduce::reduction::tests::reduction_sum_should_work_with_multiple_invocations' panicked at burn-wgpu\src\kernel\reduce\reduction.rs:193:29:
Tensors are not approx eq:
  => Position 0: 763.541748046875 != 634.2994384765625 | difference 129.2423095703125 > tolerance 0.0010000000000000002

---- tests::maxmin::tests::test_mean_dim_2d stdout ----
thread 'tests::maxmin::tests::test_mean_dim_2d' panicked at burn-wgpu\src\lib.rs:49:5:
assertion `left == right` failed
  left: Data { value: [1.0, 4.0], shape: Shape { dims: [2, 1] } }
 right: Data { value: [0.99999994, 3.9999998], shape: Shape { dims: [2, 1] } }

failures:
    fusion::base::tests::maxmin::tests::test_mean_dim_2d
    kernel::matmul::tiling2d::unpadded::tests::test_matmul_irregular_shape
    kernel::prng::normal::tests::empirical_mean_close_to_expectation
    kernel::reduce::reduction::tests::reduction_sum_should_work_with_multiple_invocations
    kernel::reduce::reduction_shared_memory::tests::reduction_sum_dim_shared_memory_large
    kernel::reduce::reduction_shared_memory::tests::reduction_sum_dim_shared_memory_small
    tests::maxmin::tests::test_mean_dim_2d

test result: FAILED. 1241 passed; 7 failed; 0 ignored; 0 measured; 0 filtered out; finished in 27.42s
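
Reading these failures: the two test_mean_dim_2d cases use an exact left == right assertion, and the reported gap (1.0 vs 0.99999994) is within normal f32 rounding, whereas the matmul and reduction tests compare with an absolute tolerance of roughly 1e-3 and are off by anywhere from 0.05 to 129, which looks like an actual kernel problem on this GPU rather than rounding noise. As a rough illustration of the kind of check behind the "Tensors are not approx eq" messages (my own sketch, not Burn's test helper):

```rust
// Illustration only (not Burn's actual assertion helper): flag positions where
// the absolute difference exceeds a fixed tolerance, as in the log above.
fn approx_eq(left: &[f32], right: &[f32], tolerance: f64) -> bool {
    let mut ok = true;
    for (i, (l, r)) in left.iter().zip(right.iter()).enumerate() {
        let diff = (*l as f64 - *r as f64).abs();
        if diff > tolerance {
            println!("Position {i}: {l} != {r} | difference {diff} > tolerance {tolerance}");
            ok = false;
        }
    }
    ok
}

fn main() {
    // The mean_dim_2d case: a ~1e-7 gap would pass a 1e-3 tolerance,
    // but fails the exact equality assertion that test actually uses.
    assert!(approx_eq(&[1.0, 4.0], &[0.99999994, 3.9999998], 1e-3));
    // The shared-memory reduction case: a gap of ~62.68 fails by a wide margin.
    assert!(!approx_eq(&[351.0329], &[288.35318], 1e-3));
}
```

With a check like this, the mean results above would pass, while every matmul and reduction mismatch would still fail.
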
bionicles commented 6 months ago

Noticed while testing the mnist example: I can't seem to get the wgpu backend to use the GPU at all:

(Screenshot: CPU/GPU utilization while running the tests.)

Ran this test 3x, and there seem to be only 3 CPU spikes; the earlier GPU spike seems unrelated to invoking 'cargo test' within 'burn/crates/burn-wgpu'.

System: i9-13900K CPU, 64 GB RAM, lsb_release: Ubuntu 22.04.3 LTS

nvidia-smi
Mon Mar 11 18:33:57 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.60.01              Driver Version: 551.76      CUDA Version: 12.4         |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:01:00.0  On |                  Off |
|  0%   47C    P5             62W / 450W  |   1173MiB / 24564MiB   |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

Any ideas why wgpu wouldn't use the GPU?
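
For what it's worth, a small standalone check (my own sketch, not from the Burn repo; written against wgpu 0.19, so details may differ in other versions) to see which adapters wgpu can enumerate on this machine:

```rust
// Standalone sketch (assumes wgpu 0.19 as a dependency; not part of the Burn
// repo): list every adapter wgpu can see and what backend it reports, to
// confirm the RTX 4090 is actually visible.
fn main() {
    let instance = wgpu::Instance::new(wgpu::InstanceDescriptor::default());
    for adapter in instance.enumerate_adapters(wgpu::Backends::all()) {
        let info = adapter.get_info();
        println!(
            "{:?} | {} | {:?} | driver: {}",
            info.backend, info.name, info.device_type, info.driver
        );
    }
}
```

If the 4090 only shows up via a software adapter (e.g. llvmpipe) or not at all, that would match the flat GPU utilization; on Linux that could mean the NVIDIA Vulkan ICD isn't being picked up.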