tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.
https://burn.dev
Apache License 2.0

Run MNIST example with WGPU generates invalid model file #1171

Open J-F-Liu opened 8 months ago

J-F-Liu commented 8 months ago

Pull the latest code and run cargo run --example mnist --release --features wgpu

  1. The training accuracy drops around iteration 90 (screenshot: Train MNIST 2024-01-24 174400).
  2. Examining the generated model.bin file shows mostly dummy data (screenshot attached; one way to eyeball the raw file is sketched right after this list).
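A minimal sketch of that kind of inspection (my own illustration, not how the screenshot above was produced; the artifact path is an assumption about where the MNIST example writes its model):

```rust
// Hypothetical inspection helper (not part of the report): hex-dump the first
// bytes of the saved record so long runs of identical bytes stand out.
// The path below is an assumption about the MNIST example's artifact directory.
use std::fs;

fn main() -> std::io::Result<()> {
    let bytes = fs::read("/tmp/burn-example-mnist/model.bin")?;
    println!("file size: {} bytes", bytes.len());
    for (i, chunk) in bytes.chunks(16).take(16).enumerate() {
        let hex: Vec<String> = chunk.iter().map(|b| format!("{b:02x}")).collect();
        println!("{:08x}  {}", i * 16, hex.join(" "));
    }
    Ok(())
}
```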

My CPU is an AMD Ryzen 7 6800H with Radeon Graphics; this bug also existed before the upgrade to wgpu 0.19.

nathanielsimard commented 8 months ago

Hmm, I can't reproduce the problem on my nvidia card. Could you run the wgpu test suite on your GPU to see if an operation fails? You can simply run cargo test in the burn-wgpu directory.

J-F-Liu commented 8 months ago

Yes, here is the result:

failures:

---- fusion::base::tests::maxmin::tests::test_mean_dim_2d stdout ----
thread 'fusion::base::tests::maxmin::tests::test_mean_dim_2d' panicked at burn-wgpu\src\fusion\base.rs:187:5:
assertion `left == right` failed
  left: Data { value: [1.0, 4.0], shape: Shape { dims: [2, 1] } }
 right: Data { value: [0.99999994, 3.9999998], shape: Shape { dims: [2, 1] } }

---- kernel::matmul::tiling2d::unpadded::tests::test_matmul_irregular_shape stdout ----
thread 'kernel::matmul::tiling2d::unpadded::tests::test_matmul_irregular_shape' panicked at burn-wgpu\src\kernel\matmul\utils.rs:65:33:
Tensors are not approx eq:
  => Position 22372: 4.402349472045898 != 4.3502044677734375 | difference 0.05214500427246094 > tolerance 0.0010000000000000002
  => Position 22373: -0.9585940837860107 != -1.0070807933807373 | difference 0.04848670959472656 > tolerance 0.0010000000000000002
  => Position 22374: -8.618410110473633 != -9.033252716064453 | difference 0.4148426055908203 > tolerance 0.0010000000000000002
  => Position 22375: 4.302424907684326 != 4.226462364196777 | difference 0.07596254348754883 > tolerance 0.0010000000000000002
  => Position 22376: 5.406569004058838 != 5.009387016296387 | difference 0.39718198776245117 > tolerance 0.0010000000000000002
11085 more errors...

---- kernel::prng::normal::tests::empirical_mean_close_to_expectation stdout ----
thread 'kernel::prng::normal::tests::empirical_mean_close_to_expectation' panicked at burn-wgpu\src\kernel\prng\normal.rs:93:24:
Tensors are not approx eq:
  => Position 0: 8.946138381958008 != 10 | difference 1.0538616180419922 > tolerance 0.1

---- kernel::reduce::reduction_shared_memory::tests::reduction_sum_dim_shared_memory_small stdout ----
thread 'kernel::reduce::reduction_shared_memory::tests::reduction_sum_dim_shared_memory_small' panicked at burn-wgpu\src\kernel\reduce\reduction_shared_memory.rs:136:29:
Tensors are not approx eq:
  => Position 0: 351.03289794921875 != 288.3531799316406 | difference 62.679718017578125 > tolerance 0.0010000000000000002

---- kernel::reduce::reduction_shared_memory::tests::reduction_sum_dim_shared_memory_large stdout ----
thread 'kernel::reduce::reduction_shared_memory::tests::reduction_sum_dim_shared_memory_large' panicked at burn-wgpu\src\kernel\reduce\reduction_shared_memory.rs:177:29:
Tensors are not approx eq:
  => Position 684: 22.973115921020508 != 17.27593421936035 | difference 5.697181701660156 > tolerance 0.0010000000000000002
  => Position 685: 25.75684928894043 != 17.05587387084961 | difference 8.70097541809082 > tolerance 0.0010000000000000002
  => Position 686: 24.88041114807129 != 21.817140579223633 | difference 3.0632705688476563 > tolerance 0.0010000000000000002
  => Position 687: 25.581012725830078 != 21.639711380004883 | difference 3.9413013458251953 > tolerance 0.0010000000000000002
  => Position 688: 24.266672134399414 != 23.075439453125 | difference 1.191232681274414 > tolerance 0.0010000000000000002
20 more errors...

---- kernel::reduce::reduction::tests::reduction_sum_should_work_with_multiple_invocations stdout ----
thread 'kernel::reduce::reduction::tests::reduction_sum_should_work_with_multiple_invocations' panicked at burn-wgpu\src\kernel\reduce\reduction.rs:193:29:
Tensors are not approx eq:
  => Position 0: 763.541748046875 != 634.2994384765625 | difference 129.2423095703125 > tolerance 0.0010000000000000002

---- tests::maxmin::tests::test_mean_dim_2d stdout ----
thread 'tests::maxmin::tests::test_mean_dim_2d' panicked at burn-wgpu\src\lib.rs:49:5:
assertion `left == right` failed
  left: Data { value: [1.0, 4.0], shape: Shape { dims: [2, 1] } }
 right: Data { value: [0.99999994, 3.9999998], shape: Shape { dims: [2, 1] } }

failures:
    fusion::base::tests::maxmin::tests::test_mean_dim_2d
    kernel::matmul::tiling2d::unpadded::tests::test_matmul_irregular_shape
    kernel::prng::normal::tests::empirical_mean_close_to_expectation
    kernel::reduce::reduction::tests::reduction_sum_should_work_with_multiple_invocations
    kernel::reduce::reduction_shared_memory::tests::reduction_sum_dim_shared_memory_large
    kernel::reduce::reduction_shared_memory::tests::reduction_sum_dim_shared_memory_small
    tests::maxmin::tests::test_mean_dim_2d

test result: FAILED. 1241 passed; 7 failed; 0 ignored; 0 measured; 0 filtered out; finished in 27.42s
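
Reading these failures: the two test_mean_dim_2d cases use an exact left == right assertion, and the reported gap (1.0 vs 0.99999994) is within normal f32 rounding, whereas the matmul and reduction tests compare with an absolute tolerance of roughly 1e-3 and are off by anywhere from 0.05 to 129, which looks like an actual kernel problem on this GPU rather than rounding noise. As a rough illustration of the kind of check behind the "Tensors are not approx eq" messages (my own sketch, not Burn's test helper):

```rust
// Illustration only (not Burn's actual assertion helper): flag positions where
// the absolute difference exceeds a fixed tolerance, as in the log above.
fn approx_eq(left: &[f32], right: &[f32], tolerance: f64) -> bool {
    let mut ok = true;
    for (i, (l, r)) in left.iter().zip(right.iter()).enumerate() {
        let diff = (*l as f64 - *r as f64).abs();
        if diff > tolerance {
            println!("Position {i}: {l} != {r} | difference {diff} > tolerance {tolerance}");
            ok = false;
        }
    }
    ok
}

fn main() {
    // The mean_dim_2d case: a ~1e-7 gap would pass a 1e-3 tolerance,
    // but fails the exact equality assertion that test actually uses.
    assert!(approx_eq(&[1.0, 4.0], &[0.99999994, 3.9999998], 1e-3));
    // The shared-memory reduction case: a gap of ~62.68 fails by a wide margin.
    assert!(!approx_eq(&[351.0329], &[288.35318], 1e-3));
}
```

With a check like this, the mean results above would pass, while every matmul and reduction mismatch would still fail.
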
bionicles commented 6 months ago

Noticed while testing the mnist example: I can't seem to get the wgpu backend to use the GPU at all:

(Screenshot: CPU/GPU utilization while running the tests.)

Ran this test 3x, and there seem to be only 3 CPU spikes; the earlier GPU spike seems unrelated to invoking 'cargo test' within 'burn/crates/burn-wgpu'.

System: i9-13900K CPU, 64 GB RAM, lsb_release: Ubuntu 22.04.3 LTS

nvidia-smi
Mon Mar 11 18:33:57 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.60.01              Driver Version: 551.76      CUDA Version: 12.4         |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:01:00.0  On |                  Off |
|  0%   47C    P5             62W / 450W  |   1173MiB / 24564MiB   |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

Any ideas why wgpu wouldn't use the GPU?
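
For what it's worth, a small standalone check (my own sketch, not from the Burn repo; written against wgpu 0.19, so details may differ in other versions) to see which adapters wgpu can enumerate on this machine:

```rust
// Standalone sketch (assumes wgpu 0.19 as a dependency; not part of the Burn
// repo): list every adapter wgpu can see and what backend it reports, to
// confirm the RTX 4090 is actually visible.
fn main() {
    let instance = wgpu::Instance::new(wgpu::InstanceDescriptor::default());
    for adapter in instance.enumerate_adapters(wgpu::Backends::all()) {
        let info = adapter.get_info();
        println!(
            "{:?} | {} | {:?} | driver: {}",
            info.backend, info.name, info.device_type, info.driver
        );
    }
}
```

If the 4090 only shows up via a software adapter (e.g. llvmpipe) or not at all, that would match the flat GPU utilization; on Linux that could mean the NVIDIA Vulkan ICD isn't being picked up.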