tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.
https://burn.dev
Apache License 2.0

Text classification example gives "Shader validation error" when run on multiple GPUs #1745

Open joshhansen opened 5 months ago

joshhansen commented 5 months ago

Describe the bug

Running the text classification example's AG News training step on multiple discrete GPUs fails with a "Shader validation error".

This error partially overlaps with the one in #1088.

To Reproduce

On a system with two or more discrete GPUs:

git clone https://github.com/tracel-ai/burn.git
cd burn/examples/text-classification

Edit examples/ag-news-train.rs like so:

-        launch::<Autodiff<Wgpu<AutoGraphicsApi, ElemType, i32>>>(vec![WgpuDevice::default()]);
+        launch::<Autodiff<Wgpu<AutoGraphicsApi, ElemType, i32>>>(vec![
+            WgpuDevice::DiscreteGpu(0),
+            WgpuDevice::DiscreteGpu(1),
+        ]);

cargo run --example ag-news-train --features wgpu
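
For reference, a minimal sketch of what the modified wgpu section of ag-news-train.rs might look like after this edit (module layout and import paths are assumed from the Burn version the example targeted at the time, not verified):

#[cfg(feature = "wgpu")]
mod wgpu {
    use crate::{launch, ElemType};
    use burn::backend::{
        wgpu::{AutoGraphicsApi, Wgpu, WgpuDevice},
        Autodiff,
    };

    pub fn run() {
        // Target the two discrete GPUs explicitly instead of WgpuDevice::default().
        launch::<Autodiff<Wgpu<AutoGraphicsApi, ElemType, i32>>>(vec![
            WgpuDevice::DiscreteGpu(0),
            WgpuDevice::DiscreteGpu(1),
        ]);
    }
}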

Expected behavior

The training proceeds, utilizing both GPUs.


nathanielsimard commented 5 months ago

Looking at the experiment.log, the problem seems to come from Vulkan's validation layer rather than from a multi-device error. I tested on my system and I can run the training with multiple devices. You could try disabling Vulkan's validation layer (branch wgpu-no-validation).

Also, you could test using the LibTorch backend instead.
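
For reference, a rough sketch of what pointing that same launch call at the LibTorch backend might look like (the LibTorch/LibTorchDevice type names and the Cuda device variant are assumed from Burn's tch backend, not verified against the exact version the example uses):

use burn::backend::{
    libtorch::{LibTorch, LibTorchDevice},
    Autodiff,
};

// Assumes `launch` and `ElemType` as defined in the example's main module.
launch::<Autodiff<LibTorch<ElemType>>>(vec![
    LibTorchDevice::Cuda(0),
    LibTorchDevice::Cuda(1),
]);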

joshhansen commented 5 months ago

Training does appear to work with the LibTorch GPU backend with multiple GPUs specified. That may not be much use to me, though; I am specifically migrating away from LibTorch because of its lack of thread safety.

Running on the wgpu-no-validation branch surprisingly results in the same validation error: experiment.log

nathanielsimard commented 5 months ago

@joshhansen My intuition is that the problem may come from a precision error, where wgpu can't convert the literal to a float32. If you change that value, does it work?

joshhansen commented 5 months ago

Change 0.00000000023283064365386963f? My apologies, I'm not familiar with Burn's compilation process. Where would that value "live" such that I could modify it?
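
For context, that literal is 2^-32 (1/4294967296), the sort of scale factor commonly used to map a 32-bit integer into [0, 1) in random-number code; a quick, purely illustrative check in Rust:

fn main() {
    // 2^-32 as an f64; printed to 26 decimal places it matches the literal above.
    let scale = 1.0f64 / 4294967296.0; // 1 / 2^32
    println!("{scale:.26}"); // prints 0.00000000023283064365386963
    assert_eq!(scale, 2f64.powi(-32));
}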

nathanielsimard commented 4 months ago

@joshhansen I guessed it was a constant defined by your code 😅