tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.
https://burn.dev
Apache License 2.0

Too many bindings of type StorageBuffers in Stage ShaderStages(COMPUTE) #1970

Open ianmarmour opened 4 months ago

ianmarmour commented 4 months ago

Describe the bug

When attempting to train my model with a batch size > 1, there are too many bindings of type StorageBuffer. This leads to a panic in WGPU and causes the training process to crash. It appears that the correct limits for the adapter are being inferred, but potentially not respected by fusion; if fusion is disabled, this error isn't encountered.

To Reproduce

  1. Use WGPU as the device type on a Mac.
  2. .... unsure...
  3. cargo run to perform training with batch size > 1
  4. On the second batch of items the panic will occur.

Expected behavior

Generated kernels should respect the adapter's maximum number of available StorageBuffers rather than exceeding it.

Screenshots

2024-07-04T01:51:02.976015Z  INFO burn_fusion::stream::store::base: New execution plan 76 - Operations: 1 - Triggers 1    
2024-07-04T01:51:02.976063Z  INFO burn_fusion::stream::store::base: New execution plan 77 - Operations: 3 - Triggers 1    
2024-07-04T01:51:02.985188Z  INFO burn_jit::fusion::kernel: Compiling ... "mri0pvnx16y16z16g0vs7ubdfg"    
2024-07-04T01:51:03.194143Z  INFO burn_jit::fusion::kernel: Compiling ... "mri0pvnx16y16z1675o2aql5m4"    
2024-07-04T01:51:03.404701Z  INFO burn_compute::tune::tuner: Fastest result burn_jit::fusion::kernel::AutotunableKernel<burn_wgpu::runtime::WgpuRuntime>-Fusion ElemWise - num_operations: 4 shape: [32, 64, 2, 64]    
2024-07-04T01:51:03.428641Z  INFO burn_jit::fusion::kernel: Compiling ... "mi0o0ri0pvnx16y16z16g0vs7ubdfg"    
2024-07-04T01:51:03.515667Z  INFO burn_fusion::stream::store::base: New execution plan 78 - Operations: 99 - Triggers 1    
2024-07-04T01:51:03.595443Z  INFO burn_jit::fusion::kernel: Compiling ... "mri0pi2pi3pi4pi5pi6pi7pi8pi9pi10pi11pi12pi13pi14pi15pi16pi17pi18pi19pi20pi21pi22pi23pi24pi25pi26pi27pi28pi29pi30pi31pi32pi33pi34pi35pi36pi37pi38pi39pi40pi41pi42pi43pi44pi45pi46pi47pi48pi49pi50pi51pi52pi53pi54pi55pi56pi57pi58pi59pi60pi61pi62pi63pi64pi66pvnx16y16z160u635da9uo"    
2024-07-04T01:51:03.597239Z ERROR wgpu::backend::wgpu_core: Handling wgpu errors as fatal by default    
2024-07-04T01:51:03.597268Z ERROR burn_train::learner::application_logger: PANIC => panicked at /Users/ian/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.20.1/src/backend/wgpu_core.rs:2996:5:
wgpu error: Validation Error

Caused by:
    In Device::create_compute_pipeline
    Unable to derive an implicit layout
    Too many bindings of type StorageBuffers in Stage ShaderStages(COMPUTE), limit is 31, count was 102. Check the limit `max_storage_buffers_per_shader_stage` passed to `Adapter::request_device`


Additional context

Training Configuration

#[derive(Config)]
pub struct TrainingConfig {
    pub model: TasNetConfig,
    pub optimizer: AdamConfig,
    #[config(default = 10)]
    pub num_epochs: usize,
    #[config(default = 2)]
    pub batch_size: usize,
    #[config(default = 4)]
    pub num_workers: usize,
    #[config(default = 42)]
    pub seed: u64,
    #[config(default = 3.0e-4)]
    pub learning_rate: f64,
}
fn main() {
    let device = WgpuDevice::BestAvailable;

    training::train::<Autodiff<Wgpu>>(
        "/tmp/guide",
        training::TrainingConfig::new(TasNetConfig::new(500, 40, 1000, 10, 2), AdamConfig::new()),
        device,
    );
}
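
As additional context: the default batch_size of 2 above is already enough to trigger the panic. A small sketch (using the with_* setters that the Config derive generates; not taken from my actual code) to confirm that only batch sizes above 1 are affected:

fn main() {
    let device = WgpuDevice::BestAvailable;

    // Sketch only: override the default batch_size of 2 back to 1.
    // With batch_size = 1 training runs; with 2 or more it panics on the
    // second batch, as described above.
    let config = training::TrainingConfig::new(
        TasNetConfig::new(500, 40, 1000, 10, 2),
        AdamConfig::new(),
    )
    .with_batch_size(1);

    training::train::<Autodiff<Wgpu>>("/tmp/guide", config, device);
}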

Metal GPU Feature Set Table: Indicates the inferred maximum limit of 31 is correct.
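
As a quick double-check, the limit can also be read straight off the adapter. A minimal sketch, assuming wgpu ~0.20 and the pollster crate (outside of burn entirely):

// Standalone query of the adapter's storage-buffer limit (sketch only).
fn main() {
    let instance = wgpu::Instance::default();
    let adapter = pollster::block_on(instance.request_adapter(&wgpu::RequestAdapterOptions::default()))
        .expect("no suitable GPU adapter found");

    println!("adapter: {:?}", adapter.get_info());
    // On this Mac it should print 31, matching the limit in the panic message.
    println!(
        "max_storage_buffers_per_shader_stage = {}",
        adapter.limits().max_storage_buffers_per_shader_stage
    );
}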

BjornTheProgrammer commented 2 weeks ago

The largest batch size I could use before getting the same error was 7, with the following model:

Model:
Model {
  linear1: Linear {d_input: 192, d_output: 10, bias: true, params: 1930}
  linear2: Linear {d_input: 10, d_output: 10, bias: true, params: 110}
  linear3: Linear {d_input: 10, d_output: 3, bias: true, params: 33}
  dropout: Dropout {prob: 0.5}
  activation: Relu
  params: 2073
}
Total Epochs: 10
#[derive(Config)]
pub struct TrainingConfig {
    pub optimizer: AdamConfig,
    #[config(default = 10)]
    pub num_epochs: usize,
    #[config(default = 7)]
    pub batch_size: usize,
    #[config(default = 4)]
    pub num_workers: usize,
    #[config(default = 42)]
    pub seed: u64,
    #[config(default = 1.0e-4)]
    pub learning_rate: f64,
    #[config(default = true)]
    pub custom_renderer: bool,
}

I suspect this is somehow due to the number of tensors being created, as it only became an issue when I implemented the following code:

// This is a single batch; it holds an inputs tensor and a targets tensor.
// Inputs is a 2D array of the 192 sensor inputs:
// [[f32; 192]; n], where n is the number of items in the batch.
// Targets holds the corresponding labels.
#[derive(Clone, Debug)]
pub struct DevilBatch<B: Backend> {
    pub inputs: Tensor<B, 2>,
    pub targets: Tensor<B, 1>,
}

impl<B: Backend> ModelForwardStep<B> for Model<B> {
    fn forward_step(&self, item: DevilBatch<B>) -> RegressionOutput<B> {
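        // Split the [batch, 192] input into `batch` separate rows and run the
        // model on each row on its own, producing one small tensor graph per item.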
        let num_chunks = item.inputs.dims()[0];
        let chunks = item.inputs.chunk(num_chunks, 0);

        let inputs: Vec<_> = chunks.into_iter().map(|item| item.squeeze::<1>(0)).collect();
        let outputs: Vec<_> = inputs.into_iter().map(|input| self.forward(input).unsqueeze()).collect();
        let outputs: Tensor<B, 2> = Tensor::cat(outputs, 0);

        let targets: Tensor<B, 2> = item.targets.unsqueeze_dim(1);

        let loss = MseLoss::new().forward(outputs.clone(), targets.clone(), nn::loss::Reduction::Mean);

        RegressionOutput {
            loss,
            output: outputs,
            targets,
        }
    }
}

Previously all I had done was make the model accept a Tensor2D and return a Tensor2D, but now it accepts a Tensor1D and returns a Tensor1D, and that is when I got the error.

I could be wrong, but hopefully this provides a lead.
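
If the per-item chunk/forward/cat loop is what blows past the binding limit, one way to test that theory is to keep the whole batch as a single 2D tensor. A rough sketch below; forward_batch is a hypothetical 2D forward ([batch, 192] -> [batch, 3]) that the model would need to expose, and the rest mirrors my implementation above:

impl<B: Backend> ModelForwardStep<B> for Model<B> {
    fn forward_step(&self, item: DevilBatch<B>) -> RegressionOutput<B> {
        // One pass over the whole batch instead of one pass per item,
        // so far fewer tensors (and storage-buffer bindings) are created.
        let outputs: Tensor<B, 2> = self.forward_batch(item.inputs);
        let targets: Tensor<B, 2> = item.targets.unsqueeze_dim(1);

        let loss = MseLoss::new().forward(outputs.clone(), targets.clone(), nn::loss::Reduction::Mean);

        RegressionOutput {
            loss,
            output: outputs,
            targets,
        }
    }
}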

nathanielsimard commented 1 week ago

Just to make sure: does the problem only happen when the fusion feature is enabled?

BjornTheProgrammer commented 1 week ago

These are the only features I'm using: features = ["default", "train", "wgpu"].

I have the git repo here, if this helps.