ianmarmour opened 4 months ago
I was able to use a batch size of at most 7 before hitting the same error, with the following model:
Model:
Model {
  linear1: Linear {d_input: 192, d_output: 10, bias: true, params: 1930}
  linear2: Linear {d_input: 10, d_output: 10, bias: true, params: 110}
  linear3: Linear {d_input: 10, d_output: 3, bias: true, params: 33}
  dropout: Dropout {prob: 0.5}
  activation: Relu
  params: 2073
}
Total Epochs: 10
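For context, the forward pass is wired to match that architecture, roughly like this (a simplified sketch rather than the exact code from the repo; only the layer shapes come from the debug output above, the placement of dropout and the activation is approximate):

use burn::tensor::{backend::Backend, Tensor};

// Sketch: field names match the debug output (linear1/linear2/linear3,
// dropout, activation); Burn's Linear/Dropout/Relu forward methods are
// generic over the tensor rank, so this works for both 1-D and 2-D input.
impl<B: Backend> Model<B> {
    pub fn forward<const D: usize>(&self, input: Tensor<B, D>) -> Tensor<B, D> {
        let x = self.activation.forward(self.linear1.forward(input)); // 192 -> 10
        let x = self.dropout.forward(x);
        let x = self.activation.forward(self.linear2.forward(x)); // 10 -> 10
        self.linear3.forward(x) // 10 -> 3
    }
}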
#[derive(Config)]
pub struct TrainingConfig {
    pub optimizer: AdamConfig,
    #[config(default = 10)]
    pub num_epochs: usize,
    #[config(default = 7)]
    pub batch_size: usize,
    #[config(default = 4)]
    pub num_workers: usize,
    #[config(default = 42)]
    pub seed: u64,
    #[config(default = 1.0e-4)]
    pub learning_rate: f64,
    #[config(default = true)]
    pub custom_renderer: bool,
}
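For completeness, the config is constructed roughly like this (Burn's derive(Config) generates a new constructor taking the fields without a default, i.e. just the optimizer config here; the exact generated methods may vary between Burn versions):

use burn::optim::AdamConfig;

fn main() {
    // The #[config(default = ...)] values above fill in everything except
    // the optimizer config, which has no default.
    let config = TrainingConfig::new(AdamConfig::new());
    assert_eq!(config.batch_size, 7);
    assert_eq!(config.num_epochs, 10);
}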
I suspect this is somehow due to the number of tensors created, as this only became an issue after I implemented the following code:
// A single batch, holding an inputs tensor and a targets tensor.
// `inputs` is a 2D array of the 192 sensor readings per item:
// [[f32; 192]; n], where n is the number of items in the batch.
// `targets` holds the corresponding labels.
#[derive(Clone, Debug)]
pub struct DevilBatch<B: Backend> {
    pub inputs: Tensor<B, 2>,
    pub targets: Tensor<B, 1>,
}
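For reference, the batch itself is built by concatenating per-item tensors, roughly like this (simplified sketch, not the exact batcher from the repo; the (readings, label) tuple type is just for illustration, and from_floats behavior can differ slightly between Burn versions):

use burn::tensor::{backend::Backend, Tensor};

// Simplified sketch: turn n items of (192 sensor readings, label) into one batch.
fn to_batch<B: Backend>(items: Vec<([f32; 192], f32)>, device: &B::Device) -> DevilBatch<B> {
    let inputs: Vec<_> = items
        .iter()
        .map(|(x, _)| Tensor::<B, 1>::from_floats(*x, device).unsqueeze::<2>()) // [1, 192]
        .collect();
    let targets: Vec<_> = items
        .iter()
        .map(|(_, y)| Tensor::<B, 1>::from_floats([*y], device))
        .collect();

    DevilBatch {
        inputs: Tensor::cat(inputs, 0),   // [n, 192]
        targets: Tensor::cat(targets, 0), // [n]
    }
}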
impl<B: Backend> ModelForwardStep<B> for Model<B> {
    fn forward_step(&self, item: DevilBatch<B>) -> RegressionOutput<B> {
        // Split the [batch, 192] inputs into one [1, 192] chunk per row.
        let num_chunks = item.inputs.dims()[0];
        let chunks = item.inputs.chunk(num_chunks, 0);

        // Squeeze each chunk down to a 1-D [192] tensor and run the model on it,
        // one item at a time.
        let inputs: Vec<_> = chunks.into_iter().map(|item| item.squeeze::<1>(0)).collect();
        let outputs: Vec<_> = inputs
            .into_iter()
            .map(|input| self.forward(input).unsqueeze())
            .collect();

        // Reassemble the per-item outputs into a [batch, d_output] tensor.
        let outputs: Tensor<B, 2> = Tensor::cat(outputs, 0);
        let targets: Tensor<B, 2> = item.targets.unsqueeze_dim(1);
        let loss = MseLoss::new().forward(outputs.clone(), targets.clone(), nn::loss::Reduction::Mean);

        RegressionOutput {
            loss,
            output: outputs,
            targets,
        }
    }
}
Previously the model accepted a Tensor2D and returned a Tensor2D; now it accepts a Tensor1D and returns a Tensor1D, and that is when I started getting the error.
I could be wrong, but hopefully this provides a lead.
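For comparison, the earlier fully batched version looked roughly like this (a sketch from memory, assuming a forward that takes the whole [batch, 192] tensor); it avoids the per-row chunk/squeeze/cat round trip entirely:

use burn::nn::{self, loss::MseLoss};
use burn::tensor::{backend::Backend, Tensor};
use burn::train::RegressionOutput;

impl<B: Backend> ModelForwardStep<B> for Model<B> {
    fn forward_step(&self, item: DevilBatch<B>) -> RegressionOutput<B> {
        // Single forward pass over the whole [batch, 192] tensor.
        let outputs: Tensor<B, 2> = self.forward(item.inputs); // [batch, d_output]
        let targets: Tensor<B, 2> = item.targets.unsqueeze_dim(1); // [batch, 1]
        let loss = MseLoss::new().forward(outputs.clone(), targets.clone(), nn::loss::Reduction::Mean);

        RegressionOutput { loss, output: outputs, targets }
    }
}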
Just to make sure: does the problem only happen when the fusion feature is enabled?
These are the only features I'm using: features = ["default", "train", "wgpu"].
I have the git repo here, if this helps.
Describe the bug
When attempting to train my model with a batch size > 1, there are too many bindings of type StorageBuffer. This leads to a panic in WGPU and causes the training process to crash. It appears that the correct limits for the adapter are being inferred, but potentially not respected by fusion; if fusion is disabled, this error isn't encountered.

To Reproduce
cargo run to perform training with a batch size > 1.

Expected behavior
Proper allocation of the correct maximum number of available StorageBuffers.
Additional context
Training Configuration
Metal GPU Feature Set Table: Indicates the inferred maximum limit of 31 is correct.
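For anyone who wants to double-check what limit the adapter actually reports, a small standalone wgpu snippet along these lines works (sketch; assumes the pollster crate for blocking on the async adapter request, and exact signatures can shift between wgpu versions):

// Print the storage-buffer binding limit the adapter reports.
fn main() {
    let instance = wgpu::Instance::default();
    let adapter = pollster::block_on(
        instance.request_adapter(&wgpu::RequestAdapterOptions::default()),
    )
    .expect("no suitable GPU adapter found");

    // On Apple GPUs (Metal) this is typically 31, matching the feature set table.
    println!(
        "max_storage_buffers_per_shader_stage = {}",
        adapter.limits().max_storage_buffers_per_shader_stage
    );
}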