tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.
https://burn.dev
Apache License 2.0

Multiple training runs in parallel #715

Open wbrickner opened 1 year ago

wbrickner commented 1 year ago

Feature description

I have an optimization that is very sensitive to initialization. No idea why. Instead of getting it right with elegant math, I have found I can just try over and over until I get a good initial state.

I'm not nearly saturating my GPU's parallelism. What I want is an optimizer / training loop from burn that can perform the whole optimization process over N parameter sets, like basically adding another dimension to all the tensors (and isolating certain operations against them across this dimension).

Feature motivation

Being able to conduct N training runs over an identical model architecture and loss function at the same time (possibly with different data, different learning rates, and different initializations).

(Optional) Suggest a Solution

This might be very easy to implement with some modifications to autodiff and the optimizers. I don't have enough familiarity to say. Ideally the resulting external API would not change when using a list of learning rates / schedules, initializers, etc.

nathanielsimard commented 1 year ago

Hmm, interesting. I don't think we can support that feature by adding a batch dimension automatically. The code would be very different, since each module's state has no batch dimension.

I believe the easiest and most flexible solution is probably to create one learner per model and launch them in parallel. This way, each model/experiment can have its own artifact directory with metrics, checkpoints, etc. that you can compare.

let data = ...;
let learners = [
    build_learner(device1, artifact1),
    build_learner(device2, artifact2),
    build_learner(device3, artifact3),
];

let models: Vec<_> = learners
    .map(|learner| (learner, data.clone()))
    .into_par_iter()
    .map(|(learner, data)| learner.fit(data))
    .collect();

Let me know if it helps!

wbrickner commented 1 year ago

If there is no better way, because of the burn compute graph design and tensor generics, etc., perhaps we can build this multithreading approach into burn itself. e.g. training could look like:

let learner = 
    LearnerBuilder::new("./artifacts")
      .metric_train_plot(LossMetric::new())
      .metric_valid_plot(LossMetric::new())
      .devices(vec![device])
      .num_epochs(50)
      .build(
        [1e-3, 3e-3, 5e-3, 8e-3]
          .map(|lr| (
            model.clone(),
            AdamConfig::new().init(),
            lr
          ))
      );

Perhaps we could redefine the build method to accept an Iterator<Item = (Model, Optim, LR)>, allowing full control over the set of runs conducted; a rough sketch of that shape follows.
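As an illustration of that shape (using stand-in Model / Optim / Learner / LearnerBuilder types here, not burn's real ones), the signature could look something like:

// Stand-in types for illustration only; burn's actual builder and types differ.
struct Model;
struct Optim;
struct Learner {
  // One (model, optimizer, learning rate) triple per training run.
  runs: Vec<(Model, Optim, f64)>,
}

struct LearnerBuilder;

impl LearnerBuilder {
  // Hypothetical: build accepts any iterator of runs instead of a single triple.
  fn build(self, runs: impl IntoIterator<Item = (Model, Optim, f64)>) -> Learner {
    Learner { runs: runs.into_iter().collect() }
  }
}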

This way the ordinary usage hardly changes at all (or we could instead move this functionality to a new method build_multi):

.build([(
  model,
  AdamConfig::new().init(),
  1e-3
)]);

nathanielsimard commented 1 year ago

> Does burn / the underlying backend allow for sharing the GPU this way? Is this efficient? I would think this would cause massive performance loss because of memory churn and lack of efficient parallelism (I don't know how GPU sharing works at a low level, but I assume multiple simultaneous operations do not get synchronized and coalesced by the driver into one big operation).

I think we should make sure that you can leverage the GPU efficiently with multiple threads at the same time! CUDA has streams for that (https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/) and I think LibTorch is using them. WGPU is thread-safe; I'm not sure about the internals, but I would be interested in knowing more about how it behaves. A CPU backend will probably have no problem being executed this way.

> Can this transform be done automatically? Or is it the const generic on tensor dimension, plus limited a priori trait knowledge of how the computational graph is connected, that prevents this from working? I was an ArrayFire user for a while, and it lacks any generics that communicate dimension, so you can pull these tricks easily.

The problem isn't the computation graph but the modules. The Linear module has a tensor of rank two for its weights and a tensor of rank one for its bias. Having to support a batch dimension would be a significant breaking change that would affect every module and add a lot of complexity to the API. If you are building your own modules, you can add a batch dimension to your parameters and take that approach if you want, but I don't think we should enforce it for the popular modules.
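For illustration, a hand-rolled layer along those lines could store its weights with a leading "population" dimension and apply every member with one batched matmul. A minimal sketch with burn tensors, assuming rank-3 matmul broadcasts over the leading dimension (PopulationLinear is a made-up name, and the Module derive, bias, and initialization are omitted):

use burn::tensor::{backend::Backend, Tensor};

// Weights for N independent linear layers, stored as [population, d_in, d_out].
struct PopulationLinear<B: Backend> {
    weight: Tensor<B, 3>,
}

impl<B: Backend> PopulationLinear<B> {
    // Input is [population, batch, d_in]; output is [population, batch, d_out].
    // The matmul is batched over the leading population dimension.
    fn forward(&self, input: Tensor<B, 3>) -> Tensor<B, 3> {
        input.matmul(self.weight.clone())
    }
}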

> This way the ordinary usage hardly changes at all (or we could instead move this functionality to a new method build_multi):

We could provide a build_multi method on the builder. It would return a list of learners instead of just one. Additionally, we could offer a function fit_all(dataloader, learners) -> Vec<Modules> to execute them all in parallel.
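A minimal sketch of what such a helper could look like, assuming rayon for the parallelism and using stand-in Learner / DataLoader / Module types rather than burn's real ones:

use rayon::prelude::*;

// Stand-in types for illustration only; not burn's actual API.
#[derive(Clone)]
struct DataLoader;
struct Module;
struct Learner;

impl Learner {
    fn fit(self, _data: DataLoader) -> Module {
        Module
    }
}

// Hypothetical fit_all: train every learner on its own rayon task and collect
// the resulting modules. Each learner would keep its own artifact directory.
fn fit_all(data: DataLoader, learners: Vec<Learner>) -> Vec<Module> {
    learners
        .into_par_iter()
        .map(|learner| learner.fit(data.clone()))
        .collect()
}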

wbrickner commented 1 year ago

I plead that if this gets implemented, it's opaque and build_multi also returns Learner rather than Vec<Learner>; otherwise it's no easier or more elegant than building my own Vec of learners.

As for the streams and underlying backend implementations, I'm a bit ignorant; I assumed that the tensor kernels being run would end up unsynchronized (so data and instruction access patterns would be a lot worse). I can run a test to check the performance implications of multithreading vs. "batching":

// Criterion benchmark comparing one batched element-wise multiply against the
// same work split into `batch` independent multiplies driven from rayon threads.
// Imports assume burn's tch backend (burn-tch), criterion, and rayon are available.
use burn::tensor::{Distribution, Tensor};
use burn_tch::{TchBackend, TchDevice};
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rayon::prelude::*;

fn gpu_mul_benches(c: &mut Criterion) {
  let size = 2usize.pow(26);
  let batch = 16;

  // Closures that allocate fresh random tensors on the Metal (MPS) device.
  let x = || Tensor::<TchBackend<f32>, 2>::random_device([batch, size], Distribution::Default, &TchDevice::Mps);
  let y = || Tensor::<TchBackend<f32>, 1>::random_device([size], Distribution::Default, &TchDevice::Mps);

  // One big [batch, size] * [batch, size] multiply per iteration.
  c.bench_function("gpu_mul_batch", |bench| {
    let ab = (x(), x());

    bench.iter(|| {
      let (a, b) = black_box(ab.clone());
      let z = a * b;
      black_box(z);
    });
  });

  // `batch` independent [size] * [size] multiplies, one per rayon task.
  c.bench_function("gpu_mul_multithread", |bench| {
    let ab = (0..batch).map(|_| (y(), y())).collect::<Vec<_>>();

    bench.iter(|| {
      ab.clone()
        .into_par_iter()
        .for_each(|ab| {
          let (a, b) = black_box(ab);
          let z = a * b;
          black_box(z);
        });
    });
  });
}

criterion_group!(benches, gpu_mul_benches);
criterion_main!(benches);

The results are very shocking:

gpu_mul_batch           time:   [39.374 ms 40.432 ms 41.563 ms]
Found 15 outliers among 100 measurements (15.00%)
  15 (15.00%) high severe

gpu_mul_multithread     time:   [34.787 ms 34.830 ms 34.878 ms]
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

How can this be?! My GPU is the M1 Max and has a unified memory architecture, so perhaps this result doesn't generalize to discrete GPUs.

nathanielsimard commented 1 year ago

@wbrickner I think it heavily depends on the size of the tensors. For small tensors, I expect the batching to be faster, but for big ones, I expect the multithreaded version to be equally fast. I'm also a bit surprised by the results, but I guess when working with big matrices, allocating that amount of contiguous memory is slower than allocating smaller chunks.

gcesars commented 11 months ago

This would be extremely helpful for RL use cases. I'm experimenting with Godot + Rust + Burn to build some AI for games (dummy tests for now), and being able to train several agents (small similar models) in parallel would be welcome.