tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.
https://burn.dev
Apache License 2.0

num_workers and iteration #2373

Closed · wangjiawen2013 closed 1 week ago

wangjiawen2013 commented 1 week ago

Hi, this is the configuration of the simple regression example:

{
  "optimizer": {
    "weight_decay": null,
    "momentum": null,
    "gradient_clipping": null
  },
  "num_epochs": 100,
  "num_workers": 2,
  "seed": 42,
  "input_feature_len": 10,
  "dataset_size": 442
}

and this is the experiment.log:

2024-10-15T06:34:52.205470Z  INFO burn_train::learner::train_val: Fitting the model:
 RegressionModel {
  input_layer: Linear {d_input: 10, d_output: 64, bias: true, params: 704}
  output_layer: Linear {d_input: 64, d_output: 1, bias: true, params: 65}
  activation: Relu
  params: 769
}    
2024-10-15T06:34:52.205655Z  INFO burn_train::learner::epoch: Executing training step for epoch 1    
2024-10-15T06:34:52.247715Z  INFO burn_train::learner::epoch: Iteration 1    
2024-10-15T06:34:52.277579Z  INFO burn_train::learner::epoch: Iteration 2    
2024-10-15T06:34:52.283228Z  INFO burn_train::learner::epoch: Executing validation step for epoch 1    
2024-10-15T06:34:52.296982Z  INFO burn_train::learner::early_stopping: New best epoch found 1 Loss: 0.3906857967376709    
2024-10-15T06:34:52.296984Z  INFO burn_train::checkpoint::file: Saving checkpoint 1 to D:/LenovoQMDownload/regression\checkpoint\scheduler-1    
2024-10-15T06:34:52.296986Z  INFO burn_train::checkpoint::file: Saving checkpoint 1 to D:/LenovoQMDownload/regression\checkpoint\model-1    
2024-10-15T06:34:52.297023Z  INFO burn_train::checkpoint::file: Saving checkpoint 1 to D:/LenovoQMDownload/regression\checkpoint\optim-1    
2024-10-15T06:34:52.297654Z  INFO burn_train::learner::epoch: Executing training step for epoch 2    
2024-10-15T06:34:52.342163Z  INFO burn_train::learner::epoch: Iteration 1    
2024-10-15T06:34:52.355429Z  INFO burn_train::learner::epoch: Iteration 2    

In my opinion, the number of iterations should be independent of num_workers. Because the dataset in this example is small, we do full-batch gradient descent and set the batch size equal to the dataset size, so there should be exactly 1 iteration per epoch. However, as you can see above, the number of iterations equals num_workers: both are 2. I tried different values of num_workers (2, 10, 15, 20), and the iteration count always equals num_workers. What is the relationship between iterations and num_workers?
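For context, the train dataloader in the example is presumably built along these lines (a sketch following Burn's usual DataLoaderBuilder pattern; batcher and dataset stand in for whatever the example defines, and the full-batch batch_size value is my assumption):

use burn::data::dataloader::DataLoaderBuilder;

// `batcher` and `dataset` come from the simple regression example;
// only the builder parameters matter here.
let dataloader_train = DataLoaderBuilder::new(batcher)
    .batch_size(442) // batch size == dataset_size, i.e. full-batch gradient descent
    .shuffle(42)     // seed from the config above
    .num_workers(2)  // num_workers from the config above
    .build(dataset);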

laggui commented 1 week ago

This is because of the multithreaded dataloader implementation.

The number of workers actually takes priority when splitting the data loading workload: the dataset is partitioned across the workers first, and each worker then forms batches from its own shard. So unless you set the number of workers to 1, there will always be at least num_workers batches (and therefore num_workers iterations) with the current implementation.
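To make the arithmetic concrete, here is a minimal sketch of that batch-count logic (plain Rust, not Burn's actual dataloader code; it assumes the dataset is split near-evenly across workers):

// Sketch of the batch-count arithmetic: the dataset is partitioned across
// workers first, and each worker then forms batches from its own shard.
fn total_batches(dataset_size: usize, batch_size: usize, num_workers: usize) -> usize {
    // Each worker receives roughly dataset_size / num_workers items.
    let shard = (dataset_size + num_workers - 1) / num_workers; // ceiling division
    // Each worker emits ceil(shard / batch_size) batches, so at least one.
    let batches_per_worker = (shard + batch_size - 1) / batch_size;
    batches_per_worker * num_workers
}

fn main() {
    // 442 items, batch_size 442, 2 workers -> 221 items each -> 1 batch each.
    assert_eq!(total_batches(442, 442, 2), 2); // the 2 iterations seen in the log
    assert_eq!(total_batches(442, 442, 1), 1); // a single worker restores one full batch
}

With the numbers from your config, each of the 2 workers receives 221 examples, which fit in a single batch, hence the 2 iterations per epoch in your log.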

I don't think this is explicitly detailed anywhere, so this might cause confusion and unexpected behavior (as it has for you) 🤔

wangjiawen2013 commented 1 week ago

Thanks, perfectly clear! Unless the number of workers is set to 1, there will always be at least num_workers batches.
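So to get a true single full-batch iteration per epoch, the fix on my side is presumably just the same builder with the worker count changed (same sketch as above, with placeholder batcher and dataset):

let dataloader_train = DataLoaderBuilder::new(batcher)
    .batch_size(442)
    .shuffle(42)
    .num_workers(1) // single worker: no sharding, so exactly one 442-item batch
    .build(dataset);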