neptune-ai / open-solution-mapping-challenge

Open solution to the Mapping Challenge :earth_americas:
https://www.crowdai.org/challenges/mapping-challenge
MIT License

CUDA Memory Errors at first epoch at default batch size #203

Open cvKDean opened 5 years ago

cvKDean commented 5 years ago

Good day, I would just like to ask whether you have any idea why I am running into CUDA out-of-memory errors during training. The error occurs at the end of the first epoch (epoch 0). For reference, I am simply trying to reproduce the results in REPRODUCE_RESULTS.md with the smaller dataset (annotation-small.json).

My configuration is:
OS: Windows 10 (Anaconda Prompt)
GPU: GeForce GTX 1070 Ti (single)
torch version: 1.0.1

The error stack is as follows:

2019-03-22 14-23-05 steps >>> epoch 0 average batch time: 0:00:00.7
2019-03-22 14-23-06 steps >>> epoch 0 batch 411 sum:     1.74406
2019-03-22 14-23-07 steps >>> epoch 0 batch 412 sum:     2.26457
2019-03-22 14-23-07 steps >>> epoch 0 batch 413 sum:     1.95351
2019-03-22 14-23-08 steps >>> epoch 0 batch 414 sum:     2.39538
2019-03-22 14-23-09 steps >>> epoch 0 batch 415 sum:     1.83759
2019-03-22 14-23-10 steps >>> epoch 0 batch 416 sum:     1.92264
2019-03-22 14-23-10 steps >>> epoch 0 batch 417 sum:     1.71246
2019-03-22 14-23-11 steps >>> epoch 0 batch 418 sum:     2.32141
2019-03-22 14-23-11 steps >>> epoch 0 sum:     2.18943
neptune: Executing in Offline Mode.
B:\ML Models\src\utils.py:30: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
neptune: Executing in Offline Mode.
B:\ML Models\src\utils.py:30: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
neptune: Executing in Offline Mode.
B:\ML Models\src\utils.py:30: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
neptune: Executing in Offline Mode.
B:\ML Models\src\utils.py:30: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
neptune: Executing in Offline Mode.
B:\ML Models\src\utils.py:30: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
neptune: Executing in Offline Mode.
B:\ML Models\src\utils.py:30: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
neptune: Executing in Offline Mode.
B:\ML Models\src\utils.py:30: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
neptune: Executing in Offline Mode.
B:\ML Models\src\utils.py:30: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
B:\ML Models\src\callbacks.py:168: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  X = Variable(X, volatile=True).cuda()
Traceback (most recent call last):
  File "main.py", line 93, in <module>
    main()
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 697, in main
    rv = self.invoke(ctx)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "main.py", line 31, in train
    pipeline_manager.train(pipeline_name, dev_mode)
  File "B:\ML Models\src\pipeline_manager.py", line 32, in train
    train(pipeline_name, dev_mode, self.logger, self.params, self.seed)
  File "B:\ML Models\src\pipeline_manager.py", line 116, in train
    pipeline.fit_transform(data)
  File "B:\ML Models\src\steps\base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  File "B:\ML Models\src\steps\base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  File "B:\ML Models\src\steps\base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  [Previous line repeated 4 more times]
  File "B:\ML Models\src\steps\base.py", line 112, in fit_transform
    return self._cached_fit_transform(step_inputs)
  File "B:\ML Models\src\steps\base.py", line 123, in _cached_fit_transform
    step_output_data = self.transformer.fit_transform(**step_inputs)
  File "B:\ML Models\src\steps\base.py", line 262, in fit_transform
    self.fit(*args, **kwargs)
  File "B:\ML Models\src\models.py", line 82, in fit
    self.callbacks.on_epoch_end()
  File "B:\ML Models\src\steps\pytorch\callbacks.py", line 92, in on_epoch_end
    callback.on_epoch_end(*args, **kwargs)
  File "B:\ML Models\src\steps\pytorch\callbacks.py", line 163, in on_epoch_end
    val_loss = self.get_validation_loss()
  File "B:\ML Models\src\callbacks.py", line 132, in get_validation_loss
    return self._get_validation_loss()
  File "B:\ML Models\src\callbacks.py", line 138, in _get_validation_loss
    outputs = self._transform()
  File "B:\ML Models\src\callbacks.py", line 172, in _transform
    outputs_batch = self.model(X)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\parallel\data_parallel.py", line 141, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "B:\ML Models\src\unet_models.py", line 387, in forward
    conv2 = self.conv2(conv1)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\container.py", line 92, in forward
    input = module(input)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torchvision\models\resnet.py", line 88, in forward
    out = self.bn3(out)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\batchnorm.py", line 76, in forward
    exponential_average_factor, self.eps)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\functional.py", line 1623, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 8.00 GiB total capacity; 6.18 GiB already allocated; 56.00 MiB free; 48.95 MiB cached)

Lowering the batch size from the default 20 to 10 decreased GPU memory usage during training from ~6 GB to ~4 GB, but at the end of epoch 0 the usage climbed back up to ~6 GB. Subsequent epochs have continued training at ~6 GB.

Is this behavior expected/normal? I read somewhere that you also used GTX 1070 GPUs for training, so I thought I would be able to train at the default batch size. Also, is it normal for GPU memory usage to increase between epoch 0 and epoch 1? Thank you!
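
One thing I noticed in the trace is the UserWarning from callbacks.py ("volatile was removed and now has no effect. Use `with torch.no_grad():` instead."). If the validation pass at the end of the epoch still relies on the removed volatile flag, it builds a full autograd graph on torch 1.0.1, which could account for the extra memory appearing exactly when epoch 0 finishes. Below is a minimal sketch of the no_grad pattern the warning points to; the model, loader, and loss names are placeholders, not the actual objects from callbacks.py.

import torch

def validation_loss(model, valid_loader, loss_fn, device="cuda"):
    # model.eval() switches off dropout and uses the running BatchNorm stats;
    # torch.no_grad() stops autograd from retaining intermediate activations,
    # which is what Variable(X, volatile=True) used to do before its removal.
    model.eval()
    total, batches = 0.0, 0
    with torch.no_grad():
        for X, y in valid_loader:
            outputs = model(X.to(device))
            total += loss_fn(outputs, y.to(device)).item()
            batches += 1
    model.train()
    return total / max(batches, 1)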

zeciro commented 3 years ago

Hi,

I have the same issue. After the first epoch I get: RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 8.00 GiB total capacity; 6.43 GiB already allocated; 0 bytes free; 6.53 GiB reserved in total by PyTorch)

I am running the mapping challenge dataset.

I have experimented with varying batch sizes and numbers of workers, but the problem occurs regardless of the settings.

Update: Significantly reducing the batch size (from 20 to 8) solved the issue for me.
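
To see where the memory actually goes between the training loop and the end-of-epoch validation, it also helped me to log the allocator counters around the validation call. Rough sketch only; the helper below is hypothetical and not part of this repo.

import torch

def log_gpu_memory(tag):
    # memory_allocated() reports the allocator's current usage on the default
    # CUDA device; max_memory_allocated() reports its peak for the process.
    current = torch.cuda.memory_allocated() / 1024 ** 2
    peak = torch.cuda.max_memory_allocated() / 1024 ** 2
    print(f"{tag}: current={current:.0f} MiB, peak={peak:.0f} MiB")

# For example, call log_gpu_memory("before validation") and
# log_gpu_memory("after validation") around the epoch-end callback.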