lopuhin / transformer-lm

Transformer language model (GPT-2) with sentencepiece tokenizer

Fails to resume on multiple GPUs #21

Closed knok closed 4 years ago

knok commented 4 years ago

I tried to resume training with the gpt-2 command on an AWS p2.8xlarge instance, which has 8 GPUs, and it fails with "RuntimeError: CUDA out of memory.":

  File "/home/ubuntu/efs/transformer-lm/lm/main.py", line 140, in main
    load_model()
  File "/home/ubuntu/efs/transformer-lm/lm/main.py", line 136, in load_model
    optimizer.load_state_dict(torch.load(optimizer_path))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 387, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 574, in _load
    result = unpickler.load()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 537, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 119, in default_restore_location
    result = fn(storage, location)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 99, in _cuda_deserialize
    return storage_type(obj.size())
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.17 GiB total capacity; 962.69 MiB already allocated; 17.19 MiB free; 63.31 MiB cached)

With the same configuration on a single-GPU instance, it works fine.
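
The traceback bottoms out in `optimizer.load_state_dict(torch.load(optimizer_path))`. By default, `torch.load` restores CUDA tensors onto the device they were saved from (GPU 0 here), so when each of the 8 spawned workers deserializes the optimizer checkpoint they all allocate on cuda:0 and it runs out of memory, even though the other GPUs sit idle. A minimal sketch of the difference (paths and device choice are illustrative, not the project's actual code):

```python
import torch

# Default: CUDA storages are restored onto the GPU they were saved from.
# With 8 workers this piles 8 copies of the optimizer state onto cuda:0.
state = torch.load("optimizer.pt")

# Mapping the checkpoint onto the worker's own device (or onto the CPU)
# spreads the allocations instead of exhausting a single GPU.
local_device = torch.device("cuda:4")  # whichever device this worker owns
state = torch.load("optimizer.pt", map_location=local_device)
```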

lopuhin commented 4 years ago

Could there be any processes that have not exited and are still holding on to GPU memory after the previous run? E.g. does nvidia-smi or ps aux | grep gpt-2 show anything?

knok commented 4 years ago

This happened on a spot-request instance, and there are no other processes using the GPUs. The following is the result:

$ nvidia-smi
Fri Oct 25 09:01:20 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:17.0 Off |                    0 |
| N/A   47C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:00:18.0 Off |                    0 |
| N/A   36C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:00:19.0 Off |                    0 |
| N/A   50C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:00:1A.0 Off |                    0 |
| N/A   36C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 00000000:00:1B.0 Off |                    0 |
| N/A   48C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 00000000:00:1C.0 Off |                    0 |
| N/A   33C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 00000000:00:1D.0 Off |                    0 |
| N/A   49C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ ps auxw|grep gpt-2
ubuntu    24034  0.0  0.0  12948  1092 pts/1    S+   09:01   0:00 grep --color=auto gpt-2
$ gpt-2 run-root4 data/enc2 sp-model.model
Loading dataset from data/enc2
Train dataset has 465,420,123 tokens
Validation dataset has 128,545 tokens
Loading dataset from data/enc2
Train dataset has 465,420,123 tokens
Validation dataset has 128,545 tokens
Loading dataset from data/enc2
Train dataset has 465,420,123 tokens
Validation dataset has 128,545 tokens
Loading dataset from data/enc2
Train dataset has 465,420,123 tokens
Validation dataset has 128,545 tokens
Loading dataset from data/enc2
Train dataset has 465,420,123 tokens
Validation dataset has 128,545 tokens
Loading dataset from data/enc2
Train dataset has 465,420,123 tokens
Validation dataset has 128,545 tokens
Loading dataset from data/enc2
Train dataset has 465,420,123 tokens
Validation dataset has 128,545 tokens
Resuming from seen_tokens 33,193,984
device cuda:4 initializing process group
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/bin/gpt-2", line 11, in <module>
    load_entry_point('lm', 'console_scripts', 'gpt-2')()
  File "/home/ubuntu/efs/transformer-lm/lm/main.py", line 322, in fire_main
    fire.Fire(only_allow_defined_args(main))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "/home/ubuntu/efs/transformer-lm/lm/fire_utils.py", line 30, in _return_wrapped
    return function_to_decorate(*args, **kwargs)
  File "/home/ubuntu/efs/transformer-lm/lm/main.py", line 56, in main
    mp.spawn(_main_mp, (kwargs,), n_devices)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
    while not spawn_context.join():
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 114, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ubuntu/efs/transformer-lm/lm/main.py", line 318, in _main_mp
    return main(**kwargs)
  File "/home/ubuntu/efs/transformer-lm/lm/main.py", line 140, in main
    load_model()
  File "/home/ubuntu/efs/transformer-lm/lm/main.py", line 136, in load_model
    optimizer.load_state_dict(torch.load(optimizer_path))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 387, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 574, in _load
    result = unpickler.load()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 537, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 119, in default_restore_location
    result = fn(storage, location)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 99, in _cuda_deserialize
    return storage_type(obj.size())
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.17 GiB total capacity; 1.38 GiB already allocated; 15.19 MiB free; 114.94 MiB cached)

knok commented 4 years ago

I'm trying to use it with a Japanese Wikipedia corpus, so I can give you my corpus, checkpoints, and so on.

lopuhin commented 4 years ago

I see, thanks for checking - it looks like a bug. I never tried to resume in a multi-GPU setup, but I hope this should be fixable.

lopuhin commented 4 years ago

@knok I don't have a multi-GPU machine to test on at the moment, but maybe https://github.com/lopuhin/transformer-lm/commit/b10a31491c658390b0eec0e851b2c1f0a8b28b53 would fix this.
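
(The linked commit isn't reproduced here, but a common variant of the fix is to map the checkpoint onto the CPU and let `Optimizer.load_state_dict` cast each state tensor to the device of the parameter it belongs to, so every worker ends up on its own GPU. A hedged sketch, with placeholder model, optimizer, and path standing in for what main.py actually builds:)

```python
import torch
import torch.nn as nn

# Placeholders only; main.py constructs the real model and optimizer.
model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.Adam(model.parameters())
optimizer_path = "run-root4/optimizer.pt"  # illustrative path

# Load the serialized state onto the CPU first; load_state_dict then moves
# each state tensor to match its parameter's device, so nothing is forced
# onto cuda:0 by the deserialization step itself.
state = torch.load(optimizer_path, map_location="cpu")
optimizer.load_state_dict(state)
```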

knok commented 4 years ago

It works fine! Thank you for your quick fix.

lopuhin commented 4 years ago

Great, thank you for checking @knok 👍 Merged the fix into master.