Closed knok closed this 4 years ago

I tried to resume training with the gpt-2 command on an AWS p2.8xlarge instance, which has 8 GPUs, and it fails with "RuntimeError: CUDA out of memory". With the same configuration on a single-GPU instance, it works fine.
Could there be any processes that still haven't exited and are holding on to GPU memory after the previous run? E.g. does nvidia-smi or ps aux | grep gpt-2 show anything?
It happened on a spot-request instance; there are no other processes using the GPU. Here is the result:
$ nvidia-smi
Fri Oct 25 09:01:20 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:17.0 Off |                    0 |
| N/A   47C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:00:18.0 Off |                    0 |
| N/A   36C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:00:19.0 Off |                    0 |
| N/A   50C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:00:1A.0 Off |                    0 |
| N/A   36C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 00000000:00:1B.0 Off |                    0 |
| N/A   48C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 00000000:00:1C.0 Off |                    0 |
| N/A   33C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 00000000:00:1D.0 Off |                    0 |
| N/A   49C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ ps auxw|grep gpt-2
ubuntu 24034 0.0 0.0 12948 1092 pts/1 S+ 09:01 0:00 grep --color=auto gpt-2
$ gpt-2 run-root4 data/enc2 sp-model.model
Loading dataset from data/enc2
Train dataset has 465,420,123 tokens
Validation dataset has 128,545 tokens
Loading dataset from data/enc2
Train dataset has 465,420,123 tokens
Validation dataset has 128,545 tokens
Loading dataset from data/enc2
Train dataset has 465,420,123 tokens
Validation dataset has 128,545 tokens
Loading dataset from data/enc2
Train dataset has 465,420,123 tokens
Validation dataset has 128,545 tokens
Loading dataset from data/enc2
Train dataset has 465,420,123 tokens
Validation dataset has 128,545 tokens
Loading dataset from data/enc2
Train dataset has 465,420,123 tokens
Validation dataset has 128,545 tokens
Loading dataset from data/enc2
Train dataset has 465,420,123 tokens
Validation dataset has 128,545 tokens
Resuming from seen_tokens 33,193,984
device cuda:4 initializing process group
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_p36/bin/gpt-2", line 11, in <module>
load_entry_point('lm', 'console_scripts', 'gpt-2')()
File "/home/ubuntu/efs/transformer-lm/lm/main.py", line 322, in fire_main
fire.Fire(only_allow_defined_args(main))
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
component_trace = _Fire(component, args, context, name)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
component, remaining_args)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
result = fn(*varargs, **kwargs)
File "/home/ubuntu/efs/transformer-lm/lm/fire_utils.py", line 30, in _return_wrapped
return function_to_decorate(*args, **kwargs)
File "/home/ubuntu/efs/transformer-lm/lm/main.py", line 56, in main
mp.spawn(_main_mp, (kwargs,), n_devices)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
while not spawn_context.join():
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 114, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/ubuntu/efs/transformer-lm/lm/main.py", line 318, in _main_mp
return main(**kwargs)
File "/home/ubuntu/efs/transformer-lm/lm/main.py", line 140, in main
load_model()
File "/home/ubuntu/efs/transformer-lm/lm/main.py", line 136, in load_model
optimizer.load_state_dict(torch.load(optimizer_path))
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 387, in load
return _load(f, map_location, pickle_module, **pickle_load_args)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 574, in _load
result = unpickler.load()
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 537, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 119, in default_restore_location
result = fn(storage, location)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 99, in _cuda_deserialize
return storage_type(obj.size())
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.17 GiB total capacity; 1.38 GiB already allocated; 15.19 MiB free; 114.94 MiB cached)
I'm trying to use it with a Japanese Wikipedia corpus, so I can give you my corpus, checkpoints, and so on.
I see, thanks for checking - it looks like a bug. I never tried resuming in a multi-GPU setup, but this should be fixable, I hope.
@knok I don't have a multi-GPU machine to test on at the moment, but maybe https://github.com/lopuhin/transformer-lm/commit/b10a31491c658390b0eec0e851b2c1f0a8b28b53 would fix this.
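For context, the traceback shows the OOM coming from torch.load inside load_model: by default torch.load restores each tensor onto the GPU it was saved from (cuda:0 here), so when all eight spawned workers resume at once they all try to materialize the optimizer state on GPU 0. I haven't re-checked exactly what the commit above does, but the general shape of such a fix is something like the sketch below (load_optimizer_state is just an illustrative name, not the actual code in lm/main.py):

```python
import torch

def load_optimizer_state(optimizer, optimizer_path, device):
    # map_location='cpu' keeps the deserialized tensors on the host instead of
    # putting every worker's copy back onto cuda:0; for the built-in optimizers,
    # load_state_dict then casts the state to the device of the matching
    # parameters (the per-process cuda:N).
    state = torch.load(optimizer_path, map_location='cpu')
    optimizer.load_state_dict(state)

    # Alternatively, map straight onto this worker's own device:
    # state = torch.load(optimizer_path, map_location=device)
```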
It works fine! Thank you for your quick fix.
Great, thank you for checking @knok 👍 Merged the fix into master.