2024-01-24 10:35:03.180 | INFO | __main__:main:723 - *** Train ***
2024-01-24 10:35:05.503 | DEBUG | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[59652, 101, 59678, ..., 60485, 60688, 59828],
[ 106, 59810, 40183, ..., 59652, 7770, 61186]], device='cuda:2'), 'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]], device='cuda:2'), 'labels': tensor([[59652, 101, 59678, ..., 60485, 60688, 59828],
[ 106, 59810, 40183, ..., 59652, 7770, 61186]], device='cuda:2')}
2024-01-24 10:35:05.857 | DEBUG | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[59778, 102, 3215, ..., 37469, 59599, 5845],
[59722, 101, 60079, ..., 101, 59784, 23307]], device='cuda:3'), 'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]], device='cuda:3'), 'labels': tensor([[59778, 102, 3215, ..., 37469, 59599, 5845],
[59722, 101, 60079, ..., 101, 59784, 23307]], device='cuda:3')}
2024-01-24 10:35:06.311 | DEBUG | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[59693, 59780, 81, ..., 101, 28770, 60158],
[20438, 59748, 5487, ..., 59773, 101, 59818]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]], device='cuda:0'), 'labels': tensor([[59693, 59780, 81, ..., 101, 28770, 60158],
[20438, 59748, 5487, ..., 59773, 101, 59818]], device='cuda:0')}
2024-01-24 10:35:06.671 | DEBUG | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[59740, 59795, 9843, ..., 59626, 59763, 60025],
[28540, 60660, 60054, ..., 77, 77, 79]], device='cuda:4'), 'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]], device='cuda:4'), 'labels': tensor([[59740, 59795, 9843, ..., 59626, 59763, 60025],
[28540, 60660, 60054, ..., 77, 77, 79]], device='cuda:4')}
2024-01-24 10:35:06.687 | DEBUG | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[61196, 60113, 105, ..., 536, 457, 457],
[21211, 101, 6427, ..., 59728, 10856, 59676]], device='cuda:7'), 'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]], device='cuda:7'), 'labels': tensor([[61196, 60113, 105, ..., 536, 457, 457],
[21211, 101, 6427, ..., 59728, 10856, 59676]], device='cuda:7')}
2024-01-24 10:35:06.842 | DEBUG | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[59652, 4387, 12486, ..., 60158, 4898, 21567],
[59604, 106, 78, ..., 60545, 60164, 61068]], device='cuda:6'), 'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]], device='cuda:6'), 'labels': tensor([[59652, 4387, 12486, ..., 60158, 4898, 21567],
[59604, 106, 78, ..., 60545, 60164, 61068]], device='cuda:6')}
2024-01-24 10:35:06.904 | DEBUG | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[ 7019, 8635, 6571, ..., 24141, 21391, 102],
[ 100, 59568, 60509, ..., 60781, 102, 80]], device='cuda:1'), 'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]], device='cuda:1'), 'labels': tensor([[ 7019, 8635, 6571, ..., 24141, 21391, 102],
[ 100, 59568, 60509, ..., 60781, 102, 80]], device='cuda:1')}
2024-01-24 10:35:06.912 | DEBUG | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[ 4202, 101, 2363, ..., 59932, 41236, 59599],
[59594, 85, 23340, ..., 2363, 2480, 2598]], device='cuda:5'), 'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]], device='cuda:5'), 'labels': tensor([[ 4202, 101, 2363, ..., 59932, 41236, 59599],
[59594, 85, 23340, ..., 2363, 2480, 2598]], device='cuda:5')}
Traceback (most recent call last):
File "pretraining.py", line 767, in <module>
main()
File "pretraining.py", line 728, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
result = tuple(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
self._ddp_init_helper(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB. GPU 7 has a total capacty of 79.33 GiB of which 187.06 MiB is free. Process 3001206 has 6.43 GiB memory in use. Process 3007944 has 3.10 GiB memory in use. Including non-PyTorch memory, this process has 69.60 GiB memory in use. Of the allocated memory 68.69 GiB is allocated by PyTorch, and 6.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "pretraining.py", line 767, in <module>
main()
File "pretraining.py", line 728, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
result = tuple(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
self._ddp_init_helper(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB. GPU 6 has a total capacty of 79.33 GiB of which 229.81 MiB is free. Including non-PyTorch memory, this process has 79.09 GiB memory in use. Of the allocated memory 78.04 GiB is allocated by PyTorch, and 53.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "pretraining.py", line 767, in <module>
main()
File "pretraining.py", line 728, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
result = tuple(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
self._ddp_init_helper(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB. GPU 1 has a total capacty of 79.33 GiB of which 190.44 MiB is free. Process 2993049 has 3.16 GiB memory in use. Including non-PyTorch memory, this process has 75.97 GiB memory in use. Of the allocated memory 74.93 GiB is allocated by PyTorch, and 48.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "pretraining.py", line 767, in <module>
main()
File "pretraining.py", line 728, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
result = tuple(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
self._ddp_init_helper(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB. GPU 5 has a total capacty of 79.33 GiB of which 259.81 MiB is free. Including non-PyTorch memory, this process has 79.06 GiB memory in use. Of the allocated memory 78.04 GiB is allocated by PyTorch, and 23.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "pretraining.py", line 767, in <module>
main()
File "pretraining.py", line 728, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
result = tuple(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
self._ddp_init_helper(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB. GPU 3 has a total capacty of 79.33 GiB of which 259.81 MiB is free. Including non-PyTorch memory, this process has 79.06 GiB memory in use. Of the allocated memory 78.04 GiB is allocated by PyTorch, and 23.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "pretraining.py", line 767, in <module>
main()
File "pretraining.py", line 728, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
result = tuple(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
self._ddp_init_helper(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 98.00 MiB. GPU 0 has a total capacty of 79.33 GiB of which 45.81 MiB is free. Including non-PyTorch memory, this process has 79.27 GiB memory in use. Of the allocated memory 78.32 GiB is allocated by PyTorch, and 53.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "pretraining.py", line 767, in <module>
main()
File "pretraining.py", line 728, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
result = tuple(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
self._ddp_init_helper(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB. GPU 4 has a total capacty of 79.33 GiB of which 229.81 MiB is free. Including non-PyTorch memory, this process has 79.09 GiB memory in use. Of the allocated memory 78.04 GiB is allocated by PyTorch, and 53.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "pretraining.py", line 767, in <module>
main()
File "pretraining.py", line 728, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
result = tuple(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
self._ddp_init_helper(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB. GPU 2 has a total capacty of 79.33 GiB of which 259.81 MiB is free. Including non-PyTorch memory, this process has 79.06 GiB memory in use. Of the allocated memory 78.04 GiB is allocated by PyTorch, and 23.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-01-24 10:35:16,640] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3467576) of binary: /home/centos/anaconda3/envs/cpt/bin/python
Traceback (most recent call last):
File "/home/centos/anaconda3/envs/cpt1/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
pretraining.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-01-24_10:35:16
host : host188
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3467577)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-01-24_10:35:16
host : host188
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 3467578)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-01-24_10:35:16
host : host188
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 3467579)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-01-24_10:35:16
host : host188
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 3467580)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-01-24_10:35:16
host : host188
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 3467581)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
time : 2024-01-24_10:35:16
host : host188
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 3467582)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
time : 2024-01-24_10:35:16
host : host188
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 3467583)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-24_10:35:16
host : host188
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3467576)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Describe the bug
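Machine configuration: 8 × 80 GB A800 GPUs. Following the author's hint in an issue, I modified deepspeed_zero_stage2_config.json.
With the smaller chatglm-6B model, full-parameter training through DeepSpeed runs under both ZeRO-1 and ZeRO-2. After switching to Yi-34B-Chat, full-parameter training fails with a GPU out-of-memory error under both ZeRO-1 and ZeRO-2. Is a single 8-GPU server really not enough for full-parameter continued pretraining of a 34B model? run_pt.sh configuration:
The error is as follows:
If this is not a code problem, how many machines are needed to get it running?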
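For context on the memory question, here is a rough back-of-the-envelope estimate of per-GPU memory for model states only. It is a sketch under stated assumptions, not a measurement of this run: it assumes mixed-precision Adam at 16 bytes per parameter (2 for half-precision weights, 2 for half-precision gradients, 12 for fp32 master weights, momentum, and variance), a parameter count of exactly 34e9, and the usual ZeRO partitioning (stage 1 shards optimizer states, stage 2 also shards gradients, stage 3 also shards weights). Activations, communication buffers, and allocator fragmentation are ignored. The function name `model_state_gib` is hypothetical.

```python
# Rough per-GPU memory for model states only (no activations, buffers, or
# fragmentation). Assumption: mixed-precision Adam at 16 bytes per parameter:
#   2 bytes  half-precision weights
#   2 bytes  half-precision gradients
#  12 bytes  fp32 master weights + Adam momentum + Adam variance

GIB = 1024 ** 3


def model_state_gib(n_params, n_gpus, zero_stage):
    """Estimate per-GPU model-state memory in GiB for a given ZeRO stage (0-3)."""
    weights = 2 * n_params
    grads = 2 * n_params
    optim = 12 * n_params
    if zero_stage >= 1:  # stage 1 shards optimizer states across GPUs
        optim /= n_gpus
    if zero_stage >= 2:  # stage 2 additionally shards gradients
        grads /= n_gpus
    if zero_stage >= 3:  # stage 3 additionally shards the weights
        weights /= n_gpus
    return (weights + grads + optim) / GIB


if __name__ == "__main__":
    N_PARAMS = 34e9  # assumption: nominal size of a "34B" model
    N_GPUS = 8       # from the issue: 8 x 80 GB A800
    for stage in range(4):
        gib = model_state_gib(N_PARAMS, N_GPUS, stage)
        print(f"ZeRO stage {stage}: ~{gib:.1f} GiB per GPU (model states only)")
```

Under these assumptions the estimate comes out to roughly 174 GiB per GPU for ZeRO-1 and 119 GiB for ZeRO-2, both above the 79.33 GiB reported in the log, while ZeRO-3 lands near 63 GiB before any activations are counted. With ZeRO-2 each GPU still holds a full half-precision copy of the weights (about 63 GiB here) no matter how many GPUs are added, so the headroom stays tight. The stage-0 row corresponds to fully replicated model states, which is what plain `DistributedDataParallel` (the wrapper that appears in the traceback) does.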