Closed xbasly closed 1 year ago
Hi @xbasly,
I don't have any problem when running with the configuration gpt2_zero.py. Could you tell me more about your configuration?
I am experiencing the same issue when training a T5 model from Huggingface. Version of colossalai: 0.1.8+torch1.10cu10.2
Did you solve it? I have the same environment versions as you.
@xbasly Did you solve your problem? Can you guys provide a reproducible script for us?
@feifeibear I haven't solved it. Version of colossalai: 0.1.8+torch1.18cuda10.2, and the GPU is a V100. Basically all scripts are the same as the example; only the data source is slightly different. My train_gpt.py:
In the main function, only the data source is modified; the other content is the same as in the example.

```python
# imports as used by the example script
import os

import colossalai
from colossalai.core import global_context as gpc
from colossalai.logging import disable_existing_loggers, get_dist_logger

parser = colossalai.get_default_parser()
parser.add_argument('--from_torch', default=False, action='store_true')
args = parser.parse_args()
args.data_dir = '/dataset/data_dir/'
args.OUTPUT_DIR = '/schedule-train/output'

disable_existing_loggers()
if args.from_torch:
    colossalai.launch_from_torch(config=args.config)
else:
    colossalai.launch_from_slurm(config=args.config, host=args.host, port=29500, seed=42)

logger = get_dist_logger()
logger.info('Build data loader', ranks=[0])
logger.info('data_dir: ' + args.data_dir)

data_file = os.path.join(args.data_dir, "a.txt")
train_ds = QueryDataset(data_file, seq_len=gpc.config.SEQ_LEN)
```
QueryDataset:

```python
from colossalai.registry import DATASETS
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer


@DATASETS.register_module
class QueryDataset(Dataset):

    def __init__(self, path, seq_len=1024) -> None:
        super().__init__()
        self.seq_len = seq_len
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        self.data = lines
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.tokenizer.pad_token = self.tokenizer.unk_token

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        encoded_data = self.tokenizer(self.data[index],
                                      padding="max_length",
                                      truncation=True,
                                      max_length=self.seq_len,
                                      return_tensors='pt')
        # returns (model inputs, labels); the labels are the input ids themselves
        return {'input_ids': encoded_data['input_ids'][0],
                'attention_mask': encoded_data['attention_mask'][0]}, encoded_data['input_ids'][0]
```
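For context, a sketch of how this dataset is typically wrapped into a dataloader in the GPT example; `get_dataloader` and `gpc.config.BATCH_SIZE` follow the ColossalAI example script rather than the exact code posted here, so treat the glue code as an assumption:

```python
from colossalai.core import global_context as gpc
from colossalai.utils import get_dataloader

# Hypothetical glue code mirroring the GPT example: wrap QueryDataset in a
# distributed-aware dataloader; BATCH_SIZE comes from the config file.
train_dataloader = get_dataloader(train_ds,
                                  seed=42,
                                  batch_size=gpc.config.BATCH_SIZE,
                                  pin_memory=True,
                                  shuffle=True,
                                  drop_last=True)
```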
My launch command:

```
python /schedule-train/algorithm/torchrun.py --nproc_per_node=2 /schedule-train/algorithm/train_gpt.py --config /schedule-train/algorithm/gpt2_configs/gpt2_zero3.py --from_torch
```
torchrun.py:

```python
import re
import sys

from torch.distributed.run import main

if __name__ == '__main__':
    # strip the "-script.pyw" / ".exe" suffix that console-script wrappers add
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())
```
My data is just a few English sentences, about 20-30 lines. Can you tell me when the inf problem arises?
Hi @xbasly,
I substituted the example Dataset with your QueryDataset and used the script below to generate my sample.txt. Although there is a warning about INF being found, it didn't raise the error mentioned above. I ran the experiment with the latest colossalai library and examples.
```python
def main():
    for _ in range(50):
        with open('./sample.txt', mode='a', encoding='utf-8') as f:
            f.write("I am your father\n")


if __name__ == '__main__':
    main()
```
Hello, I can confirm this issue persists. My setup: GPT-2 with ZeRO-3, using a custom dataset. Config:
```python
from colossalai.zero.shard_utils import TensorShardStrategy
from colossalai.nn.optimizer import HybridAdam  # needed for the optimizer dict below
from colossalai.amp import AMP_TYPE
from titans.model.gpt import gpt2_large

BATCH_SIZE = 1
NUM_EPOCHS = 1
SEQ_LEN = 1024

zero = dict(
    model_config=dict(
        tensor_placement_policy='auto',
        shard_strategy=TensorShardStrategy(),
        reuse_fp16_shard=True
    ),
    optimizer_config=dict()
)

optimizer = dict(
    type=HybridAdam,
    lr=0.00015,
    weight_decay=1e-2,
)

model = dict(
    type=gpt2_large,
    checkpoint=True,
)
```
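For reference, a hedged variant of the zero block above with the fp16 loss-scaling knobs spelled out. Whether these keyword names (initial_scale, growth_interval, and so on) are forwarded to ShardedOptimizerV2 through optimizer_config depends on the installed colossalai version, so verify before relying on it:

```python
from colossalai.zero.shard_utils import TensorShardStrategy

# Sketch only: start with a smaller loss scale and back off gently so an
# occasional fp16 overflow does not stall every optimizer step. The kwargs
# are assumed to be passed through to ShardedOptimizerV2.
zero = dict(
    model_config=dict(
        tensor_placement_policy='auto',
        shard_strategy=TensorShardStrategy(),
        reuse_fp16_shard=True
    ),
    optimizer_config=dict(
        initial_scale=2**5,
        min_scale=1,
        growth_factor=2,
        backoff_factor=0.5,
        growth_interval=1000,
        hysteresis=2,
        max_scale=2**32,
    )
)
```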
The error (the same ShardedOptimizerV2 warning is emitted at every step; excerpt):

```
[09/30/22 02:45:08] WARNING colossalai - ShardedOptimizerV2 - WARNING: found inf during ShardedOptimV2 step
                    /usr/local/lib/python3.7/dist-packages/colossalai/zero/sharded_optim/sharded_optim_v2.py:197 step
[Epoch 0 / Train]:  0% 411/788222 [28:13<880:51:13, 4.03s/it, loss=nan, lr=2.5e-5, throughput=0.23098 sample_per_sec, 1.8909 Tflops]
[Epoch 0 / Train]:  0% 412/788222 [28:17<866:40:57, 3.96s/it, loss=nan, lr=2.5e-5, throughput=0.26257 sample_per_sec, 2.1496 Tflops]
[Epoch 0 / Train]:  0% 413/788222 [28:21<882:08:55, 4.03s/it, loss=nan, lr=2.5e-5, throughput=0.23837 sample_per_sec, 1.9514 Tflops]
...
[Epoch 0 / Train]:  0% 432/788222 [29:31<786:20:54, 3.59s/it, loss=nan, lr=2.5e-5, throughput=0.2867 sample_per_sec, 2.3471 Tflops]
[Epoch 0 / Train]:  0% 433/788222 [29:35<779:49:00, 3.56s/it, loss=nan, lr=2.5e-5, throughput=0.28628 sample_per_sec,
```
loss=nan persists forever, on every iteration, and this can start randomly: after 1000 iterations, 10k iterations, or even just a few hundred.
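One way to narrow this down is to check, outside the colossalai Trainer, whether non-finite values already appear in a plain fp32 forward/backward pass or only inside the fp16 optimizer step. A minimal plain-PyTorch sketch; `model`, `criterion`, `inputs` and `labels` are placeholders for whatever the training script builds:

```python
import torch


def check_one_step(model, criterion, inputs, labels):
    """Run a single fp32 forward/backward pass and report any non-finite values."""
    outputs = model(**inputs)
    loss = criterion(outputs, labels)
    if not torch.isfinite(loss):
        print('non-finite loss detected:', loss.item())
    loss.backward()
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print('non-finite gradient in', name)
```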
We have updated a lot. This issue was closed due to inactivity. Thanks.
🐛 Describe the bug
The configuration comes from the example; only the data source is changed. The configurations from 1d to pp work properly, but an error is reported when running zero3.
```
WARNING colossalai - ShardedOptimizerV2 - WARNING: found inf during ShardedOptimV2 step
Traceback (most recent call last):
  File "/opt/zzz/schedule-train/algorithm/train_gpt.py", line 181, in
    main()
  File "/opt/zzz/schedule-train/algorithm/train_gpt.py", line 171, in main
    trainer.fit(train_dataloader=train_dataloader,
  File "/usr/local/python3/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 321, in fit
    self._train_epoch(
  File "/usr/local/python3/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 187, in _train_epoch
    self.engine.step()
  File "/usr/local/python3/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 155, in step
    return self.optimizer.step()
  File "/usr/local/python3/lib/python3.8/site-packages/colossalai/zero/sharded_optim/sharded_optim_v2.py", line 198, in step
    self._zero_grad(recover_data=True)
  File "/usr/local/python3/lib/python3.8/site-packages/colossalai/zero/sharded_optim/sharded_optim_v2.py", line 259, in _zero_grad
    self._copy_master_param_to_param_fp16(p)
  File "/usr/local/python3/lib/python3.8/site-packages/colossalai/zero/sharded_optim/sharded_optim_v2.py", line 343, in _copy_master_param_to_param_fp16
    p.colo_attr.sharded_data_tensor.payload_relay(p.colo_attr.saved_grad)
  File "/usr/local/python3/lib/python3.8/site-packages/colossalai/gemini/stateful_tensor.py", line 119, in payload_relay
    assert not rhs.is_null()
AssertionError
```

(the same traceback is printed by each worker process, so the raw output interleaves two copies)
Environment
latest