hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: found inf during ShardedOptimV2 step #1375

Closed xbasly closed 1 year ago

xbasly commented 2 years ago

🐛 Describe the bug

The configuration follows the example; only the data source is changed. The configurations from 1d to pp work properly, but an error is reported when running zero3.

WARNING colossalai - ShardedOptimizerV2 - WARNING: found inf during ShardedOptimV2 step
Traceback (most recent call last):
  File "/opt/zzz/schedule-train/algorithm/train_gpt.py", line 181, in <module>
    main()
  File "/opt/zzz/schedule-train/algorithm/train_gpt.py", line 171, in main
    trainer.fit(train_dataloader=train_dataloader,
  File "/usr/local/python3/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 321, in fit
    self._train_epoch(
  File "/usr/local/python3/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 187, in _train_epoch
    self.engine.step()
  File "/usr/local/python3/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 155, in step
    return self.optimizer.step()
  File "/usr/local/python3/lib/python3.8/site-packages/colossalai/zero/sharded_optim/sharded_optim_v2.py", line 198, in step
    self._zero_grad(recover_data=True)
  File "/usr/local/python3/lib/python3.8/site-packages/colossalai/zero/sharded_optim/sharded_optim_v2.py", line 259, in _zero_grad
    self._copy_master_param_to_param_fp16(p)
  File "/usr/local/python3/lib/python3.8/site-packages/colossalai/zero/sharded_optim/sharded_optim_v2.py", line 343, in _copy_master_param_to_param_fp16
    p.colo_attr.sharded_data_tensor.payload_relay(p.colo_attr.saved_grad)
  File "/usr/local/python3/lib/python3.8/site-packages/colossalai/gemini/stateful_tensor.py", line 119, in payload_relay
    assert not rhs.is_null()
AssertionError

Environment

latest

1SAA commented 2 years ago

Hi @xbasly,

I don't see any problem when running with the gpt2_zero.py configuration. Could you tell me more about your configuration?

AntoineBlanot commented 2 years ago

I am experiencing the same issue when training a T5 model from Huggingface. Version of colossalai: 0.1.8+torch1.10cu10.2

xbasly commented 2 years ago

> I am experiencing the same issue when training a T5 model from Huggingface. Version of colossalai: 0.1.8+torch1.10cu10.2

Did you solve it? I have the same environment version as you.

feifeibear commented 2 years ago

@xbasly Did you solve your problem? Can you provide a reproducible script for us?

xbasly commented 2 years ago

@feifeibear I haven't solved it. Version of colossalai: 0.1.8+torch1.18cuda10.2. The GPU is a V100. Basically all scripts are the same as the example; only the data source is slightly different. My train_gpt.py:

In the main function, the data source is modified. Other content is the same as in the example.

parser = colossalai.get_default_parser()
parser.add_argument('--from_torch', default=False, action='store_true')
args = parser.parse_args()
args.data_dir = '/dataset/data_dir/'
args.OUTPUT_DIR = '/schedule-train/output'
disable_existing_loggers()
if args.from_torch:
    colossalai.launch_from_torch(config=args.config)
else:
    colossalai.launch_from_slurm(config=args.config, host=args.host, port=29500, seed=42)
logger = get_dist_logger()
logger.info('Build data loader', ranks=[0])
logger.info('data_dir: ' + args.data_dir)
data_file = os.path.join(args.data_dir, "a.txt")
train_ds = QueryDataset(data_file, seq_len=gpc.config.SEQ_LEN)

QueryDataset:

from colossalai.registry import DATASETS
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer


@DATASETS.register_module
class QueryDataset(Dataset):

    def __init__(self, path, seq_len=1024) -> None:
        super().__init__()
        self.seq_len = seq_len
        # Each line of the text file is one training sample.
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        self.data = lines
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.tokenizer.pad_token = self.tokenizer.unk_token

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Pad or truncate every sample to seq_len tokens.
        encoded_data = self.tokenizer(self.data[index],
                                      padding="max_length",
                                      truncation=True,
                                      max_length=self.seq_len,
                                      return_tensors='pt')
        # Returns (model inputs, labels); the labels are the input ids themselves.
        return {'input_ids': encoded_data['input_ids'][0],
                'attention_mask': encoded_data['attention_mask'][0]}, encoded_data['input_ids'][0]

my sh: python /schedule-train/algorithm/torchrun.py --nproc_per_node=2 /schedule-train/algorithm/train_gpt.py --config /schedule-train/algorithm/gpt2_configs/gpt2_zero3.py --from_torch

torchrun.py:

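#!/usr/bin/env python
# coding=UTF-8

import re
import sys
from torch.distributed.run import main

if __name__ == '__main__':
    # Strip a Windows-style launcher suffix from the program name, then run torchrun's entry point.
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())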

My data is just a few English sentences, about 20-30 lines. Can you tell me when the inf problem arises?
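For context on when a warning like this typically appears: in fp16 training with dynamic loss scaling, the optimizer usually checks the scaled gradients for inf/NaN before each update, skips the update when it finds any, and lowers the loss scale. The sketch below is a generic, pure-Python illustration of that pattern. It is not ColossalAI's implementation; the class name `DynamicLossScaler` and all of its parameters and default values are assumptions made for the example.

```python
import math


class DynamicLossScaler:
    """Generic dynamic loss scaler (hypothetical sketch, not ColossalAI's code)."""

    def __init__(self, init_scale=2.0 ** 16, backoff=0.5, growth=2.0, growth_interval=1000):
        self.scale = init_scale
        self.backoff = backoff                  # factor applied to the scale after an overflow
        self.growth = growth                    # factor applied after a run of clean steps
        self.growth_interval = growth_interval  # number of clean steps before growing
        self._good_steps = 0

    def step(self, scaled_grads):
        """Return the unscaled gradients, or None if the update must be skipped."""
        if any(not math.isfinite(g) for g in scaled_grads):
            # Overflow (inf or NaN): shrink the scale and skip this update.
            self.scale *= self.backoff
            self._good_steps = 0
            return None
        self._good_steps += 1
        unscaled = [g / self.scale for g in scaled_grads]
        if self._good_steps % self.growth_interval == 0:
            # Enough clean steps in a row: try a larger scale again.
            self.scale *= self.growth
        return unscaled
```

Under this pattern an occasional skipped step is expected. If the loss itself is already NaN, however, every gradient is non-finite, so every step is skipped and the scale keeps shrinking.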

1SAA commented 2 years ago

Hi @xbasly,

I replaced the Dataset with your QueryDataset and used the script below to generate my sample.txt. Although there is a warning that INF was found, it didn't raise the error mentioned above. I ran the experiment with the latest colossalai library and examples.

def main():
    for _ in range(50):
        with open('./sample.txt', mode='a', encoding='utf-8') as f:
            f.write("I am your father\n")

if __name__ == '__main__':
    main()

tientr commented 2 years ago

Hello, I can confirm this issue persists. My setup: GPT-2 with Zero3, using a custom dataset. Config:

from colossalai.zero.shard_utils import TensorShardStrategy
from colossalai.nn.optimizer import HybridAdam
from titans.model.gpt import gpt2_large
from colossalai.amp import AMP_TYPE

BATCH_SIZE = 1
NUM_EPOCHS = 1
SEQ_LEN = 1024

zero = dict(
    model_config=dict(
        tensor_placement_policy='auto',
        shard_strategy=TensorShardStrategy(),
        reuse_fp16_shard=True
    ),
    optimizer_config=dict()
)

optimizer = dict(
    type=HybridAdam,
    lr=0.00015,
    weight_decay=1e-2,
)

model = dict(
    type=gpt2_large,
    checkpoint=True,
)

The error:

                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 411/788222 [28:13<880:51:13,  4.03s/it, loss=nan, lr=2.5e-5, throughput=0.23098 sample_per_sec, 1.8909 Tflops][09/30/22 02:45:08] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 412/788222 [28:17<866:40:57,  3.96s/it, loss=nan, lr=2.5e-5, throughput=0.26257 sample_per_sec, 2.1496 Tflops][09/30/22 02:45:13] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 413/788222 [28:21<882:08:55,  4.03s/it, loss=nan, lr=2.5e-5, throughput=0.23837 sample_per_sec, 1.9514 Tflops][09/30/22 02:45:16] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 414/788222 [28:25<859:06:53,  3.93s/it, loss=nan, lr=2.5e-5, throughput=0.27178 sample_per_sec, 2.225 Tflops][09/30/22 02:45:20] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 415/788222 [28:29<854:16:41,  3.90s/it, loss=nan, lr=2.5e-5, throughput=0.25965 sample_per_sec, 2.1256 Tflops][09/30/22 02:45:24] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 416/788222 [28:33<867:29:00,  3.96s/it, loss=nan, lr=2.5e-5, throughput=0.24366 sample_per_sec, 1.9947 Tflops][09/30/22 02:45:28] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 417/788222 [28:36<837:36:59,  3.83s/it, loss=nan, lr=2.5e-5, throughput=0.28504 sample_per_sec, 2.3335 Tflops][09/30/22 02:45:31] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 418/788222 [28:40<826:30:42,  3.78s/it, loss=nan, lr=2.5e-5, throughput=0.2734 sample_per_sec, 2.2382 Tflops][09/30/22 02:45:35] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 419/788222 [28:44<839:59:59,  3.84s/it, loss=nan, lr=2.5e-5, throughput=0.25116 sample_per_sec, 2.0561 Tflops][09/30/22 02:45:39] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 420/788222 [28:47<810:51:33,  3.71s/it, loss=nan, lr=2.5e-5, throughput=0.29466 sample_per_sec, 2.4123 Tflops][09/30/22 02:45:43] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 421/788222 [28:51<811:51:23,  3.71s/it, loss=nan, lr=2.5e-5, throughput=0.26885 sample_per_sec, 2.2009 Tflops][09/30/22 02:45:46] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 422/788222 [28:55<797:47:03,  3.65s/it, loss=nan, lr=2.5e-5, throughput=0.28615 sample_per_sec, 2.3426 Tflops][09/30/22 02:45:50] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 423/788222 [28:58<803:14:12,  3.67s/it, loss=nan, lr=2.5e-5, throughput=0.26825 sample_per_sec, 2.196 Tflops][09/30/22 02:45:54] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 424/788222 [29:02<808:59:22,  3.70s/it, loss=nan, lr=2.5e-5, throughput=0.26614 sample_per_sec, 2.1788 Tflops][09/30/22 02:45:57] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 425/788222 [29:06<813:17:07,  3.72s/it, loss=nan, lr=2.5e-5, throughput=0.26586 sample_per_sec, 2.1764 Tflops][09/30/22 02:46:01] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 426/788222 [29:09<795:46:30,  3.64s/it, loss=nan, lr=2.5e-5, throughput=0.28994 sample_per_sec, 2.3736 Tflops][09/30/22 02:46:05] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 427/788222 [29:13<810:27:33,  3.70s/it, loss=nan, lr=2.5e-5, throughput=0.25911 sample_per_sec, 2.1213 Tflops][09/30/22 02:46:08] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 428/788222 [29:17<814:40:59,  3.72s/it, loss=nan, lr=2.5e-5, throughput=0.26546 sample_per_sec, 2.1732 Tflops][09/30/22 02:46:12] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 429/788222 [29:21<817:52:04,  3.74s/it, loss=nan, lr=2.5e-5, throughput=0.26521 sample_per_sec, 2.1712 Tflops][09/30/22 02:46:16] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 430/788222 [29:24<808:51:08,  3.70s/it, loss=nan, lr=2.5e-5, throughput=0.27783 sample_per_sec, 2.2745 Tflops][09/30/22 02:46:19] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 431/788222 [29:28<796:09:33,  3.64s/it, loss=nan, lr=2.5e-5, throughput=0.28555 sample_per_sec, 2.3376 Tflops][09/30/22 02:46:23] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 432/788222 [29:31<786:20:54,  3.59s/it, loss=nan, lr=2.5e-5, throughput=0.2867 sample_per_sec, 2.3471 Tflops][09/30/22 02:46:26] WARNING  colossalai - ShardedOptimizerV2 - WARNING:         
                             /usr/local/lib/python3.7/dist-packages/colossalai/z
                             ero/sharded_optim/sharded_optim_v2.py:197 step     
                    WARNING  colossalai - ShardedOptimizerV2 - WARNING: found   
                             inf during ShardedOptimV2 step                     
[Epoch 0 / Train]:   0% 433/788222 [29:35<779:49:00,  3.56s/it, loss=nan, lr=2.5e-5, throughput=0.28628 sample_per_sec,                 

Once it appears, loss=nan persists for every following iteration. It starts at random: after a few hundred iterations, after 1000, or after 10k.
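Because the starting point varies from run to run, it can help to locate the first iteration with a non-finite loss in a saved log. The sketch below is one way to do that, assuming progress lines in the same shape as the log above (`<step>/<total> [..., loss=<value>, ...]`). The function name `first_nonfinite_step` and the regular expression are hypothetical helpers written for this example, not part of colossalai.

```python
import math
import re

# Captures the iteration counter and the loss value from progress lines such as
# "[Epoch 0 / Train]:   0% 411/788222 [28:13<880:51:13,  4.03s/it, loss=nan, lr=2.5e-5, ...]".
PROGRESS_RE = re.compile(r"(\d+)/\d+ \[.*?loss=([^,\]\s]+)")


def first_nonfinite_step(log_lines):
    """Return the first iteration whose logged loss is nan or inf, or None if there is none."""
    for line in log_lines:
        match = PROGRESS_RE.search(line)
        if match is None:
            continue  # not a progress line (for example, a WARNING line)
        try:
            loss = float(match.group(2))
        except ValueError:
            continue  # the loss field was not a number; ignore this line
        if not math.isfinite(loss):
            return int(match.group(1))
    return None
```

Calling it on the lines of a captured stdout file, for example `first_nonfinite_step(open(path, encoding="utf-8"))` with `path` pointing at that file, gives the iteration to inspect, along with the batches fed just before it.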

binmakeswell commented 1 year ago

We have updated a lot. This issue was closed due to inactivity. Thanks.