jiahe7ay / MINI_LLM

This is a personal repository for experimenting with and reproducing the pre-training process of an LLM.

OutOfMemoryError at a specific step during pre-training #23

Open Tshirt-hfk opened 6 months ago

Tshirt-hfk commented 6 months ago

I have retried multiple times, and pre-training always hits an OutOfMemoryError at the same step:

```
61%|██████ | 12840/21167 [14:00:18<8:40:59, 3.75s/it]
61%|██████ | 12841/21167 [14:00:21<8:33:50, 3.70s/it]
61%|██████ | 12842/21167 [14:00:25<8:27:47, 3.66s/it]
61%|██████ | 12843/21167 [14:00:28<8:25:30, 3.64s/it]
61%|██████ | 12844/21167 [14:00:32<8:25:06, 3.64s/it]
61%|██████ | 12845/21167 [14:00:36<8:24:13, 3.64s/it]
61%|██████ | 12846/21167 [14:00:39<8:23:55, 3.63s/it]
61%|██████ | 12847/21167 [14:00:43<8:20:11, 3.61s/it]
61%|██████ | 12848/21167 [14:00:46<8:14:37, 3.57s/it]
61%|██████ | 12849/21167 [14:00:50<8:12:50, 3.56s/it]
61%|██████ | 12850/21167 [14:00:53<8:13:45, 3.56s/it]
Traceback (most recent call last):
  File "/home/tiger/MINI_LLM/pre_train.py", line 262, in <module>
    trainer.train(  # 'model_save/pre/checkpoint-3400'
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2758, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 817, in forward
    return model_forward(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 805, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 784, in convert_to_fp32
    return recursively_apply(_convert_to_fp32, tensor, test_type=_is_fp16_bf16_tensor)
  File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 127, in recursively_apply
    {
  File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 128, in <dictcomp>
    k: recursively_apply(
  File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 135, in recursively_apply
    return func(data, *args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 779, in _convert_to_fp32
    return tensor.float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.04 GiB. GPU 7 has a total capacity of 79.35 GiB of which 8.27 GiB is free. Process 2987497 has 71.07 GiB memory in use. Of the allocated memory 65.19 GiB is allocated by PyTorch, and 3.39 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
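
The allocator option mentioned at the end of the error message only mitigates fragmentation; it is an environment variable that must be set before the first CUDA allocation. A minimal sketch of applying it (the launch command shown is just an example, not taken from this repository):

```python
# Option 1: export it in the shell before launching, e.g.
#   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True accelerate launch pre_train.py
#
# Option 2: set it at the very top of pre_train.py, before any CUDA memory is allocated.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
```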

The data is uniformly truncated to a maximum length of 512, batch_size is set to 16, and gradient_accumulation_steps is set to 8. GPU memory is sufficient when training first starts.
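
For context, a minimal sketch of that configuration as `transformers.TrainingArguments`; everything other than the two reported values is an assumption for illustration, not taken from pre_train.py:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="model_save/pre",      # assumption; the traceback references model_save/pre/checkpoint-3400
    per_device_train_batch_size=16,   # "batch_size set to 16"
    gradient_accumulation_steps=8,    # as reported above
    bf16=True,                        # assumption: the traceback shows a mixed-precision -> fp32 cast,
                                      # but whether fp16 or bf16 is used is not stated
)
```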

A1yez1 commented 6 months ago

You also need to add truncation to the tokenizer call in pre_train. Although the data preprocessing caps the text at a maximum length of 512, after tokenization the sequence length is not necessarily at most 512, so GPU memory blows up. Adding truncation at the tokenization step fixes it:

```python
def token_to_id(samples: dict) -> dict:
    # tokenizer, pretrain_args and map_dtype come from the surrounding pre_train.py script
    batch_txt = samples["text"]
    outputs = tokenizer(
        batch_txt,
        padding=False,
        return_attention_mask=False,
        truncation=True,
        max_length=pretrain_args.max_seq_len,
    )
    input_ids = [np.array(item, dtype=map_dtype) for item in outputs["input_ids"]]
    return {"input_ids": input_ids}
```

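For completeness, a sketch of how such a mapping function is typically applied with the `datasets` library; the variable names here are illustrative, not taken from pre_train.py:

```python
# `raw_dataset` is assumed to be a datasets.Dataset with a "text" column.
tokenized_dataset = raw_dataset.map(
    token_to_id,
    batched=True,                              # hand batches of samples to token_to_id
    remove_columns=raw_dataset.column_names,   # keep only the returned "input_ids"
)
```
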
jiahe7ay commented 6 months ago

What's said above is correct. This is a detail I overlooked in my implementation; I'll fix it later. Thanks to the commenter above for the explanation.

jiahe7ay commented 6 months ago

The tokenizer truncation code has been merged into the latest version.

iissy commented 2 months ago

Does truncating the content directly like this have any negative impact on pre-training? Wouldn't it be better to split text that exceeds the maximum length into multiple samples for training?
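
For illustration, a sketch of the alternative the question describes: tokenize without truncation and cut each over-length document into consecutive `max_seq_len` windows, so no text is discarded. This is not code from the repository; the names mirror the `token_to_id` example above:

```python
def token_to_chunks(samples: dict) -> dict:
    # Tokenize without truncation, then split each document's ids into
    # windows of at most max_seq_len tokens.
    max_len = pretrain_args.max_seq_len
    outputs = tokenizer(
        samples["text"],
        padding=False,
        return_attention_mask=False,
        truncation=False,
    )
    input_ids = []
    for ids in outputs["input_ids"]:
        for start in range(0, len(ids), max_len):
            input_ids.append(np.array(ids[start:start + max_len], dtype=map_dtype))
    return {"input_ids": input_ids}
```

With `Dataset.map(batched=True, remove_columns=...)` the function may return more rows than it received, so each long document simply becomes several training samples; whether that helps in practice depends on how much text the truncation was actually discarding.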