kongds / scaling_sentemb

Scaling Sentence Embeddings with Large Language Models

Is there a GPU memory leak in the program? #11

Closed ZBWpro closed 6 months ago

ZBWpro commented 6 months ago

After changing the prompt while keeping the original hyperparameters and dataset, I fine-tuned opt-6.7b and the model suddenly hit CUDA out of memory at 5% of training. I find this very strange, because the new prompt has fewer tokens than the original one, and everything behaved normally before reaching 5%.

Partial output:

  5%|▍         | 50/1077 [03:47<1:17:04,  4.50s/it]

{'loss': 0.7197, 'learning_rate': 0.00025, 'epoch': 0.05}

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 3; 23.65 GiB total capacity; 17.96 GiB already allocated; 372.56 MiB free; 22.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2377859 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2377860 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2377861 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 2377862) of binary: ...

I read the source code but could not pinpoint the problem. Have you run into anything similar on your side?

kongds commented 6 months ago

Thanks for your interest in our work.

This is most likely an OOM while saving the model, possibly related to the peft and transformers versions. Please install the environment according to requirements.txt; with that setup you should not hit this issue on 4x3090.

ZBWpro commented 6 months ago

Thank you very much for the quick reply. After installing the environment per requirements.txt and training in parallel on 4 x 4090 GPUs, I found that the problem now appears even earlier, at 1%. The relevant output is:

 1%|          | 6/1077 [00:29<1:27:08,  4.88s/it]Traceback (most recent call last):

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 2; 23.65 GiB total capacity; 17.96 GiB already allocated; 372.56 MiB free; 22.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2431678 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2431679 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2431681 closing signal SIGTERM

The prompt I tried is:

This sentence : "*sent_0*" means something

The sentence embedding is still obtained via:

pooler_output = model(output_hidden_states=True, return_dict=True, **inputs).hidden_states[-1][:, -1, :]
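For completeness, here is a minimal standalone sketch of this extraction path (prompt construction plus last-token pooling). The opt-125m checkpoint, the example sentence, and the single-sentence, no-batching setup are illustrative choices only, not the training configuration:

    # Minimal sketch of last-token pooling with a prompt, assuming a plain
    # transformers setup; opt-125m is used only so the example runs cheaply.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "facebook/opt-125m"  # illustrative small checkpoint, not the 6.7B model discussed above
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    sentence = "A man is playing a guitar."
    prompt = f'This sentence : "{sentence}" means something'

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(output_hidden_states=True, return_dict=True, **inputs)

    # last hidden layer, last token position -> sentence embedding
    # note: with batched inputs, left padding is needed so that position -1 is a real token
    pooler_output = outputs.hidden_states[-1][:, -1, :]
    print(pooler_output.shape)  # torch.Size([1, hidden_size])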

Would it be convenient for you to test this prompt?

kongds commented 6 months ago

I took a look, and OPT's tokenizer seems to have an issue that can make the processed text exceed the length limit. You could try adding the following after this line and running again: https://github.com/kongds/scaling_sentemb/blob/8567aa083c1b3c77586670f91e7f78eb80694ad3/ft_llm.py#L232

        # in some cases the string from decode(encode(text)[:cutoff_len]) re-encodes to more than cutoff_len tokens
        if len(tokenizer.encode(input, add_special_tokens=False)) > cutoff_len:
            input = tokenizer.decode(tokenizer.encode(input, add_special_tokens=False)[:cutoff_len])
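For anyone hitting the same thing outside ft_llm.py, below is a self-contained sketch of the same guard; the helper name truncate_to_token_budget and the opt-125m checkpoint are illustrative, not part of the repository:

    # Standalone version of the guard above; truncate_to_token_budget is a
    # hypothetical helper name, and opt-125m is used only for illustration.
    from transformers import AutoTokenizer

    def truncate_to_token_budget(tokenizer, text, cutoff_len):
        """Truncate text so that re-encoding it stays within cutoff_len tokens."""
        ids = tokenizer.encode(text, add_special_tokens=False)
        if len(ids) > cutoff_len:
            text = tokenizer.decode(ids[:cutoff_len])
        # with byte-level BPE tokenizers (as used by OPT), the decoded string can
        # re-encode to more tokens than cutoff_len, so check and truncate once more
        if len(tokenizer.encode(text, add_special_tokens=False)) > cutoff_len:
            text = tokenizer.decode(tokenizer.encode(text, add_special_tokens=False)[:cutoff_len])
        return text

    tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
    clipped = truncate_to_token_budget(tok, "a long training sentence " * 20, cutoff_len=32)
    print(len(tok.encode(clipped, add_special_tokens=False)))  # token count after clipping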
ZBWpro commented 6 months ago

Thanks to the author for the patient explanation. After adding the code above, the model completes training, so I will close this issue.