kongds / scaling_sentemb

Scaling Sentence Embeddings with Large Language Models

Is there a GPU memory leak in the program? #11

Closed ZBWpro closed 10 months ago

ZBWpro commented 10 months ago

After swapping in a different prompt while keeping the original hyperparameters and dataset, I fine-tuned opt-6.7b, and the model suddenly hit CUDA out of memory at 5% of training. This puzzles me, because the new prompt contains fewer tokens than the original one, and everything behaved normally up to that point.

Part of the output is shown below:

  5%|▍         | 50/1077 [03:47<1:17:04,  4.50s/it]

{'loss': 0.7197, 'learning_rate': 0.00025, 'epoch': 0.05}

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 3; 23.65 GiB total capacity; 17.96 GiB already allocated; 372.56 MiB free; 22.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2377859 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2377860 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2377861 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 2377862) of binary: ...

I read through the source code but could not find the cause. Have you run into anything similar on your end?

kongds commented 10 months ago

Thanks for your interest in our work.

This looks like an OOM that occurs while saving the model, possibly related to the versions of peft and transformers. Please set up the environment according to requirements.txt; with that setup you should not hit this problem on 4x3090.

ZBWpro commented 10 months ago

Thank you very much for the quick reply. After installing the environment according to requirements.txt and training in parallel on 4 x 4090, I find that the problem now appears even earlier, at 1%. The relevant output is as follows:

 1%|          | 6/1077 [00:29<1:27:08,  4.88s/it]Traceback (most recent call last):

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 2; 23.65 GiB total capacity; 17.96 GiB already allocated; 372.56 MiB free; 22.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2431678 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2431679 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2431681 closing signal SIGTERM

The prompt I tried is:

This sentence : "*sent_0*" means something

The sentence embedding is still obtained via the line below (a self-contained sketch is included at the end of this comment):

pooler_output = model(output_hidden_states=True, return_dict=True, **inputs).hidden_states[-1][:, -1, :]

Would it be convenient for you to test this prompt?
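
For reference, a minimal, self-contained sketch of that extraction (the checkpoint name, example sentence, and half-precision loading are assumptions for illustration, not the repository's training script):

    # Minimal sketch, not the repository's training code: embed one sentence
    # with opt-6.7b using the prompt above and the last token's hidden state.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "facebook/opt-6.7b"  # assumed checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )

    sentence = "A man is playing a guitar."  # example input
    prompt = f'This sentence : "{sentence}" means something'  # *sent_0* -> sentence

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(output_hidden_states=True, return_dict=True, **inputs)

    # last hidden layer, last token position -> one vector per sentence
    pooler_output = outputs.hidden_states[-1][:, -1, :]
    print(pooler_output.shape)  # (1, hidden_size)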

kongds commented 10 months ago

I took a look: the OPT tokenizer seems to have an issue that can make sequences exceed the intended length. You could try adding the following after this line and running again: https://github.com/kongds/scaling_sentemb/blob/8567aa083c1b3c77586670f91e7f78eb80694ad3/ft_llm.py#L232

        # in some cases decode(encode(text)[:len]) re-encodes to more than len tokens, so re-truncate after decoding
        if len(tokenizer.encode(input, add_special_tokens=False)) > cutoff_len:
            input = tokenizer.decode(tokenizer.encode(input, add_special_tokens=False)[:cutoff_len])
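
To check whether a given input triggers this round-trip mismatch, one can compare token counts directly; a small illustrative sketch (the cutoff value and sample string are placeholders):

    # Illustrative check of the round-trip issue described above: for some
    # strings, decode(encode(text)[:cutoff_len]) re-encodes to more than
    # cutoff_len tokens, which is why the guard above truncates again.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
    cutoff_len = 32  # placeholder value

    text = 'This sentence : "A man is playing a guitar." means something'
    ids = tokenizer.encode(text, add_special_tokens=False)[:cutoff_len]
    reencoded = tokenizer.encode(tokenizer.decode(ids), add_special_tokens=False)
    print(len(ids), len(reencoded))  # the second count may exceed cutoff_len
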
ZBWpro commented 10 months ago

Thank you for the patient explanation. After adding the code above, the model completes training, so I will close this issue.