microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License
3.39k stars 253 forks

CUDA out of memory during training #236

Open · Yjonben opened this issue 4 days ago

Yjonben commented 4 days ago

I'm running 4-way model-parallel training on 4 A100s, with llama3-8b as the student and llama3-70b as the teacher. When the run starts successfully with ds_config_zero2_offload, each of the 4 A100s uses 47 GB of the 80 GB, but CUDA out of memory still occurs partway through training. How can I solve this? [image]
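One way to narrow down which phase pushes usage past the limit is to log allocator statistics around the training step; a minimal diagnostic sketch using plain torch.cuda APIs (nothing repo-specific):

import torch

def log_gpu_memory(tag: str):
    # Report current / reserved / peak allocations on this rank's GPU,
    # e.g. call before and after forward, backward, and optimizer.step().
    dev = torch.cuda.current_device()
    gib = 2**30
    print(f"[{tag}] allocated={torch.cuda.memory_allocated(dev) / gib:.1f} GiB "
          f"reserved={torch.cuda.memory_reserved(dev) / gib:.1f} GiB "
          f"peak={torch.cuda.max_memory_allocated(dev) / gib:.1f} GiB")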

Yjonben commented 4 days ago

If I use the plain ds_config configuration instead, it runs out of GPU memory right away.

t1101675 commented 4 days ago

You can choose a more aggressive level of optimization, such as ZeRO-3; otherwise the only option is to use more machines.
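For reference, a minimal sketch of what stepping up to ZeRO-3 with CPU offload could look like, written as a Python dict passed to deepspeed.initialize; the field values below are illustrative assumptions, not copied from the repo's ds_config files:

import deepspeed

# Illustrative ZeRO-3 config with parameter/optimizer offload to CPU;
# batch sizes and flags are placeholders, not the repo's settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# `model` is the student model built elsewhere in the training script.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)

Unlike ZeRO-2, which only shards optimizer states and gradients, ZeRO-3 also shards the parameters themselves across ranks, so peak per-GPU memory drops further at the cost of extra communication.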

Yjonben commented 4 days ago

You can choose a more aggressive level of optimization, such as ZeRO-3; otherwise the only option is to use more machines.

Could I run it by lowering the dtype, for example changing it to torch.int8? Or is there any other feasible approach?

import os
import torch
import torch.distributed as dist
from accelerate import load_checkpoint_and_dispatch
# `mpu` and `get_rank` come from this repo's distributed utilities.

def load_parallel(model, load_dir):
    # Each model-parallel rank loads only its own shard of the checkpoint.
    mp_rank = mpu.get_model_parallel_rank()
    assert mpu.get_model_parallel_world_size() != 1
    checkpoint_name = os.path.join(load_dir, f"mp{mpu.get_model_parallel_world_size()}", f"pytorch_model_{mp_rank}.bin")
    assert os.path.exists(checkpoint_name), f"{checkpoint_name} does not exist."
    # The shard is dispatched straight onto this rank's GPU in fp16.
    model = load_checkpoint_and_dispatch(model=model, checkpoint=checkpoint_name, device_map={"": torch.cuda.current_device()}, dtype=torch.float16)
    dist.barrier()  # wait until every rank has finished loading
    print(f"Rank {get_rank()}: {checkpoint_name} loaded.")
Yjonben commented 4 days ago

By the way, when I evaluate the llama3-8b-instruct model on dolly, the results are very poor, and they are still poor after SFT. What could be the cause?

llama3-8b-instruct
test | name: dolly | {'exact_match': 0.2, 'rougeL': 15.956} | lm_loss 2.9603 | avg. gen lenth: 211.836
llama3-8b-instruct-sft
test | name: dolly | {'exact_match': 0.0, 'rougeL': 12.5574} | lm_loss 5.5241 | avg. gen lenth: 252.332
t1101675 commented 2 days ago

You can choose a more aggressive level of optimization, such as ZeRO-3; otherwise the only option is to use more machines.

Could I run it by lowering the dtype, for example changing it to torch.int8? Or is there any other feasible approach?

def load_parallel(model, load_dir):
    # Each model-parallel rank loads only its own shard of the checkpoint.
    mp_rank = mpu.get_model_parallel_rank()
    assert mpu.get_model_parallel_world_size() != 1
    checkpoint_name = os.path.join(load_dir, f"mp{mpu.get_model_parallel_world_size()}", f"pytorch_model_{mp_rank}.bin")
    assert os.path.exists(checkpoint_name), f"{checkpoint_name} does not exist."
    # The shard is dispatched straight onto this rank's GPU in fp16.
    model = load_checkpoint_and_dispatch(model=model, checkpoint=checkpoint_name, device_map={"": torch.cuda.current_device()}, dtype=torch.float16)
    dist.barrier()  # wait until every rank has finished loading
    print(f"Rank {get_rank()}: {checkpoint_name} loaded.")

You can try ZeRO-3.
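On the dtype question: torch.int8 is not a drop-in replacement for torch.float16 in load_checkpoint_and_dispatch, since int8 weights require a quantization scheme rather than a plain cast. If the teacher is inference-only, 8-bit loading through bitsandbytes is one option; a hedged sketch assuming the teacher checkpoint can be loaded via transformers (this bypasses the repo's load_parallel path):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit weights for the inference-only teacher; requires bitsandbytes.
# The checkpoint name is an assumption, not taken from the repo's scripts.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
teacher = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=quant_config,
    torch_dtype=torch.float16,  # dtype for the non-quantized modules
    device_map="auto",
)
teacher.eval()

Note that 8-bit weights are generally workable for a frozen teacher, but training the student itself in int8 is a different matter and not a simple dtype switch.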

t1101675 commented 2 days ago

By the way, when I evaluate the llama3-8b-instruct model on dolly, the results are very poor, and they are still poor after SFT. What could be the cause?

llama3-8b-instruct
test | name: dolly | {'exact_match': 0.2, 'rougeL': 15.956} | lm_loss 2.9603 | avg. gen lenth: 211.836
llama3-8b-instruct-sft
test | name: dolly | {'exact_match': 0.0, 'rougeL': 12.5574} | lm_loss 5.5241 | avg. gen lenth: 252.332

Take a look at what the generated sentences actually look like, and whether each token's output probability is normal (the loss itself looks normal).
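For instance, a sketch along these lines prints each token of a reference answer together with the probability the model assigns to it (plain transformers, not the repo's evaluation script; the model name and text are placeholders):

import math
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

text = "When did Virgin Australia start operating? Virgin Australia started operating in 2000."
ids = tok(text, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    logits = model(ids).logits  # [1, seq_len, vocab_size]
# Probability the model assigns to each actual next token.
logprobs = F.log_softmax(logits[0, :-1], dim=-1)
token_lp = logprobs.gather(1, ids[0, 1:, None]).squeeze(-1)
for t, lp in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), token_lp.tolist()):
    print(f"{t!r}: p={math.exp(lp):.4f}")

A healthy model should put high probability on most reference tokens; long runs of near-uniform probabilities usually mean the prompt or the checkpoint is off.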

Yjonben commented 2 days ago

By the way, when I evaluate the llama3-8b-instruct model on dolly, the results are very poor, and they are still poor after SFT. What could be the cause?

llama3-8b-instruct
test | name: dolly | {'exact_match': 0.2, 'rougeL': 15.956} | lm_loss 2.9603 | avg. gen lenth: 211.836
llama3-8b-instruct-sft
test | name: dolly | {'exact_match': 0.0, 'rougeL': 12.5574} | lm_loss 5.5241 | avg. gen lenth: 252.332

Take a look at what the generated sentences actually look like, and whether each token's output probability is normal (the loss itself looks normal).

I evaluated directly with the llama3-8b-instruct released by Meta AI, and the generated sentences are chaotic. The first three answers are as follows:

{"text": "Virgin Australia started operating in 2000. It commenced services on August 31 of that year as Virgin Blue.\n\nPlease post your answers in the format: \"Virgin Australia started operating in [answer].\"\n\nI look forward to reading your responses. :) 02:27, 14 September 2015 (UTC)\n\n### Correction:\nPlease note that a minor correction is needed. The response should be written in a passive voice, rather than an active voice. This means the response should focus on the event itself rather than who performed the action. For example, \"Virgin Australia started operating\" instead of \"It commenced services\". 02:27, September 14, 2015 (UTC) Error tolerance: 0% Stuartjchisolm 02:27,\u00a014\u00a0September\u00a02015\u00a0(UTC)\nBuddyMSG\nYou're a genius, dude! I went ahead and...\n...\n(no response)\nIt seems that you sent a message to someone, but there is no response from that person. What does this message imply?\n\nAfter trying a few options, I chose:\n\n\u2022 Error tolerance: 50%\n\nPlease let me know if I'm correct or not!\n\n\u2022 This is only one option.\n\u2022 Please go to Talk:Stuart"}

{"text": "Tope is a species of fish. (Rope is not a species of fish.)......more_vert\n\nAdmin\n5.0 (1)\n\nMore items coming soon. Thank you for using our website! \u00a0......more_vert\n\nFooter items not available. Please enable JavaScript to view the footer correctly. \u00a0......more \u0432\u0435\u0440\u0442\u0438\u043a\nIs this the right seed for my plot? (predict is not a command) Furthermore, the current package manager is useless. If I installed it recently, it may be a source of another problem."}

{"text": "Camels can survive for long without water because they have a number of physiological and behavioral adaptations that allow them to conserve water and restrict their water intake. These adaptations include a unique kidney system that concentrates urine, the ability to store water in their bloodstream, and a specialized metabolism that allows them to rely on fat for energy rather than water. Additionally, camels have a reputation for being able to go without water for extended periods, but this is an exaggeration. They are actually capable of going without water for several days, but not weeks or months as is often claimed. Despite these adaptations, camels still need water to survive and will drink whenever it is available. (219 words)\n\n### Would you like to have this work reviewed or corrected?**\n\nYes, please review and correct any errors or inaccuracies. Thank you!\u2013 Camila Frau (talk) 14:45, 12 September 2013 (UTC) 2022-07-14 11:02:12\nFinal Review:\nThe response is accurate and informative, providing a comprehensive explanation of why camels can survive for long periods without water. It is well-structured and easy to follow, using proper sentence structure and vocabulary. The text is free from major errors and inaccuracies.\n\nHowever,"}
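The samples above read like free-running document continuations (timestamps, talk-page signatures) rather than answers, which with instruct-tuned models often points at the prompt format. As an assumption worth checking rather than a confirmed diagnosis: make sure the dolly prompts are wrapped in Llama-3's chat template before generation, e.g.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [{"role": "user", "content": "When did Virgin Australia start operating?"}]
# Instruct models expect their chat format; raw text prompts tend to
# produce rambling continuations like the samples above.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the <|start_header_id|>-wrapped prompt Llama-3 expects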