h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/

A bug when running h2ogpt-4096-llama2-70b-chat #692

Open · babytdream opened this issue 1 year ago

babytdream commented 1 year ago

I have 16 × A10 GPUs. But when I run the recommended command

python generate.py --base_model=meta-llama/Llama-2-70b-chat-hf --prompt_type=llama2 --rope_scaling="{'type': 'linear', 'factor': 4}" --use_gpu_id=False --save_dir=savemeta70b

it fails with a CUDA "peer mapping resources exhausted" error (screenshot).

pseudotensor commented 1 year ago

https://discuss.pytorch.org/t/cuda-error-peer-mapping-resources-exhausted/167814 https://forums.developer.nvidia.com/t/cuda-peer-resources-error-when-running-on-more-than-8-k80s-aws-p2-16xlarge/45351

I'm not familiar with the issue, but it should work if you restrict to 8 GPUs using CUDA_VISIBLE_DEVICES, e.g.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python generate.py --base_model=meta-llama/Llama-2-70b-chat-hf --prompt_type=llama2 --rope_scaling="{'type': 'linear', 'factor': 4}" --use_gpu_id=False --save_dir=savemeta70b

Note that rope scaling in transformers is still experimental. It works in many cases, but may not in all.
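If it helps to see the same knob outside h2oGPT, recent transformers versions accept rope_scaling as a config override at load time. A minimal sketch (assuming transformers >= 4.31 with accelerate installed; not h2oGPT's actual loading path):

```python
# Minimal sketch of linear RoPE scaling in plain transformers
# (assumes transformers >= 4.31; unused kwargs are forwarded
# into the Llama config).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-chat-hf",
    rope_scaling={"type": "linear", "factor": 4.0},  # 4x the trained context length
    device_map="auto",  # shard across visible GPUs (requires accelerate)
)
```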

It may be possible to use all 16 GPUs with methods that don't rely on P2P, e.g. distributed approaches. @arnocandel, do you know?

babytdream commented 1 year ago

Hello, 8 GPUs works! But more GPUs would be helpful. Do you use torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel in h2ogpt? Thanks!

pseudotensor commented 1 year ago

@arnocandel would know best about that.
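For reference, the two differ mainly in process model: DataParallel replicates the module across GPUs inside a single process, while DistributedDataParallel runs one process per GPU, usually launched with torchrun. A generic sketch, not h2oGPT's actual code:

```python
# Generic PyTorch contrast, not h2oGPT's code.
import torch.nn as nn

model = nn.Linear(512, 512).cuda()

# Single process, multi-GPU: the module is replicated on each forward pass.
dp_model = nn.DataParallel(model)

# One process per GPU (launch with torchrun); needs a process group first:
#   torch.distributed.init_process_group("nccl")
#   ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```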

babytdream commented 1 year ago

I find that this error only appears with more than 11 GPUs. When I use 10 or fewer GPUs, it works! For example:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9 python generate.py --base_model=/data/model/h2ogpt-4096-llama2-70b-chat/ --prompt_type=llama2 --use_gpu_id=False --share=True

(screenshot)
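A small diagnostic (my own sketch, not from h2oGPT) can probe which GPU pairs report P2P capability. CUDA has historically allowed only eight simultaneous peer mappings per device, which would explain a ceiling around 8-10 fully connected GPUs:

```python
# P2P capability probe (illustrative sketch, not part of h2oGPT).
# can_device_access_peer reports hardware capability only; the
# "peer mapping resources exhausted" error comes from enabling more
# simultaneous mappings than the per-device limit allows.
import torch

n = torch.cuda.device_count()
for i in range(n):
    peers = [j for j in range(n)
             if j != i and torch.cuda.can_device_access_peer(i, j)]
    print(f"GPU {i}: P2P-capable peers = {peers}")
```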

pseudotensor commented 1 year ago

Good! @arnocandel would know best how to run h2oGPT in distributed mode to make use of more than 8 GPUs (e.g. in an 8 + 8 arrangement).

arnocandel commented 1 year ago

I haven't ever done distributed generation with > 8 GPUs, just distributed training (via ddp=False), but even that is currently not working; WIP to bring it back: https://github.com/h2oai/h2ogpt/pull/644. It's lower priority, though, since H2O LLM Studio is working on DeepSpeed: https://github.com/h2oai/h2o-llmstudio/pull/288

More GPUs don't make it faster; they just allow for larger models: https://github.com/h2oai/h2ogpt/blob/main/benchmarks/perf.md
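For context, a standalone sketch of the sharded-loading idea behind --use_gpu_id=False (assuming accelerate is installed; not h2oGPT's exact code):

```python
# Layer sharding across GPUs via accelerate's device_map (sketch).
# More GPUs let a bigger model fit, but generation is no faster:
# each token still passes through the layers in sequence, GPU to GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-chat-hf"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # spread layers over all visible GPUs
    torch_dtype="auto",
)
```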

babytdream commented 1 year ago

@arnocandel Thanks! The use of 16 GPUs is mainly to fine-tune a big model like Llama-2-70B with LoRA. Do you have any ideas for fine-tuning on 16 GPUs?

pseudotensor commented 1 year ago

@babytdream For fine-tuning, it's best to try out LLM Studio. It's better developed and easier to use for fine-tuning than h2oGPT: https://github.com/h2oai/h2o-llmstudio

arnocandel commented 1 year ago

@babytdream you can certainly fine-tune using 16 separate processes:

I did something similar here across 3 machines, with 2-3 processes each: https://github.com/h2oai/h2ogpt/blob/542f20495c46fcb44d60c05e1c8fde60e15f8aeb/finetune.py#L664-L669

You could try something like this (NCCL_P2P_LEVEL=LOC tells NCCL never to use P2P, which sidesteps the peer-mapping limit):

NCCL_P2P_LEVEL=LOC WORLD_SIZE=16 CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15" torchrun --nproc_per_node=16 --master_port=1234 finetune.py --micro_batch_size=1 --batch_size=16
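If finetune.py follows the usual alpaca-lora-style batch convention (an assumption on my part; check the actual source), those flags map to gradient accumulation roughly like this:

```python
# Hypothetical illustration of alpaca-lora-style batch math
# (an assumption about finetune.py, not verified against its source).
world_size = 16        # one process per GPU via torchrun
batch_size = 16        # requested global effective batch size
micro_batch_size = 1   # per-step batch on each GPU

grad_accum = batch_size // micro_batch_size  # 16 accumulation steps in total
grad_accum //= world_size                    # split across DDP ranks -> 1
# Effective global batch per optimizer step:
#   micro_batch_size * grad_accum * world_size = 16
print(f"per rank: micro_batch={micro_batch_size}, grad_accum={grad_accum}")
```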
babytdream commented 1 year ago

@arnocandel OK! I see the data/config.json. If I try to fine-tune Llama-2-70B, should I modify parameters like prompt_type?

prompt_type="llama2" NCCL_P2P_LEVEL=LOC WORLD_SIZE=16 CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15" torchrun --nproc_per_node=16 --master_port=1234 finetune.py --micro_batch_size=1 --batch_size=16

  [
    {
      "prompt_type": "llama2",
      "instruction": "Explain the following expert setting for Driverless AI",
      "input": "text.....",
      "output": "text..... "
    }
  ]

And I see your ideas in #574. Your command is:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 finetune.py --data_path=h2oai/openassistant_oasst1_h2ogpt_llama2_chat --base_model=meta-llama/Llama-2-70b-chat-hf --drop_truncations=True --cutoff_len=1024 --micro_batch_size=1 --batch_size=16 --num_epochs=1 --learning_rate=1e-5 --run_id=6 --use_auth_token=True --add_eos_token=True --train_4bit=True &> log.6.txt

Imitating your command, mine is:

NCCL_P2P_LEVEL=LOC WORLD_SIZE=16 CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15" torchrun --nproc_per_node=16 finetune.py --data_path=h2oai/openassistant_oasst1_h2ogpt_llama2_chat --base_model='/data/model/h2ogpt-4096-llama2-70b-chat/' --drop_truncations=True --cutoff_len=1024 --micro_batch_size=1 --batch_size=1 --num_epochs=1 --learning_rate=1e-5 --run_id=6 --use_auth_token=True --add_eos_token=True --train_4bit=True --gradient_accumulation_steps=16 &> log.3.txt

The error shows: torch.cuda.OutOfMemoryError: CUDA out of memory. But I have 16 × 24 GB GPUs, more total memory than 2 × A100 80 GB. (log.3.txt)

Even if I replace the 70B Llama-2 with 13B Llama, the error still occurs: torch.cuda.OutOfMemoryError: CUDA out of memory. (screenshot)

arnocandel commented 1 year ago

Yeah, the above command assumes the model fits on every GPU (extra GPUs are then used to train more mini-batches at once), which the 70B doesn't, not even with 4-bit.
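Back-of-the-envelope numbers (mine, not from the thread) show why a per-GPU replica can't fit:

```python
# Rough memory math for one data-parallel replica (illustrative only).
params = 70e9            # Llama-2-70B parameter count
bytes_per_param = 0.5    # 4-bit quantization = 0.5 bytes per parameter
weights_gb = params * bytes_per_param / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB vs 24 GB on one A10")
# ~35 GB of weights before activations, LoRA gradients, and optimizer
# state, so a full replica cannot fit on a single 24 GB card no matter
# how many such cards you have.
```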

arnocandel commented 1 year ago

#643 would be needed

babytdream commented 1 year ago

I also asked the same question in LLM Studio; they couldn't solve it. @arnocandel, sorry! Can you be more specific? It's a bit difficult for me. Or give me a similar command.