oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Cannot load Pyg-6B with 8GB VRAM with deepspeed on WSL2 #150

Closed sashasubbbb closed 1 year ago

sashasubbbb commented 1 year ago

I've got WSL2 Ubuntu running on Windows 11, configured to use 28 GB of RAM. I've tried both the unsharded model and the model sharded into 1 GB chunks.

free -h --giga

              total        used        free      shared  buff/cache   available
Mem:            28G        108M         27G        0.0K        745M         27G
Swap:          7.2G          0B        7.2G

When I try to load the pyg-6b model with:

deepspeed --num_gpus=1 server.py --deepspeed --cai-chat

I get:

[2023-02-28 19:49:23,376] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-28 19:49:23,448] [INFO] [runner.py:548:main] cmd = /root/miniconda3/envs/textgen/bin/python -u -m deepspeed.launcher.launch --world_info= --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None server.py --deepspeed --cai-chat
[2023-02-28 19:49:25,028] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-02-28 19:49:25,028] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-02-28 19:49:25,029] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-02-28 19:49:25,029] [INFO] [launch.py:162:main] dist_world_size=1
[2023-02-28 19:49:25,029] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-02-28 19:49:29,383] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Warning: chat mode currently becomes somewhat slower with text streaming on. Consider starting the web UI with the --no-stream option.

Loading the extension "gallery"... Ok. The following models are available:

  1. pyg6shard
  2. pygmalion-350m

Which one do you want to load? 1-2

1

Loading pyg6shard...
Loading pyg6shard...
[2023-02-28 19:49:33,464] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 0.54B parameters
Traceback (most recent call last):
  File "/home/user/AI/AItext/oobabooga/text-generation-webui/server.py", line 185, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/user/AI/AItext/oobabooga/text-generation-webui/modules/models.py", line 73, in load_model
    model = AutoModelForCausalLM.from_pretrained(Path(f"models/{shared.model_name}"), torch_dtype=torch.bfloat16 if shared.args.bf16 else torch.float16)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2495, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 727, in __init__
    self.transformer = GPTJModel(config)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 480, in __init__
    self.h = nn.ModuleList([GPTJBlock(config) for _ in range(config.n_layer)])
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 480, in <listcomp>
    self.h = nn.ModuleList([GPTJBlock(config) for _ in range(config.n_layer)])
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 288, in __init__
    self.mlp = GPTJMLP(inner_dim, config)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 268, in __init__
    self.fc_in = nn.Linear(embed_dim, intermediate_size)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 363, in wrapper
    self._post_init_method(module)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 760, in _post_init_method
    param.partition()
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 894, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1038, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 9, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1111, in _partition_param
    partitioned_tensor = get_accelerator().pin_memory(
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/accelerator/cuda_accelerator.py", line 214, in pin_memory
    return tensor.pin_memory()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[2023-02-28 19:49:35,039] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 73
[2023-02-28 19:49:35,040] [ERROR] [launch.py:324:sigkill_handler] ['/root/miniconda3/envs/textgen/bin/python', '-u', 'server.py', '--local_rank=0', '--deepspeed', '--cai-chat'] exits with return code = 1

I've managed to load the pyg-350m model just fine with DeepSpeed. Is DeepSpeed working incorrectly on WSL? Do you have any clue?
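For reference, the traceback above ends inside get_accelerator().pin_memory(), i.e. DeepSpeed is failing while pinning host (page-locked) memory during ZeRO-3 parameter partitioning, not while allocating VRAM. A minimal sketch of just that step in plain PyTorch, with the ~1 GiB size being an arbitrary assumption rather than a value from this report:

import torch

# Allocate ~1 GiB of fp16 host memory and try to pin it, roughly the
# operation DeepSpeed performs in _partition_param() before offloading.
# Under WSL2 this can raise "CUDA error: out of memory" even with free
# VRAM and system RAM, since pinned memory is a separate, limited resource.
t = torch.empty(512 * 1024 * 1024, dtype=torch.float16)
t = t.pin_memory()
print("pinned:", t.is_pinned())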

ye7iaserag commented 1 year ago

In my experience DeepSpeed won't run on WSL, or even over Docker. Maybe someone can prove me wrong.

LTSarc commented 1 year ago

I got it working! It needs a ton of memory, though: the VM needs more than enough RAM assigned to load the whole model into system RAM before it gets dumped to VRAM.

Also, you need to jump through a ton of hoops to get the CUDA toolkit working with conda-forge on WSL (as well as, of course, fixing WSL's issues with DNS passthrough and GPU passthrough...). But... it works. (screenshot attached)

LTSarc commented 1 year ago

I've written a guide on how to do this for total noobs in the context of pygmalion over on the pygmalion subreddit HERE.

TL;DR: WSL2 has a completely broken implementation of DNS and CUDA, and that is the issue. Oh, and the error -9 that pops up a lot with DeepSpeed? That's the VM running out of RAM. You have to reconfigure the amount of RAM the VM gets: DeepSpeed loads the entire model into system RAM before offloading to VRAM, and the default 8-12 GB allocation is too small.
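The RAM reallocation is done through a .wslconfig file in your Windows user profile folder (%UserProfile%\.wslconfig), followed by wsl --shutdown and restarting the distro. A minimal sketch; the numbers are only an example sized for loading Pyg-6B in fp16, not values taken from the guide:

[wsl2]
# Enough RAM to hold the full fp16 model before DeepSpeed offloads it to VRAM;
# the default allocation is too small for Pyg-6B.
memory=28GB
# Swap as a safety net in case RAM still runs short.
swap=8GB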

ye7iaserag commented 1 year ago

I'm using Docker over WSL2 and I was able to run GPT-NeoXT-Chat-Base-20B (38.4 GB on disk) with:

python server.py --auto-devices --gpu-memory 8 --cai-chat --load-in-8bit --listen --listen-port 8888 --model=GPT-NeoXT-Chat-Base-20B

and I didn't need to update the .wslconfig file. Maybe I'm missing something.

ye7iaserag commented 1 year ago

Also, I was getting error code -11, not -9.

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.