oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Flexgen fails to load #1470

Closed. just-someguy closed this issue 1 year ago.

just-someguy commented 1 year ago

Describe the bug

I followed the instructions exactly to run FlexGen OPT models: I renamed and converted the models and changed my command-line arguments accordingly. I tried this with the official OPT 6.7b, Nerybus 6.7b, and Nerybus 2.7b. I've checked that opt-config.json doesn't appear to have any issues, and I reinstalled the text-generation-webui requirements (requirements.txt). I've also tried different parameters, such as --pin-weight false, explicit GPU and CPU memory limits, and --load-in-8bit, and it still does not work.

The only cause I can think of is my GPU struggling under the torture of LLMs, but I can run 2.7b models unquantized (with limited tokens) and up to 13b models quantized (very slowly). I'm trying to run a 6-7b model as a good median of quality and speed, and with the exception of quantized LLaMA models using --pre-layer, I've failed to do so. There are no errors when converting the model, only when trying to load it. As stated, I do not run out of memory loading models otherwise, and I've tried modifying the --percent values to no avail.

The wiki article heavily implies I should be able to load a 6.7b model with 4GB of VRAM (which would make a 2.7b model a cakewalk), since it states the author loaded a 13b model with 2GB. If VRAM is the issue, the wiki article should be rewritten to make clear that you need more VRAM to load the model before its usage can be brought down to a lower value.
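
For context, my rough back-of-envelope on weight sizes (my own arithmetic, not taken from the wiki):

```python
# Back-of-envelope weight footprint for OPT-6.7B (estimate only; ignores
# the per-group scale overhead added by FlexGen's 4-bit compression).
params = 6.7e9
print(f"fp16:  {params * 2 / 2**30:.1f} GiB")    # ~12.5 GiB uncompressed
print(f"4-bit: {params * 0.5 / 2**30:.1f} GiB")  # ~3.1 GiB with --compress-weight
```

So even fully offloaded, the compressed weights should fit comfortably in 16 GB of system RAM, which is part of why the errors below surprised me.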

Is there an existing issue for this?

Reproduction

1. Renamed the model to `opt-6.7b`, as per FlexGen's hard-coded model names
2. Converted the model to FlexGen format: `python convert-to-flexgen.py models/opt-6.7b`
3. Attempted to run with the command-line parameters `--flexgen --compress-weight --model opt-6.7b`

Screenshot

No response

Logs

```
Starting the web UI...
Gradio HTTP request redirected to localhost :)
Loading opt-6.7b...
Exception in thread Thread-3 (copy_worker_func):
Traceback (most recent call last):
  File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\pytorch_backend.py", line 882, in copy_worker_func
    cpu_buf = torch.empty((1 * GB,), dtype=torch.float16, pin_memory=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
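
Despite the CUDA wording, the allocation that fails here is host-side: `copy_worker_func` asks for a staging buffer of 2^30 fp16 elements (2 GiB) of pinned, page-locked RAM, and pinning can fail even with free VRAM if the OS can't lock that much memory. A standalone way to test just that allocation outside the webui (my own snippet, not part of the repo):

```python
# Check that 2 GiB of pinned (page-locked) host memory can be allocated --
# this mirrors the exact call in FlexGen's copy_worker_func above.
import torch

GB = 1 << 30
try:
    buf = torch.empty((1 * GB,), dtype=torch.float16, pin_memory=True)
    print(f"OK: pinned {buf.numel() * buf.element_size() / GB:.0f} GiB")
except RuntimeError as e:
    print(f"pinned allocation failed: {e}")
```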

```
Traceback (most recent call last):
  File "C:\Users\user\Downloads\oobabooga-windows\text-generation-webui\server.py", line 921, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "C:\Users\user\Downloads\oobabooga-windows\text-generation-webui\modules\models.py", line 84, in load_model
    model = OptLM(f"facebook/{model_name}", env, shared.args.model_dir, policy)
  File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\flex_opt.py", line 637, in __init__
    self.init_all_weights()
  File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\flex_opt.py", line 800, in init_all_weights
    self.init_weight(j)
  File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\flex_opt.py", line 651, in init_weight
    self.layers[j].init_weight(self.weight_home[j], expanded_path)
  File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\flex_opt.py", line 158, in init_weight
    weights = init_weight_list(weight_specs, self.policy, self.env)
  File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\flex_opt.py", line 124, in init_weight_list
    weight.load_from_np_file(weight_specs[i][2])
  File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\pytorch_backend.py", line 125, in load_from_np_file
    self.load_from_np(np.load(filename))
  File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\pytorch_backend.py", line 117, in load_from_np
    general_copy(self, None, tmp, None)
  File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\pytorch_backend.py", line 830, in general_copy
    general_copy_compressed(dst, dst_indices, src, src_indices)
  File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\compression.py", line 214, in general_copy_compressed
    general_copy(dst.data[0], dst_data_indices, src.data[0], src_data_indices)
  File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\pytorch_backend.py", line 855, in general_copy
    dst.copy_(src, non_blocking=True)
RuntimeError: The size of tensor a (25136) must match the size of tensor b (25133) at non-singleton dimension 0
```

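The second traceback (25136 vs 25133 elements) looks like a truncated or mismatched converted weight file rather than a memory problem. A quick way to eyeball the converted shards (the directory name is my guess; point it at wherever convert-to-flexgen.py wrote its output on your machine):

```python
# Print every converted FlexGen weight shard with its shape and dtype,
# to spot a truncated .npy file.
import pathlib
import numpy as np

weight_dir = pathlib.Path("models/opt-6.7b-np")  # hypothetical path -- adjust
for f in sorted(weight_dir.rglob("*.npy")):
    arr = np.load(f, mmap_mode="r")  # mmap so nothing is loaded into RAM
    print(f.name, arr.shape, arr.dtype)
```
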
System Info

Windows 11
16GB RAM
RTX 3050 Ti Mobile (4 GB VRAM)
Tom-Neverwinter commented 1 year ago

What CPU do you have? Is it AMD or Intel?

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.