Describe the bug
I followed the instructions exactly to run FlexGen OPT models: I renamed and converted the models and adjusted my command-line arguments accordingly. I tried this with the official OPT 6.7b, Nerybus 6.7b, and Nerybus 2.7b. I've checked that opt-config.json doesn't appear to have any issues, and I've reinstalled the text-generation-webui requirements.txt. I've also tried different parameters such as --pin-weight false, explicit GPU and CPU memory limits, and --load-in-8bit, and it still does not work. The only explanation I can think of is my GPU not coping with the strain of LLMs, but I can run 2.7b models unquantized (with limited tokens) and up to 13b models quantized (very slowly); what I actually want is a 6-7b model as a good middle ground between quality and speed, and apart from quantized LLaMA models loaded with --pre-layer, I have not managed to run one. There are no errors when converting the model, only when trying to load it.
As stated, I do not run out of memory loading models otherwise, and I've tried modifying the --percent values to no avail. The wiki article heavily implies I should be able to load a 6.7b model with 4GB of VRAM (which would make a 2.7b model a cakewalk), seeing as it states the author loaded a 13b model with 2GB. If VRAM is the issue, the wiki article needs to state more clearly that you need more VRAM to load the model before usage can be brought down to a lower value.
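One detail that might help narrow this down (my own guess, not something from the FlexGen docs): judging by the traceback in the logs below, the out-of-memory error is raised while FlexGen allocates a 2GB pinned host buffer in copy_worker_func, not while putting weights on the GPU. A minimal check, assuming a working PyTorch + CUDA install, would be:

import torch

GB = 1 << 30
# Mirrors the allocation in flexgen/pytorch_backend.py's copy_worker_func:
# 1G float16 elements = 2GB of pinned (page-locked) host memory. If this line
# alone fails with "CUDA error: out of memory", the limit being hit is pinned
# host memory rather than VRAM for the model weights.
cpu_buf = torch.empty((1 * GB,), dtype=torch.float16, pin_memory=True)
print("pinned", cpu_buf.numel() * cpu_buf.element_size() // GB, "GB of host memory")

If that line works on its own, the pinned buffer probably isn't the culprit by itself.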
Is there an existing issue for this?
[X] I have searched the existing issues
Reproduction
Renamed the model to opt-6.7b as per FlexGen's hard-coded model names
Converted to FlexGen format using python convert-to-flexgen.py models/opt-6.7b
Attempted to run with the command-line parameters --flexgen --compress-weight --model opt-6.7b
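For completeness, the kind of variation I tried looks roughly like this (the values are illustrative rather than a known-good configuration; as I understand it, FlexGen's six --percent numbers are the GPU/CPU split for weights, KV cache, and activations):

python server.py --model opt-6.7b --flexgen --compress-weight --percent 0 100 100 0 100 0 --pin-weight false

Every combination I tried fails with the errors shown in the logs below.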
Screenshot
No response
Logs
Starting the web UI...
Gradio HTTP request redirected to localhost :)
Loading opt-6.7b...
Exception in thread Thread-3 (copy_worker_func):
Traceback (most recent call last):
File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\pytorch_backend.py", line 882, in copy_worker_func
cpu_buf = torch.empty((1 * GB,), dtype=torch.float16, pin_memory=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "C:\Users\user\Downloads\oobabooga-windows\text-generation-webui\server.py", line 921, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "C:\Users\user\Downloads\oobabooga-windows\text-generation-webui\modules\models.py", line 84, in load_model
model = OptLM(f"facebook/{model_name}", env, shared.args.model_dir, policy)
File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\flex_opt.py", line 637, in __init__
self.init_all_weights()
File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\flex_opt.py", line 800, in init_all_weights
self.init_weight(j)
File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\flex_opt.py", line 651, in init_weight
self.layers[j].init_weight(self.weight_home[j], expanded_path)
File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\flex_opt.py", line 158, in init_weight
weights = init_weight_list(weight_specs, self.policy, self.env)
File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\flex_opt.py", line 124, in init_weight_list
weight.load_from_np_file(weight_specs[i][2])
File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\pytorch_backend.py", line 125, in load_from_np_file
self.load_from_np(np.load(filename))
File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\pytorch_backend.py", line 117, in load_from_np
general_copy(self, None, tmp, None)
File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\pytorch_backend.py", line 830, in general_copy
general_copy_compressed(dst, dst_indices, src, src_indices)
File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\compression.py", line 214, in general_copy_compressed
general_copy(dst.data[0], dst_data_indices, src.data[0], src_data_indices)
File "C:\Users\user\Downloads\oobabooga-windows\installer_files\env\lib\site-packages\flexgen\pytorch_backend.py", line 855, in general_copy
dst.copy_(src, non_blocking=True)
RuntimeError: The size of tensor a (25136) must match the size of tensor b (25133) at non-singleton dimension 0
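A note on the second traceback: the size mismatch (25136 vs 25133) happens while general_copy_compressed copies a converted weight into a compressed buffer, so I wonder whether the converted numpy weights line up with what --compress-weight expects. A rough check I could run, with the folder name being my guess at where convert-to-flexgen.py writes its output:

import glob, os
import numpy as np

# Hypothetical path: point this at wherever convert-to-flexgen.py actually
# wrote the converted numpy weights for this model.
weight_dir = "models/opt-6.7b-np"
for path in sorted(glob.glob(os.path.join(weight_dir, "*"))):
    if not os.path.isfile(path):
        continue
    arr = np.load(path, mmap_mode="r")  # mmap so multi-GB files aren't read into RAM
    print(os.path.basename(path), arr.shape, arr.dtype)

Comparing those shapes against the hidden size and vocab size in opt-config.json would at least rule out a bad conversion.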
System Info