Nexesenex opened this issue 2 days ago
Setting environment variables in Windows can be done using both cmd.exe
and PowerShell. Below are the methods for each:
cmd.exe
Set a Temporary Environment Variable: This variable will only be available in the current command prompt session.
set MY_VARIABLE=value
Set a Persistent Environment Variable: setx writes the variable to the current user's environment, so it persists across sessions (note that it does not affect the already-open prompt).
setx MY_VARIABLE "value"
Set a System-Wide (Machine) Environment Variable: the /M switch writes the variable to the machine environment instead of the user's; this requires an elevated (administrator) prompt.
setx MY_VARIABLE "value" /M
PowerShell
Set a Temporary Environment Variable: This variable will only be available in the current PowerShell session.
$env:MY_VARIABLE = "value"
Set a Persistent Environment Variable: This variable will persist after the session ends and be visible to new processes.
For the Current User:
[System.Environment]::SetEnvironmentVariable('MY_VARIABLE', 'value', 'User')
For the Machine (System-Wide):
[System.Environment]::SetEnvironmentVariable('MY_VARIABLE', 'value', 'Machine')
View Environment Variables: You can view the current environment variables using:
Get-ChildItem Env:
Remove an Environment Variable:
For the Current Session:
Remove-Item Env:MY_VARIABLE
Permanently:
[System.Environment]::SetEnvironmentVariable('MY_VARIABLE', $null, 'User')
[System.Environment]::SetEnvironmentVariable('MY_VARIABLE', $null, 'Machine')
cmd.exe
set MY_VARIABLE=HelloWorld
setx MY_VARIABLE "HelloWorld"
PowerShell
$env:MY_VARIABLE = "HelloWorld"
[System.Environment]::SetEnvironmentVariable("MY_VARIABLE", "HelloWorld", "User")
Note: When using setx, the changes will not take effect in the current session. You need to open a new command prompt or PowerShell window to see the changes. By following these steps, you can effectively manage environment variables in both cmd.exe and PowerShell on Windows.
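To confirm the variable actually reaches the Python process that runs TabbyAPI (rather than just echoing it back in the shell), a minimal check like this can help; the variable name is the one discussed below, the rest is a generic sketch:

```python
import os

# Quick check that the variable set via setx / $env: reaches this process.
# Remember: setx only affects processes started AFTER it runs, so open a
# fresh terminal before testing.
value = os.environ.get("PYTORCH_CUDA_ALLOC_CONF")
print(value)  # the configured string, or None if not inherited
```

If this prints None, the variable was never inherited and any syntax debugging further down the stack is moot.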
Hey Doc,
I tried all possible syntaxes for PYTORCH_CUDA_ALLOC_CONF:
expandable_segments:true, expandable_segments=true, expandable_segments:True, expandable_segments=True, expandable_segments:1, expandable_segments=1
Both system-wide and as user. Always the same answer:
pip 24.3.1 from C:\Python310\lib\site-packages\pip (python 3.10)
Loaded your saved preferences from `start_options.json`
Traceback (most recent call last):
File "Q:\GitHub\tabbyAPI\start.py", line 276, in <module>
from main import entrypoint
File "Q:\GitHub\tabbyAPI\main.py", line 11, in <module>
from common import gen_logging, sampling, model
File "Q:\GitHub\tabbyAPI\common\model.py", line 19, in <module>
from backends.exllamav2.model import ExllamaV2Container
File "Q:\GitHub\tabbyAPI\backends\exllamav2\model.py", line 12, in <module>
from exllamav2 import (
File "C:\Python310\lib\site-packages\exllamav2\__init__.py", line 3, in <module>
from exllamav2.model import ExLlamaV2
File "C:\Python310\lib\site-packages\exllamav2\model.py", line 41, in <module>
from exllamav2.attn import ExLlamaV2Attention, has_flash_attn, has_xformers
File "C:\Python310\lib\site-packages\exllamav2\attn.py", line 38, in <module>
is_ampere_or_newer_gpu = any(torch.cuda.get_device_properties(i).major >= 8 for i in range(torch.cuda.device_count()))
File "C:\Python310\lib\site-packages\exllamav2\attn.py", line 38, in <genexpr>
is_ampere_or_newer_gpu = any(torch.cuda.get_device_properties(i).major >= 8 for i in range(torch.cuda.device_count()))
File "C:\Python310\lib\site-packages\torch\cuda\__init__.py", line 465, in get_device_properties
_lazy_init() # will define _get_device_properties
File "C:\Python310\lib\site-packages\torch\cuda\__init__.py", line 314, in _lazy_init
torch._C._cuda_init()
RuntimeError: Expected a single True/False argument for expandable_segments
Press any key to continue . . .
So I tried to modify start.py in TabbyAPI to be sure of having a correct syntax. And here's what I got at startup:
pip 24.3.1 from C:\Python310\lib\site-packages\pip (python 3.10)
Loaded your saved preferences from `start_options.json`
Starting TabbyAPI...
INFO: ExllamaV2 version: 0.2.3
WARNING: Disabling authentication makes your instance vulnerable. Set the `disable_auth` flag to False in config.yml if
you want to share this instance with others.
INFO: Generation logging is disabled
WARNING: Draft model is disabled because a model name wasn't provided. Please check your config.yml!
WARNING: The given cache_size (10240) is smaller than the desired context length.
WARNING: Overriding cache_size to max_seq_len.
WARNING: The given cache_size (131072) is less than 2 * max_seq_len and may be too small for requests using CFG.
WARNING: Ignore this warning if you do not plan on using CFG.
INFO: Attempting to load a prompt template if present.
INFO: Using template "from_tokenizer_config" for chat completions.
INFO: Loading model: X:\TGW\models\Mistral-Large-Instruct-2407-3.9bpw-h6-exl2-0.2.3
INFO: Loading with tensor parallel
C:\Python310\lib\site-packages\exllamav2\stloader.py:157: UserWarning: expandable_segments not supported on this
platform (Triggered internally at
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\c10/cuda/CUDAAllocatorConfig.h:28.)
tensor = torch.zeros(shape, dtype = dtype, device = device)
And finally, the load crashes:
Loading model modules ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 179/179 0:00:00
Traceback (most recent call last):
File "Q:\GitHub\tabbyAPI\start.py", line 298, in <module>
entrypoint(converted_args)
File "Q:\GitHub\tabbyAPI\main.py", line 164, in entrypoint
asyncio.run(entrypoint_async())
File "C:\Python310\lib\asyncio\runners.py", line 44, in run
return loop.run_until_complete(main)
File "C:\Python310\lib\asyncio\base_events.py", line 649, in run_until_complete
return future.result()
File "Q:\GitHub\tabbyAPI\main.py", line 70, in entrypoint_async
await model.load_model(
File "Q:\GitHub\tabbyAPI\common\model.py", line 101, in load_model
async for _ in load_model_gen(model_path, **kwargs):
File "Q:\GitHub\tabbyAPI\common\model.py", line 80, in load_model_gen
async for module, modules in load_status:
File "Q:\GitHub\tabbyAPI\backends\exllamav2\model.py", line 534, in load_gen
async for value in iterate_in_threadpool(model_load_generator):
File "Q:\GitHub\tabbyAPI\common\concurrency.py", line 30, in iterate_in_threadpool
yield await asyncio.to_thread(gen_next, generator)
File "C:\Python310\lib\asyncio\threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
File "C:\Python310\lib\concurrent\futures\thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "Q:\GitHub\tabbyAPI\common\concurrency.py", line 20, in gen_next
return next(generator)
File "C:\Python310\lib\site-packages\torch\utils\_contextlib.py", line 57, in generator_context
response = gen.send(request)
File "Q:\GitHub\tabbyAPI\backends\exllamav2\model.py", line 643, in load_model_sync
self.cache = self.create_cache(
File "Q:\GitHub\tabbyAPI\backends\exllamav2\model.py", line 691, in create_cache
return ExLlamaV2Cache_TP(
File "C:\Python310\lib\site-packages\exllamav2\cache.py", line 680, in __init__
self.caches = [
File "C:\Python310\lib\site-packages\exllamav2\cache.py", line 681, in <listcomp>
base(
File "C:\Python310\lib\site-packages\exllamav2\cache.py", line 596, in __init__
super().__init__(
File "C:\Python310\lib\site-packages\exllamav2\cache.py", line 441, in __init__
self.create_state_tensors(copy_from, lazy)
File "C:\Python310\lib\site-packages\exllamav2\cache.py", line 91, in create_state_tensors
p_key_states = torch.zeros(self.shape_wk, dtype = self.dtype, device = device).contiguous()
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB. GPU 0 has a total capacity of 24.00 GiB of which 0 bytes is free. Of the allocated memory 22.30 GiB is allocated by PyTorch, and 967.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Press any key to continue . .
I searched for that error, and it seems to be quite common with Pytorch recently. https://github.com/pytorch/pytorch/issues/122057 https://github.com/pytorch/torchtune/issues/1185
That comment is interesting: https://github.com/pytorch/pytorch/issues/122057#issuecomment-2315966315
@galv I am not explicitly setting TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=true. Expandable segments simply stopped working in PyTorch 2.2 due to the refactor https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDAAllocatorConfig.h#L28. PyTorch 2.1.2 is the last version that works for me with expandable segments -- upgrading to 2.2+ gives this warning and expandable segments are not enabled (and I get OOMs).
I’d have to try loading with the actual env var set to see if there’s an issue with the syntax - I only tested being able to set the var and echo it back.
Regarding the final error at the bottom - this is just a simple OOM error. Are you sure you aren’t just running out of memory with your configuration? The expandable segments thing only saves a small amount of vram anyways.
I notice that you are trying to manually specify a cache size of 10240, however it is being automatically overridden to match max seq len at 131072 because a cache size less than max seq len is not a sane setting. Did you mean to load the model with only 10240 context to take up less vram?
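For rough intuition on how much the override matters, here's a back-of-the-envelope KV-cache estimate. It assumes Mistral Large 2's published config (88 layers, 8 KV heads, head_dim 128) and a full-precision FP16 cache; quantized caches and tensor-parallel splitting change the per-GPU numbers, so treat this as an upper-bound total:

```python
# Back-of-the-envelope KV-cache sizing, assuming Mistral Large 2's published
# config (88 layers, 8 KV heads, head_dim 128) and an FP16 (2-byte) cache.
layers, kv_heads, head_dim, dtype_bytes = 88, 8, 128, 2

def kv_cache_gib(cache_size_tokens: int) -> float:
    # 2x for keys + values
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return cache_size_tokens * per_token / 1024**3

print(f"{kv_cache_gib(10240):.1f} GiB")   # intended cache_size -> ~3.4 GiB
print(f"{kv_cache_gib(131072):.1f} GiB")  # after override to max_seq_len -> 44.0 GiB
```

An order-of-magnitude jump like that easily explains an OOM that expandable segments could never paper over.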
I got the model (3.9bpw) to work several times on TabbyAPI yesterday; then it stopped working with the usual OOMs, even after a reboot, so I decided to dig into the problem. I'm usually a user of the Llama.CPP ecosystem. I tried ExLlama several times, but always had these memory-management problems (on Windows 11, and before that Windows 10). Sometimes it worked, sometimes it didn't, for no obvious reason. I'm used to handling my memory needs very tightly in the LCPP ecosystem, but I never could make sense of what happens with Torch/PyTorch, and that's what kept me from regularly using ExllamaV2.
OH, LOL. My mistake: I just read about the base context. I deleted the max seq len yesterday. Lololol. There's still a problem with expandable segments, but.. :D
# Overrides base model context length (default: Empty).
# WARNING: Don't set this unless you know what you're doing!
# Again, do NOT use this for configuring context length, use max_seq_len above ^
override_base_seq_len:
I thought the prompt cache alone dictated the context size and its associated cache, not the max seq length. I deleted it, then forgot about it.. I'm gonna test right now.
I have to reinstall Torch as well, because I messed up my install. I'll notify you as soon as it works.
And it works again. x)
P.S.: Thank you for your help!
Yeah, don’t use override_base_seq_len. This is a very old feature added for a very niche reason: it sets the model’s effective seq len to use for automatic rope scaling (i.e. the oldest Mistral 7B having a nominal max seq len of 32k but really only working up to around 8-9k before breaking down).
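To illustrate what that override feeds into, here is a hypothetical sketch of how an automatic rope scale could be derived from an effective base context (linear scaling shown for simplicity; ExLlamaV2's actual heuristic may differ):

```python
# Illustrative only: derive a rope scaling factor when the requested
# max_seq_len exceeds the model's effective base context.
def rope_scale(max_seq_len: int, base_seq_len: int) -> float:
    # Never scale below 1.0 (shorter contexts need no scaling)
    return max(1.0, max_seq_len / base_seq_len)

print(rope_scale(32768, 8192))  # -> 4.0
```

Lying about base_seq_len therefore distorts the scaling factor for the whole run, which is why the config warns against touching it.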
Problem
The sign "=" is not supported in Windows environment variables. Thus, PYTORCH_CUDA_ALLOC_CONF=expandable_segments cannot be used on that platform.
Solution
Could you please either give me an alternative route I might have overlooked, or, if possible, allow setting PyTorch memory parameters through the config.yaml of TabbyAPI?
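For reference, one stopgap route I can imagine (a minimal, untested sketch, assuming the variable only needs to exist before torch is first imported, since PYTORCH_CUDA_ALLOC_CONF is read once at CUDA initialization) would be setting it from Python itself at the top of the launcher:

```python
import os

# Hypothetical stopgap: set the allocator config from Python before the
# first `import torch` anywhere in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# import torch  # safe only after the assignment above
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

This sidesteps the shell entirely, so any quoting or "=" quirks of cmd.exe/setx no longer matter.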
Alternatives
No response
Explanation
To allow better compatibility with Windows, since PyTorch memory management is always a bit tricky there.
Examples
Here's my current log of TabbyAPI
I'm sorry if I'm posting in the wrong place, but because it's PyTorch AND exllamav2 related, posting here seemed sensible.
Additional context
No response
Acknowledgements