turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License
3.62k stars 279 forks

[REQUEST] Alternative way to the Pytorch environment variables on Windows to set Pytorch memory management parameters #664

Open Nexesenex opened 2 days ago

Nexesenex commented 2 days ago

Problem

The "=" sign is not supported in Windows environment variables. Thus, PYTORCH_CUDA_ALLOC_CONF=expandable_segments cannot be used on that platform.

Solution

Could you please either point me to an alternative route I might have overlooked, or, if possible, allow setting PyTorch memory parameters through the config.yaml of TabbyAPI?
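For what it's worth, one shell-independent route (a sketch only, not an existing TabbyAPI option) is setting the variable from Python itself, in a small wrapper that runs before torch is imported:

```python
# Illustrative workaround: set the allocator config inside the Python
# process before torch is imported, so no Windows shell quoting is
# involved. The hand-off at the bottom is hypothetical.
import os

# Must run before the first `import torch` in the process; PyTorch reads
# PYTORCH_CUDA_ALLOC_CONF during CUDA initialization.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# import start  # then hand off to the real entry point
```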

Alternatives

No response

Explanation

To allow better compatibility with Windows, where PyTorch memory management has always been a bit tricky.

Examples

Here's my current TabbyAPI log:

    p_value_states = torch.zeros(self.shape_wv, dtype = self.dtype, device = device).contiguous()
    torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB. GPU 0 has a total capacity of 24.00 GiB of which 0 bytes is free. Of the allocated memory 21.11 GiB is allocated by PyTorch, and **2.12 GiB is reserved by PyTorch but unallocated**. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I'm sorry if I'm posting in the wrong place, but since this is both PyTorch and exllamav2 related, posting here seemed sensible.

Additional context

No response


DocShotgun commented 2 days ago

Setting environment variables in Windows can be done using both cmd.exe and PowerShell. Below are the methods for each:

Using cmd.exe

  1. Set a Temporary Environment Variable: This variable will only be available in the current command prompt session.

    set MY_VARIABLE=value
  2. Set a Persistent User Environment Variable: This variable will persist across sessions for the current user.

    setx MY_VARIABLE "value"
  3. Set a System-Wide Environment Variable: This variable will be available for all users and requires an elevated (administrator) prompt.

    setx MY_VARIABLE "value" /M

Using PowerShell

  1. Set a Temporary Environment Variable: This variable will only be available in the current PowerShell session.

    $env:MY_VARIABLE = "value"
  2. Set a Persistent Environment Variable: This variable will persist after the session ends.

    • For the Current User:

      [System.Environment]::SetEnvironmentVariable('MY_VARIABLE', 'value', 'User')
    • For the Machine (System-Wide):

      [System.Environment]::SetEnvironmentVariable('MY_VARIABLE', 'value', 'Machine')

  3. View Environment Variables: You can view the current environment variables using:

    Get-ChildItem Env:
  4. Remove an Environment Variable:

    • For the Current Session:

      Remove-Item Env:MY_VARIABLE
    • Permanently:

      [System.Environment]::SetEnvironmentVariable('MY_VARIABLE', $null, 'User')
      [System.Environment]::SetEnvironmentVariable('MY_VARIABLE', $null, 'Machine')

Example Usage

cmd.exe

set MY_VARIABLE=HelloWorld
setx MY_VARIABLE "HelloWorld"

PowerShell

$env:MY_VARIABLE = "HelloWorld"
[System.Environment]::SetEnvironmentVariable("MY_VARIABLE", "HelloWorld", "User")

Notes

Note that setx does not modify the current session; variables set with setx only appear in newly started shells. By following these steps, you can effectively manage environment variables in both cmd.exe and PowerShell on Windows.
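A quick sanity check (illustrative) is to confirm the variable actually reaches the Python process that launches TabbyAPI, since setx only affects shells started after it runs:

```python
# Hypothetical sanity check: print what the Python process actually sees.
# A terminal opened before `setx` ran will report "<not set>".
import os

value = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "<not set>")
print(f"PYTORCH_CUDA_ALLOC_CONF = {value}")
```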

Nexesenex commented 2 days ago

Hey Doc,

I tried all possible syntaxes for PYTORCH_CUDA_ALLOC_CONF:

    expandable_segments:true
    expandable_segments=true
    expandable_segments:True
    expandable_segments=True
    expandable_segments:1
    expandable_segments=1

Both system-wide and as user. Always the same answer:

pip 24.3.1 from C:\Python310\lib\site-packages\pip (python 3.10)
Loaded your saved preferences from `start_options.json`
Traceback (most recent call last):
  File "Q:\GitHub\tabbyAPI\start.py", line 276, in <module>
    from main import entrypoint
  File "Q:\GitHub\tabbyAPI\main.py", line 11, in <module>
    from common import gen_logging, sampling, model
  File "Q:\GitHub\tabbyAPI\common\model.py", line 19, in <module>
    from backends.exllamav2.model import ExllamaV2Container
  File "Q:\GitHub\tabbyAPI\backends\exllamav2\model.py", line 12, in <module>
    from exllamav2 import (
  File "C:\Python310\lib\site-packages\exllamav2\__init__.py", line 3, in <module>
    from exllamav2.model import ExLlamaV2
  File "C:\Python310\lib\site-packages\exllamav2\model.py", line 41, in <module>
    from exllamav2.attn import ExLlamaV2Attention, has_flash_attn, has_xformers
  File "C:\Python310\lib\site-packages\exllamav2\attn.py", line 38, in <module>
    is_ampere_or_newer_gpu = any(torch.cuda.get_device_properties(i).major >= 8 for i in range(torch.cuda.device_count()))
  File "C:\Python310\lib\site-packages\exllamav2\attn.py", line 38, in <genexpr>
    is_ampere_or_newer_gpu = any(torch.cuda.get_device_properties(i).major >= 8 for i in range(torch.cuda.device_count()))
  File "C:\Python310\lib\site-packages\torch\cuda\__init__.py", line 465, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "C:\Python310\lib\site-packages\torch\cuda\__init__.py", line 314, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Expected a single True/False argument for expandable_segments
Press any key to continue . . .

So I tried modifying start.py in TabbyAPI to make sure the syntax was correct. And here's what I got at startup:

pip 24.3.1 from C:\Python310\lib\site-packages\pip (python 3.10)
Loaded your saved preferences from `start_options.json`
Starting TabbyAPI...
INFO:     ExllamaV2 version: 0.2.3
WARNING:  Disabling authentication makes your instance vulnerable. Set the `disable_auth` flag to False in config.yml if
you want to share this instance with others.
INFO:     Generation logging is disabled
WARNING:  Draft model is disabled because a model name wasn't provided. Please check your config.yml!
WARNING:  The given cache_size (10240) is smaller than the desired context length.
WARNING:  Overriding cache_size to max_seq_len.
WARNING:  The given cache_size (131072) is less than 2 * max_seq_len and may be too small for requests using CFG.
WARNING:  Ignore this warning if you do not plan on using CFG.
INFO:     Attempting to load a prompt template if present.
INFO:     Using template "from_tokenizer_config" for chat completions.
INFO:     Loading model: X:\TGW\models\Mistral-Large-Instruct-2407-3.9bpw-h6-exl2-0.2.3
INFO:     Loading with tensor parallel
C:\Python310\lib\site-packages\exllamav2\stloader.py:157: UserWarning: expandable_segments not supported on this
platform (Triggered internally at
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\c10/cuda/CUDAAllocatorConfig.h:28.)
  tensor = torch.zeros(shape, dtype = dtype, device = device)

And finally, the load crashes:

Loading model modules ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 179/179 0:00:00
Traceback (most recent call last):
  File "Q:\GitHub\tabbyAPI\start.py", line 298, in <module>
    entrypoint(converted_args)
  File "Q:\GitHub\tabbyAPI\main.py", line 164, in entrypoint
    asyncio.run(entrypoint_async())
  File "C:\Python310\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Python310\lib\asyncio\base_events.py", line 649, in run_until_complete
    return future.result()
  File "Q:\GitHub\tabbyAPI\main.py", line 70, in entrypoint_async
    await model.load_model(
  File "Q:\GitHub\tabbyAPI\common\model.py", line 101, in load_model
    async for _ in load_model_gen(model_path, **kwargs):
  File "Q:\GitHub\tabbyAPI\common\model.py", line 80, in load_model_gen
    async for module, modules in load_status:
  File "Q:\GitHub\tabbyAPI\backends\exllamav2\model.py", line 534, in load_gen
    async for value in iterate_in_threadpool(model_load_generator):
  File "Q:\GitHub\tabbyAPI\common\concurrency.py", line 30, in iterate_in_threadpool
    yield await asyncio.to_thread(gen_next, generator)
  File "C:\Python310\lib\asyncio\threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "C:\Python310\lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "Q:\GitHub\tabbyAPI\common\concurrency.py", line 20, in gen_next
    return next(generator)
  File "C:\Python310\lib\site-packages\torch\utils\_contextlib.py", line 57, in generator_context
    response = gen.send(request)
  File "Q:\GitHub\tabbyAPI\backends\exllamav2\model.py", line 643, in load_model_sync
    self.cache = self.create_cache(
  File "Q:\GitHub\tabbyAPI\backends\exllamav2\model.py", line 691, in create_cache
    return ExLlamaV2Cache_TP(
  File "C:\Python310\lib\site-packages\exllamav2\cache.py", line 680, in __init__
    self.caches = [
  File "C:\Python310\lib\site-packages\exllamav2\cache.py", line 681, in <listcomp>
    base(
  File "C:\Python310\lib\site-packages\exllamav2\cache.py", line 596, in __init__
    super().__init__(
  File "C:\Python310\lib\site-packages\exllamav2\cache.py", line 441, in __init__
    self.create_state_tensors(copy_from, lazy)
  File "C:\Python310\lib\site-packages\exllamav2\cache.py", line 91, in create_state_tensors
    p_key_states = torch.zeros(self.shape_wk, dtype = self.dtype, device = device).contiguous()
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB. GPU 0 has a total capacity of 24.00 GiB of which 0 bytes is free. Of the allocated memory 22.30 GiB is allocated by PyTorch, and 967.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Press any key to continue . .

I searched for that error, and it seems to be quite common with Pytorch recently. https://github.com/pytorch/pytorch/issues/122057 https://github.com/pytorch/torchtune/issues/1185

That comment is interesting : https://github.com/pytorch/pytorch/issues/122057#issuecomment-2315966315

@galv I am not explicitly setting TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=true. Expandable segments simply stopped working in PyTorch 2.2 due to the refactor https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDAAllocatorConfig.h#L28. PyTorch 2.1.2 is the last version that works for me with expandable segments -- upgrading to 2.2+ gives this warning and expandable segments are not enabled (and I get OOMs).
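If that regression applies, a launcher-side version check (purely illustrative; the 2.2 boundary and the 2.1.2 known-good version come from the comment above) could flag affected installs before loading:

```python
# Hypothetical guard: warn when running a PyTorch build in which
# expandable_segments reportedly regressed on Windows (2.2+ per the
# linked issue; 2.1.2 is the last version reported to work).
from importlib.metadata import version


def expandable_segments_may_be_broken(torch_version: str) -> bool:
    # "2.5.1+cu121" -> (2, 5); compare against the reported boundary.
    major, minor = (int(x) for x in torch_version.split(".")[:2])
    return (major, minor) >= (2, 2)


if __name__ == "__main__":
    try:
        v = version("torch")
    except Exception:
        v = None
    if v and expandable_segments_may_be_broken(v):
        print(f"torch {v}: expandable_segments may be ignored on Windows")
```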

DocShotgun commented 2 days ago

I’d have to try loading with the actual env var set to see if there’s an issue with the syntax - I only tested being able to set the var and echo it back.

Regarding the final error at the bottom - this is just a simple OOM error. Are you sure you aren’t just running out of memory with your configuration? The expandable segments thing only saves a small amount of vram anyways.

I notice that you are trying to manually specify a cache size of 10240, however it is being automatically overridden to match max seq len at 131072 because a cache size less than max seq len is not a sane setting. Did you mean to load the model with only 10240 context to take up less vram?
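For reference, the low-VRAM intent would presumably be expressed by capping the context itself rather than only the cache (key names assumed from TabbyAPI's sample config.yml; values illustrative):

```yaml
# Illustrative fragment: cap the context length itself, so cache_size is
# not silently overridden upward to the model's native max_seq_len.
max_seq_len: 10240
cache_size: 10240
```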

Nexesenex commented 2 days ago

I got the model working several times on TabbyAPI at 3.9bpw yesterday; then it stopped working with the usual OOMs, even after a reboot, so I decided to dig into the problem. I'm usually a user of the Llama.CPP ecosystem. I've tried Exllama several times, but always had these memory management problems (on Windows 11, and before that Windows 10). Sometimes it worked, sometimes it didn't, for no obvious reason. I'm used to managing my memory needs very tightly in the LCPP ecosystem, but I could never make sense of what happens with Torch/PyTorch, and that's what kept me from regularly using ExllamaV2.

OH, LOL. My mistake, I just read about the base context. I deleted the max seq len yesterday. Lololol. There's still a problem with expandable segments, but.. :D

  # Overrides base model context length (default: Empty).
  # WARNING: Don't set this unless you know what you're doing!
  # Again, do NOT use this for configuring context length, use max_seq_len above ^
  override_base_seq_len:

I thought the prompt cache alone dictated the context size and its associated cache, not the max seq len. I deleted it, then forgot about it. I'm going to test right now.

I have to reinstall Torch as well, because I messed up my install. I'll notify you as soon as it works.

And it works again. x)

P.S : Thank you for your help!

DocShotgun commented 1 day ago

Yeah, don’t use override_base_seq_len. This is a very old feature added for a very niche reason - it’s for setting the model’s effective seq len to use for automatic rope scaling (i.e. the oldest Mistral 7B having a max seq len of 32k but really only working up to around 8-9k before breaking down).