henk717 / KoboldAI

KoboldAI is generative AI software optimized for fictional use, but capable of much more!
http://koboldai.com
GNU Affero General Public License v3.0

CUDA error upon attempting to change the loaded model while using HF 4bit #425

Open · Alephrin opened this issue 1 year ago

Alephrin commented 1 year ago

Attempting to load a new model after the first one while using HF 4bit results in a CUDA error:

ERROR      | modeling.inference_models.hf_torch:_get_model:402 - Lazyloader failed, falling back to stock HF load. You may run out of RAM here.
ERROR      | modeling.inference_models.hf_torch:_get_model:403 - CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

ERROR      | modeling.inference_models.hf_torch:_get_model:404 - Traceback (most recent call last):
  File "/home/***/AI/KoboldAI/modeling/inference_models/hf_torch.py", line 392, in _get_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/***/AI/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/hf_bleeding_edge/__init__.py", line 59, in from_pretrained
    return AM.from_pretrained(path, *args, **kwargs)
  File "/home/***/AI/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
    return model_class.from_pretrained(
  File "/home/***/AI/KoboldAI/modeling/patches.py", line 92, in new_from_pretrained
    return old_from_pretrained(
  File "/home/***/AI/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/***/AI/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3260, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/***/AI/KoboldAI/modeling/patches.py", line 302, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(
  File "/home/***/AI/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/transformers/utils/bitsandbytes.py", line 109, in set_module_quantized_tensor_to_device
    new_value = value.to(device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
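
For context, the failure occurs on the second from_pretrained call once a 4-bit model is already resident on the GPU. Below is a minimal standalone sketch of that reload pattern using plain transformers + bitsandbytes; the model name and the explicit cleanup steps are illustrative assumptions, not taken from KoboldAI's code:

# Hypothetical standalone sketch of the reload pattern described above.
# Model name and cleanup steps are assumptions for illustration only.
import gc
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)

# First 4-bit load succeeds.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m", quantization_config=quant_config, device_map="auto"
)

# Try to release the first model before switching, as a model change would.
del model
gc.collect()
torch.cuda.empty_cache()

# Second 4-bit load: in the traceback above, the equivalent call inside
# KoboldAI fails in set_module_quantized_tensor_to_device when moving a
# tensor to the GPU ("illegal memory access").
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m", quantization_config=quant_config, device_map="auto"
)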

If I try launching with CUDA_LAUNCH_BLOCKING=1, it just gets stuck loading the second model (no error) and never finishes.
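
Note that CUDA_LAUNCH_BLOCKING only takes effect if it is set before the process initializes CUDA. A minimal sketch of setting it (the launch command shown is an assumption; use whatever script normally starts KoboldAI):

# Either set the variable in the shell that launches KoboldAI, e.g.
#   CUDA_LAUNCH_BLOCKING=1 python aiserver.py
# or, from Python, before torch/CUDA is first imported:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
import torch  # CUDA now reports kernel errors synchronously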

henk717 commented 1 year ago

This is a known issue we are still trying to solve; restarting KoboldAI is the best workaround until we figure out a solution.