build(deps): update bitsandbytes requirement from ~=0.40.2 to ~=0.41.0

Updates the requirements on bitsandbytes to permit the latest version.
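
For context, `~=` is the PEP 440 compatible release operator: `~=0.41.0` is equivalent to `>=0.41.0, ==0.41.*`, so the previous `~=0.40.2` pin could not resolve to 0.41.0. The sketch below illustrates that rule for plain numeric versions only. The helper names are hypothetical, and it ignores the pre-release, post-release, and epoch segments that a real resolver handles.

```python
def _parse(version: str) -> tuple:
    """Split a plain numeric version such as '0.41.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_compatible_release(spec_version: str, candidate: str) -> bool:
    """Return True if `candidate` satisfies `~=spec_version` (numeric versions only)."""
    base = _parse(spec_version)
    if len(base) < 2:
        raise ValueError("~= needs at least two version segments")
    cand = _parse(candidate)

    # Pad with zeros so that '0.41' and '0.41.0' compare as equal.
    width = max(len(base), len(cand))
    base_padded = base + (0,) * (width - len(base))
    cand_padded = cand + (0,) * (width - len(cand))

    # All segments except the last one of the spec must match exactly.
    prefix = base[:-1]
    return cand_padded >= base_padded and cand_padded[: len(prefix)] == prefix


print(is_compatible_release("0.40.2", "0.41.0"))  # False: old pin excludes 0.41.0
print(is_compatible_release("0.41.0", "0.41.3"))  # True
print(is_compatible_release("0.41.0", "0.42.0"))  # False
```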

Release notes

Bug and CUDA fixes + performance

Release 0.41.0 features an overhaul of the CUDA_SETUP routine. We trust PyTorch to find the proper CUDA binaries and use those. If you use a CUDA version that differs from the one PyTorch uses, you can now control which binary is loaded for bitsandbytes by setting the BNB_CUDA_VERSION environment variable. See the custom CUDA guide for more information.
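
As a rough illustration of the override described above, the sketch below sets BNB_CUDA_VERSION from Python. The helper name is hypothetical, and the digits-only value format (for example `118` for CUDA 11.8) is an assumption to verify against the custom CUDA guide. The variable most likely needs to be set before `import bitsandbytes`, because the CUDA setup runs at import time.

```python
import os


def select_bnb_cuda_version(version: str, env=None):
    """Store the requested CUDA version in BNB_CUDA_VERSION.

    `version` is assumed to be digits only, e.g. "118" for CUDA 11.8.
    Pass a dict as `env` to try this out without touching os.environ.
    """
    env = os.environ if env is None else env
    if not version.isdigit():
        raise ValueError(f"expected digits only, e.g. '118', got {version!r}")
    env["BNB_CUDA_VERSION"] = version
    return env


# Dry run against a throwaway dict instead of the real environment.
print(select_bnb_cuda_version("118", env={}))
```

In real use, call `select_bnb_cuda_version("118")` without `env` at the top of the program, before bitsandbytes is imported.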

Besides that, this release features a wide range of bug fixes, CUDA 11.8 support for Ada and Hopper GPUs, and an update for 4-bit inference performance.

Previous 4-bit inference kernels were optimized for RTX 4090 and Ampere A40 GPUs, but performance was poor on A100 GPUs, which are common. In this release, A100 performance improves by 40% and is now faster than 16-bit inference, while RTX 4090 and A40 performance is about 10% lower.

This leads to approximate speedups compared to 16-bit (BF16) of roughly:

RTX 4090: 3.8x

RTX 3090 / A40: 3.1x

A100: 1.5x

RTX 6000: 1.3x

RTX 2080 Ti: 1.1x
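
To make these factors concrete, the sketch below scales a measured BF16 throughput by the approximate speedups listed above. The function name and the tokens-per-second unit are illustrative assumptions; actual results depend on model, batch size, and sequence length.

```python
# Approximate 4-bit vs. BF16 speedups, copied from the release notes above.
APPROX_SPEEDUP_VS_BF16 = {
    "RTX 4090": 3.8,
    "RTX 3090 / A40": 3.1,
    "A100": 1.5,
    "RTX 6000": 1.3,
    "RTX 2080 Ti": 1.1,
}


def estimate_4bit_tokens_per_s(gpu: str, bf16_tokens_per_s: float) -> float:
    """Rough 4-bit throughput estimate from a measured BF16 baseline."""
    return bf16_tokens_per_s * APPROX_SPEEDUP_VS_BF16[gpu]


# Example: a hypothetical BF16 baseline of 20 tokens/s on an A100.
print(estimate_4bit_tokens_per_s("A100", 20.0))  # 30.0
```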

0.41.0

Features:

Added precompiled CUDA 11.8 binaries to support H100 GPUs without compilation #571

CUDA SETUP no longer looks for libcuda and libcudart and instead relies on the PyTorch CUDA libraries. To manually override this behavior, see how_to_use_nonpytorch_cuda.md. Thank you @rapsealk

Bug fixes:

Fixed a bug where the default type of absmax was undefined, which led to errors if the default type was different from torch.float32. #553

Fixed a missing scipy dependency in requirements.txt. #544

Fixed a bug where a view operation could cause an error in 8-bit layers.

Fixed a bug where CPU bitsandbytes would fail during the import. #593 Thank you @bilelomrani

Fixed a bug where a non-existent LD_LIBRARY_PATH variable led to a failure in python -m bitsandbytes #588

Removed outdated get_cuda_lib_handle calls that led to errors. #595 Thank you @ihsanturk

Fixed a bug where read permission was assumed for a file. #497

Fixed a bug where prefetchAsync led to errors on GPUs that support unified memory but not prefetching (Maxwell, SM52). #470 #451 #453 #477 Thank you @jllllll and @stoperro
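
The LD_LIBRARY_PATH fix above concerns a common failure mode: indexing an environment variable that is not set. The sketch below shows the general defensive pattern, not the library's actual code, and the function name is hypothetical.

```python
import os


def library_search_dirs(env=None) -> list:
    """Return the directories in LD_LIBRARY_PATH, or [] when it is unset or empty."""
    env = os.environ if env is None else env
    # .get() with a default avoids the KeyError that env["LD_LIBRARY_PATH"]
    # raises when the variable does not exist.
    raw = env.get("LD_LIBRARY_PATH", "")
    return [path for path in raw.split(os.pathsep) if path]


print(library_search_dirs({}))  # []
```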

Documentation:

Improved documentation for GPUs that do not support 8-bit matmul. #529

Added description and pointers for the NF4 data type. #543

User experience:

Improved handling of the default compute_dtype for Linear4bit layers: compute_dtype is set to the input dtype if that dtype is stable enough (float32 or bfloat16, but not float16).
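
The rule above can be sketched as a small pure-Python function. Dtypes are represented as strings so the example runs without PyTorch; the function name and the `float32` fallback for float16 inputs are assumptions for illustration, since the notes do not state which dtype is used in that case.

```python
# Input dtypes the release notes describe as stable enough to compute in.
STABLE_INPUT_DTYPES = {"float32", "bfloat16"}


def resolve_compute_dtype(input_dtype: str, fallback: str = "float32") -> str:
    """Pick a default compute dtype from the input dtype.

    `fallback` is an assumed default for inputs such as float16 that are
    not considered stable enough.
    """
    if input_dtype in STABLE_INPUT_DTYPES:
        return input_dtype
    return fallback


for dtype in ("float32", "bfloat16", "float16"):
    print(dtype, "->", resolve_compute_dtype(dtype))
```

Passing compute_dtype explicitly when constructing the layer avoids relying on this default.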

Performance:

Improved 4-bit inference performance for A100 GPUs. This slightly degraded performance for A40/RTX 3090 and RTX 4090 GPUs.

Deprecated:

8-bit quantization and optimizers that do not use blockwise quantization will be removed in 0.42.0. All blockwise methods will remain fully supported.

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

hyperonym / basaran

build(deps): update bitsandbytes requirement from ~=0.40.2 to ~=0.41.0 #233
