Open sozercan opened 5 months ago
Good point, and one I keep thinking about - one of the real challenges here is the set of libraries needed to make GPUs work.
For instance, on Intel GPUs you need Intel's toolkit, and that is expensive in terms of dependencies and container image size. A CPU-only user could get by with a very small image, without needing any deps for acceleration.
However, a single binary that bundles pre-compiled versions built with specific flag sets is the way to go; we can take care of the runtime dependencies later. We can already start by trying to squeeze all the backends built with each flag set into a single build.
I'm an AI bot, assisting with auditing tickets in the LocalAI repository.
To address your feature request, a good starting point would be to include all the backends built with their flag sets in a single build. From there we can work toward a single binary that checks capabilities and falls back when needed.
While the concerns about the libraries needed to make GPUs work are valid, moving toward a single binary would indeed simplify AIO by handling the selection logic automatically inside the binary. Once this is achieved, we can then focus on the runtime dependencies.
I'll update the issue with your input and thoughts. Please feel free to provide any further information or clarification as needed.
updated the issue with subtasks
Is your feature request related to a problem? Please describe.
LocalAI should ship as a single binary instead of separate builds for avx, avx2, cuda, etc.
Describe the solution you'd like
Support a single binary that can check host capabilities and fall back when needed. It should start with the GPU by checking for its libraries, reduce the number of offloaded layers if there is not enough VRAM, and finally fall back to the CPU, selecting the instruction set according to the host's capabilities.
This will make AIO simpler, as the logic will be handled automatically inside the binary.
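A minimal sketch of that selection order in Go, assuming hypothetical library paths and runtime names (this is not LocalAI's actual detection code): prefer CUDA when its libraries are found, otherwise pick the best embedded CPU backend for the host's instruction set.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// cudaAvailable reports whether any CUDA library matches the given glob
// patterns. The patterns passed in main are illustrative assumptions.
func cudaAvailable(globs []string) bool {
	for _, pattern := range globs {
		if matches, _ := filepath.Glob(pattern); len(matches) > 0 {
			return true
		}
	}
	return false
}

// selectCPURuntime picks the best embedded CPU backend for the host's
// instruction set, falling back to a portable build.
func selectCPURuntime(hasAVX2, hasAVX bool) string {
	switch {
	case hasAVX2:
		return "avx2"
	case hasAVX:
		return "avx"
	default:
		return "fallback"
	}
}

func main() {
	// Hypothetical library locations; a real implementation would also
	// verify that a CUDA device is present, not just the libraries.
	cudaGlobs := []string{
		"/usr/lib/x86_64-linux-gnu/libcuda.so*",
		"/usr/local/cuda/lib64/libcudart.so*",
	}
	if cudaAvailable(cudaGlobs) {
		fmt.Println("runtime: cuda")
	} else {
		fmt.Println("runtime:", selectCPURuntime(true, true))
	}
}
```

A production version would detect AVX/AVX2 at runtime (for example via CPUID) rather than taking them as parameters, and would probe VRAM before committing to the GPU path.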
Subtasks:
[x] embed avx, avx2 and fallback into localai
[x] embed cuda into localai
[x] auto select cpu runtimes (#2305)
[x] auto select cuda runtime (#2306)
[ ] better gpu detection by checking cuda libraries in addition to devices
[ ] check vram and adjust gpu offloaded layers automatically
[ ] compress before embed and decompress when extracting to save space
Describe alternatives you've considered
Additional context