unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0
Kernel check ignores LTS versions #1076

Open alansill opened 1 month ago

alansill commented 1 month ago

A number of our cluster users are trying out unsloth. Because we run LTS kernel versions, the kernel version check built into unsloth causes confusion. LTS kernel versions are numerically much lower than the recommended minimum but, as you know, they receive backported patches that keep their functionality close to that of newer kernels. Almost no HPC clusters will be running kernel versions as high as the minimum that unsloth checks for. (For details, see https://access.redhat.com/support/policy/updates/errata and the related lists for other distro releases.) I suggest that the unsloth kernel checks be refactored with this in mind. Unlike personal and hobbyist machines, large clusters almost never follow the frequent update schedule of the unstable branch.

danielhanchen commented 1 month ago

@alansill Hey Alan - sorry for the delay! Oh, do you mean Unsloth's Python dependencies should be pinned to a version to reduce dependency issues? Or maybe I'm mistaken?

alansill commented 1 month ago

No, I mean that when run on a Linux system with an LTS kernel, the code prints a warning that it might hang on kernels with versions lower than 5.5. But the kernel for Enterprise Linux, for example, always has a numerically much lower version, because it is intended to run for years with only the necessary patches applied, unlike the kernels used in distributions such as Fedora or Ubuntu.

danielhanchen commented 1 month ago

Ohh ok ok so the actual Linux kernel version - hmmm - unfortunately Unsloth relies on PyTorch and newer CUDA versions - that might be the culprit.

Unsloth does support Torch 2.1 and CUDA 11.8, so these might be more stable on older kernel versions.

alansill commented 1 month ago

Thanks. The point is that the kernel message is spurious. It only applies to unstable-branch kernels. Most HPC clusters run on the stable branch, in which kernels stay at numerically lower versions even though they have security and other needed patches for long-term continuous use for years. Is there any particular reason for the kernel version check to exist at all?
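
To illustrate the refactoring suggested in this thread, here is a minimal sketch of a version check that skips the warning on enterprise kernels. It is not the code unsloth or its dependencies actually use. The function names, the `.elN` marker pattern, and the decision to suppress the warning entirely on such kernels are all assumptions made for illustration; only the 5.5 threshold comes from the message described above.

```python
import platform
import re

# Threshold taken from the warning described in this thread.
MIN_KERNEL = (5, 5, 0)

# Assumption: Enterprise Linux kernels carry a ".el<N>" tag in their release
# string, e.g. "4.18.0-513.5.1.el8_9.x86_64". Other distros need other markers.
_ENTERPRISE_MARKER = re.compile(r"\.el\d+")


def parse_kernel_release(release):
    """Return (major, minor, patch) from a release string, or None if unparseable."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def is_enterprise_kernel(release):
    """Hypothetical helper: True if the release string looks like an EL kernel."""
    return _ENTERPRISE_MARKER.search(release) is not None


def should_warn(release, minimum=MIN_KERNEL):
    """Warn only when the version is low AND the kernel is not a backported build."""
    version = parse_kernel_release(release)
    if version is None:
        return False  # unknown format: stay quiet rather than guess
    if is_enterprise_kernel(release):
        return False  # numeric version says little about backported kernels
    return version < minimum


if __name__ == "__main__":
    release = platform.release()
    if should_warn(release):
        print(f"Kernel {release} is below {'.'.join(map(str, MIN_KERNEL))}")
```

A string match like this is only a heuristic. A more robust check would probably test for the specific kernel feature that causes the hang rather than compare version numbers, but the thread does not say which feature that is.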