Open Jackie0601zhou opened 4 months ago
It seems like internet access might not be working?
I used the HTTPS URL and chose GitHub as the Git provider when I added the repo to Databricks. I also installed specific versions of various packages. All the steps before trainer_stats = trainer.train() worked. But when I run trainer_stats = trainer.train(), it shows:
Currently getting the same issue as Jackie. I'm in a more regulated Databricks environment, so I have to download the repo first and install it through volumes. I suspect it's a dependency conflict, but I'm not sure where to start looking.
I'm also getting the same issue. I've tried installing different versions of the packages, but I end up with the same error.
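For anyone comparing environments while chasing a possible dependency conflict, here is a minimal sketch that prints the installed versions of the packages most likely to matter. The package list is an assumption, not something confirmed in this thread, so adjust it to your setup.

```python
from importlib import metadata

# Hypothetical list of packages worth comparing; not taken from the thread.
PACKAGES = [
    "unsloth", "torch", "transformers", "trl", "peft",
    "xformers", "bitsandbytes", "tensorboard", "mlflow",
]


def installed_versions(names=PACKAGES):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:15s} {version or 'not installed'}")
```

Posting this output alongside the Databricks runtime version would make it easier to see what differs between a working and a failing setup.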
Hmmm wait, is Databricks using MLflow?
Yes. By default, Databricks logs runs with MLflow.
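To rule the logging integrations in or out, one option is to switch them off before building the trainer. This is only a sketch, assuming the trainer is configured through Hugging Face `TrainingArguments`; `disable_experiment_loggers` is a hypothetical helper name, and whether this changes the crash is not established here.

```python
import os


def disable_experiment_loggers():
    """Disable the MLflow integration and return kwargs that turn off all reporters."""
    # Environment variable checked by the transformers MLflow integration.
    os.environ["DISABLE_MLFLOW_INTEGRATION"] = "TRUE"
    # report_to="none" asks TrainingArguments not to attach any logging callback
    # (MLflow, TensorBoard, W&B, ...).
    return {"report_to": "none"}


reporting_kwargs = disable_experiment_loggers()

# Usage (assumes transformers is installed; "outputs" is a placeholder path):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="outputs", **reporting_kwargs)
```

Call the helper before importing or constructing the trainer so the environment variable is already set when the integrations are detected.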
Hmmm ok - oh also is Databricks multi GPU?
In my instance, I'm only using a single GPU. It's possible to set up a multi GPU cluster, though.
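A quick way to confirm how many GPUs the notebook process actually sees, sketched under the assumption that PyTorch is installed on the cluster. Both function names are hypothetical helpers.

```python
import os


def visible_gpu_count():
    """Return the number of CUDA devices PyTorch can see, or None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.device_count()


def restrict_to_single_gpu(index=0):
    """Expose only one GPU to this process.

    This must run before CUDA is initialised (ideally before importing torch),
    otherwise it has no effect on the current process.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


if __name__ == "__main__":
    print("Visible GPUs:", visible_gpu_count())
```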
Hmmm tbh I haven't tried Databricks so I can't exactly debug it - I'll see what I can do, but can't promise anything, sorry
Wondering if any progress has been made on this? We are facing the same issue: we install from source and everything is okay until you hit trainer.train(), which fails with a segmentation fault.
Oh no a segfault?? :(
Think I have it tracked down to tensorboard, which means it's most likely a Databricks runtime fix, not an Unsloth one..
Fatal Python error: Segmentation fault
Thread 0x00007fe7c01f2640 (most recent call first):
  File "/usr/lib/python3.11/threading.py", line 324 in wait
  File "/usr/lib/python3.11/queue.py", line 180 in get
  File "/databricks/python/lib/python3.11/site-packages/tensorboard/summary/writer/event_file_writer.py", line 269 in _run
  File "/databricks/python/lib/python3.11/site-packages/tensorboard/summary/writer/event_file_writer.py", line 244 in run
  File "/usr/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/usr/lib/python3.11/threading.py", line 995 in _bootstrap
Tried turning off mlflow and installing an older tensorboard.. no luck so far, so leaving this here if anyone else wants to debug.. (Check the cluster driver logs for more info)
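For anyone else debugging, a traceback like the one above comes from Python's `faulthandler` module. If your driver logs do not show it, a minimal sketch like this writes the per-thread dump to a file of your choosing. The helper name and the `faulthandler.log` path are placeholders, not something from the Databricks runtime.

```python
import faulthandler


def enable_crash_tracebacks(log_path="faulthandler.log"):
    """Dump the tracebacks of all threads to log_path on a fatal signal such as SIGSEGV."""
    # The file object must stay open for as long as faulthandler is enabled,
    # so it is returned to the caller instead of being closed here.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


crash_log = enable_crash_tracebacks()
# ... then run the step that crashes, e.g. trainer_stats = trainer.train()
```

The dump only shows Python frames, so it points at which thread was active rather than at the native code that actually faulted.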
Update: The segmentation fault has been raised internally with Databricks; they have it down as a feature request. No ETA yet, but hopefully those of us in regulated environments will be able to use Unsloth soon.
@julianmukaj Thanks for the update!!
How can I install unsloth in a Databricks notebook? I tried pip install "unsloth[cu121-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git" and got:
Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-rst_tpkk/unsloth_e8849fa753954ad5b20ad0a81efbd0be
Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-rst_tpkk/unsloth_e8849fa753954ad5b20ad0a81efbd0be
fatal: unable to access 'https://github.com/unslothai/unsloth.git/': gnutls_handshake() failed: The TLS connection was non-properly terminated.
error: subprocess-exited-with-error
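The `gnutls_handshake() failed` message suggests the cluster could not complete a TLS connection to github.com, which often points at a proxy or firewall rather than at pip itself. A minimal sketch to test that from the notebook, independent of git; the function name is a hypothetical helper:

```python
import socket
import ssl


def can_complete_tls_handshake(host="github.com", port=443, timeout=5.0):
    """Return (ok, detail); ok is True if a verified TLS handshake with host succeeds."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                # On success, detail is the negotiated protocol version.
                return True, tls.version()
    except OSError as exc:  # covers DNS failures, timeouts, refusals, and ssl.SSLError
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    print(can_complete_tls_handshake())
```

If this returns `False`, outbound access to GitHub is likely blocked, and installing from a copy of the repo downloaded elsewhere (as described earlier in the thread) may be the only route.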