chimezie closed this issue 2 months ago
That is exceedingly odd. Something must not be getting evaluated that should be, for it to show up on the 880th iteration. Are you using the lora code unchanged?
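(For context: MLX arrays are lazy — each operation appends to a computation graph that is only materialized when `mx.eval` is called. If some array is never included in an eval, its graph grows a little every iteration until an internal depth limit trips. The sketch below is a toy pure-Python illustration of that failure mode, not MLX code; all names in it are made up for illustration.)

```python
class Lazy:
    """Toy lazy value: each op defers work by chaining onto the previous node."""
    def __init__(self, fn, depth=0):
        self.fn = fn        # thunk that produces the value when forced
        self.depth = depth  # length of the deferred chain behind this node

    def add(self, x):
        # Defer the addition: the new node closes over the old one.
        return Lazy(lambda prev=self: prev.fn() + x, self.depth + 1)

    def eval(self):
        # Force the whole chain now and restart from a depth-0 node,
        # analogous to what mx.eval does for an MLX array.
        return Lazy(lambda v=self.fn(): v, 0)

MAX_DEPTH = 100  # stand-in for the framework's internal graph-depth limit

acc = Lazy(lambda: 0.0)
for step in range(1, 1001):
    acc = acc.add(1.0)
    if acc.depth > MAX_DEPTH:
        raise RuntimeError("Graph depth exceeded maximum allowed limit.")
    if step % 50 == 0:  # periodic eval keeps the chain shallow
        acc = acc.eval()

result = acc.eval().fn()
print(result)  # 1000.0
```

Remove the periodic `acc.eval()` and the chain reaches `MAX_DEPTH` after a fixed number of steps, which is why this kind of bug surfaces at a deterministic iteration.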
python -m mlx_lm.lora --model microsoft/phi-2 --train --data ../lora/data --batch-size 1 --iters 1000
That ran fine for 1k iterations. I am trying with this now:
python -m mlx_lm.lora --model mlx-community/Nous-Hermes-2-Mistral-7B-DPO-4bit-MLX --train --data ../lora/data --batch-size 1 --iters 1000
This also ran fine for the full 1k iterations:
python -m mlx_lm.lora --model mlx-community/Nous-Hermes-2-Mistral-7B-DPO-4bit-MLX --train --data ../lora/data --batch-size 1 --iters 1000
I would double-check that you didn't change anything else in the code or model. Without knowing more, I suspect there is a change in there somewhere that is causing the issue.
That is exceedingly odd. Something must not be getting evaluated that should be, for it to show up on the 880th iteration. Are you using the lora code unchanged?
Yes. The first thing I did was make sure I was using the code from the repo without local changes, then (re-)installed. The weirder thing is that I can't seem to easily replicate it: some runs complete without issue and others don't. I'll try your runs on my machine.
The weirder thing is that I can't seem to easily replicate it. Some runs complete without issue and others don't.
Any idea on the fraction that fail? Also could you share the MLX version you are using (commit hash if building from source)?
I don't have a good sense of the fraction that fail, so I'm going to try a few repeated runs of yours locally.
Here is the command-line session showing the source commit hash and how mlx-lm was built from source before the runs:
% git pull
Already up to date.
% git diff
% git rev-parse HEAD
8c2cf665ed598bf2c9b72b068f93b657f2615122
% pip install -U -e llms
[.. snip ..]
Successfully built mlx-lm
Installing collected packages: mlx-lm
Attempting uninstall: mlx-lm
Found existing installation: mlx-lm 0.1.0
Uninstalling mlx-lm-0.1.0:
Successfully uninstalled mlx-lm-0.1.0
Successfully installed mlx-lm-0.1.0
What about the MLX version, not the MLX LM version? Are you using 0.6 or building from source?
Ahh. My bad. I'm building MLX from source as well:
% git pull
Already up to date.
% git diff
% git rev-parse HEAD
28301807c2c5d7c42c25c139d6dfa26a8910438e
% env MACOSX_DEPLOYMENT_TARGET=14.2.1 CC=gcc CXX=g++ SDKROOT=`xcrun --show-sdk-path` CMAKE_BUILD_PARALLEL_LEVEL="" pip install .
[..snip..]
Successfully built mlx
Installing collected packages: mlx
Attempting uninstall: mlx
Found existing installation: mlx 0.6.0
Uninstalling mlx-0.6.0:
Successfully uninstalled mlx-0.6.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
mlx-lm 0.1.0 requires mlx>=0.6, but you have mlx 0.6.0.dev20240308+28301807 which is incompatible.
Successfully installed mlx-0.6.0.dev20240308+28301807
I'm getting it consistently. 4/4 runs:
Run 1
% python -m mlx_lm.lora --model mlx-community/Nous-Hermes-2-Mistral-7B-DPO-4bit-MLX --train --data lora/data \
--batch-size 1 --iters 1000
[..snip..]
Iter 880: Train loss 0.787, Learning Rate 1.000e-05, It/sec 2.733, Tokens/sec 294.029, Trained Tokens 85104
Traceback (most recent call last):
[..snip..]
File "[..]/mlx-examples/llms/mlx_lm/tuner/trainer.py", line 182, in train
mx.eval(model.parameters(), optimizer.state, lvalue)
RuntimeError: [eval] Graph depth exceeded maximum allowed limit. Try evaluating the graph more frequently.
Run 2 (same command line)
[..snip..]
Iter 880: Train loss 0.780, Learning Rate 1.000e-05, It/sec 2.729, Tokens/sec 293.630, Trained Tokens 85104
[ .. same traceback .. ]
Run 3
[ .. same traceback at the same iteration .. ]
Run 4
[ .. same traceback at the same iteration .. ]
What machine / OS?
I was able to repro, looks like a bug somewhere in MLX LM or MLX, looking into it.
What machine / OS?
Apple Mac Studio M1 Ultra
@chimezie there is a bug in our core library which is causing this. The fix may take some more time, but in the meantime I put a patch in mlx-examples that you can use (just don't use dropout for now) and your models should run.
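(For readers hitting this later: my reading of the above — not confirmed in detail in this thread — is that some piece of per-step state, e.g. state consumed by dropout, was not covered by the trainer's `mx.eval(model.parameters(), optimizer.state, lvalue)` call, so its lazy graph grew by a fixed amount every iteration until the depth limit tripped at a deterministic step. A toy sketch of why evaluating *most* of the state each step is not enough; all names and numbers are illustrative.)

```python
def run(iters, limit, eval_both):
    """Simulate two lazily-accumulated graphs; maybe only one is evaluated.

    Returns the iteration at which the depth limit trips, or None if the
    run completes all iterations.
    """
    depth_a = depth_b = 0
    for step in range(1, iters + 1):
        depth_a += 1   # e.g. the loss / parameter graph
        depth_b += 1   # e.g. hidden state not passed to the per-step eval
        depth_a = 0    # the per-step eval collapses the covered graph
        if eval_both:
            depth_b = 0  # the fix: evaluate the hidden state too
        if max(depth_a, depth_b) > limit:
            return step  # deterministic failure iteration
    return None

print(run(1000, 880, eval_both=False))  # 881 -> trips just past the limit
print(run(1000, 880, eval_both=True))   # None -> completes all iterations
```

This also matches the symptom in the thread: the uncovered graph grows at a constant rate, so every run fails at the same iteration.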
Excellent. Thanks!
While running the current git version of mlx_lm on the included training set against a quantized, converted copy of teknium/OpenHermes-2.5-Mistral-7B that I have been using for a while without issue, I'm getting a new exception I haven't seen before:
I think this is related to the changes from #797.