chimezie closed this issue 2 months ago
That is exceedingly odd. Something must not be getting evaluated that should be, for it to show up on the 880th iteration. Are you using the lora code unchanged?
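(For context: MLX arrays are lazy — each operation appends to a computation graph that is only materialized when `mx.eval` is called. If some array is never included in an eval, its graph grows a little every iteration until an internal depth limit trips. The sketch below is a toy pure-Python illustration of that failure mode, not MLX code; all names in it are made up for illustration.)

```python
class Lazy:
    """Toy lazy value: each op defers work by chaining onto the previous node."""
    def __init__(self, fn, depth=0):
        self.fn = fn        # thunk that produces the value when forced
        self.depth = depth  # length of the deferred chain behind this node

    def add(self, x):
        # Defer the addition: the new node closes over the old one.
        return Lazy(lambda prev=self: prev.fn() + x, self.depth + 1)

    def eval(self):
        # Force the whole chain now and restart from a depth-0 node,
        # analogous to what mx.eval does for an MLX array.
        return Lazy(lambda v=self.fn(): v, 0)

MAX_DEPTH = 100  # stand-in for the framework's internal graph-depth limit

acc = Lazy(lambda: 0.0)
for step in range(1, 1001):
    acc = acc.add(1.0)
    if acc.depth > MAX_DEPTH:
        raise RuntimeError("Graph depth exceeded maximum allowed limit.")
    if step % 50 == 0:  # periodic eval keeps the chain shallow
        acc = acc.eval()

result = acc.eval().fn()
print(result)  # 1000.0
```

Remove the periodic `acc.eval()` and the chain reaches `MAX_DEPTH` after a fixed number of steps, which is why this kind of bug surfaces at a deterministic iteration.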
python -m mlx_lm.lora --model microsoft/phi-2 --train --data ../lora/data --batch-size 1 --iters 1000
That ran fine for 1k iterations. I am trying with this now:
python -m mlx_lm.lora --model mlx-community/Nous-Hermes-2-Mistral-7B-DPO-4bit-MLX --train --data ../lora/data --batch-size 1 --iters 1000
This also ran fine for the full 1k iterations:
python -m mlx_lm.lora --model mlx-community/Nous-Hermes-2-Mistral-7B-DPO-4bit-MLX --train --data ../lora/data --batch-size 1 --iters 1000
I would double-check that you didn't change anything else in the code or model. Without knowing more, I suspect there is a change in there somewhere that is causing the issue.
That is exceedingly odd. Something must not be getting evaluated that should be, for it to show up on the 880th iteration. Are you using the lora code unchanged?
Yes. The first thing I did was make sure I was using the code from the repo without local changes, then (re-)installed. The weirder thing is that I can't seem to easily replicate it: some runs complete without issue and others don't. I'll try your runs on my machine.
The weirder thing is that I can't seem to easily replicate it. Some runs complete without issue and others don't.
Any idea on the fraction that fail? Also could you share the MLX version you are using (commit hash if building from source)?
I don't have a good sense of the fraction that fail, so I'm going to try a few repeated runs of yours locally.
Here is the command-line session showing the source commit hash and how mlx-lm was built from source before the runs:
% git pull
Already up to date.
% git diff
% git rev-parse HEAD
8c2cf665ed598bf2c9b72b068f93b657f2615122
% pip install -U -e llms
[.. snip ..]
Successfully built mlx-lm
Installing collected packages: mlx-lm
Attempting uninstall: mlx-lm
Found existing installation: mlx-lm 0.1.0
Uninstalling mlx-lm-0.1.0:
Successfully uninstalled mlx-lm-0.1.0
Successfully installed mlx-lm-0.1.0
What about the MLX version, not the MLX LM version? Are you using 0.6 or building from source?
Ahh. My bad. I'm building MLX from source as well:
% git pull
Already up to date.
% git diff
% git rev-parse HEAD
28301807c2c5d7c42c25c139d6dfa26a8910438e
% env MACOSX_DEPLOYMENT_TARGET=14.2.1 CC=gcc CXX=g++ SDKROOT=`xcrun --show-sdk-path` CMAKE_BUILD_PARALLEL_LEVEL="" pip install .
[..snip..]
Successfully built mlx
Installing collected packages: mlx
Attempting uninstall: mlx
Found existing installation: mlx 0.6.0
Uninstalling mlx-0.6.0:
Successfully uninstalled mlx-0.6.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
mlx-lm 0.1.0 requires mlx>=0.6, but you have mlx 0.6.0.dev20240308+28301807 which is incompatible.
Successfully installed mlx-0.6.0.dev20240308+28301807
I'm getting it consistently. 4/4 runs:
Run 1
% python -m mlx_lm.lora --model mlx-community/Nous-Hermes-2-Mistral-7B-DPO-4bit-MLX --train --data lora/data \
--batch-size 1 --iters 1000
[..snip..]
Iter 880: Train loss 0.787, Learning Rate 1.000e-05, It/sec 2.733, Tokens/sec 294.029, Trained Tokens 85104
Traceback (most recent call last):
[..snip..]
File "[..]/mlx-examples/llms/mlx_lm/tuner/trainer.py", line 182, in train
mx.eval(model.parameters(), optimizer.state, lvalue)
RuntimeError: [eval] Graph depth exceeded maximum allowed limit. Try evaluating the graph more frequently.
Run 2 (same command line)
[..snip..]
Iter 880: Train loss 0.780, Learning Rate 1.000e-05, It/sec 2.729, Tokens/sec 293.630, Trained Tokens 85104
[ .. same traceback .. ]
Run 3
[ .. same traceback at the same iteration .. ]
Run 4
[ .. same traceback at the same iteration .. ]
What machine / OS?
I was able to repro, looks like a bug somewhere in MLX LM or MLX, looking into it.
What machine / OS?
Apple Mac Studio M1 Ultra
@chimezie there is a bug in our core library which is causing this. The fix may take some more time, but in the meantime I put a patch in mlx-examples that you can use (just don't use dropout for now) and your models should run.
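(For readers hitting this later: my reading of the above — not confirmed in detail in this thread — is that some piece of per-step state, e.g. state consumed by dropout, was not covered by the trainer's `mx.eval(model.parameters(), optimizer.state, lvalue)` call, so its lazy graph grew by a fixed amount every iteration until the depth limit tripped at a deterministic step. A toy sketch of why evaluating *most* of the state each step is not enough; all names and numbers are illustrative.)

```python
def run(iters, limit, eval_both):
    """Simulate two lazily-accumulated graphs; maybe only one is evaluated.

    Returns the iteration at which the depth limit trips, or None if the
    run completes all iterations.
    """
    depth_a = depth_b = 0
    for step in range(1, iters + 1):
        depth_a += 1   # e.g. the loss / parameter graph
        depth_b += 1   # e.g. hidden state not passed to the per-step eval
        depth_a = 0    # the per-step eval collapses the covered graph
        if eval_both:
            depth_b = 0  # the fix: evaluate the hidden state too
        if max(depth_a, depth_b) > limit:
            return step  # deterministic failure iteration
    return None

print(run(1000, 880, eval_both=False))  # 881 -> trips just past the limit
print(run(1000, 880, eval_both=True))   # None -> completes all iterations
```

This also matches the symptom in the thread: the uncovered graph grows at a constant rate, so every run fails at the same iteration.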
Excellent. Thanks!
While running the current git version of mlx_lm on the included training set against a quantized, converted copy of teknium/OpenHermes-2.5-Mistral-7B that I have been using for a while without issue, I'm getting a new exception I haven't seen before:
I think this is related to the changes from #797.