I tried a LoRA run (no quantization) using the smaller dataset against the older, fused checkpoint, and it is still running without a hard restart (the failed attempts had restarted by this point):
Iter 1: Val loss 4.171, Val took 3415.301s
Iter 849: Train loss 1.753, Learning Rate 9.171e-07, It/sec 0.274, Tokens/sec 72.280, Trained Tokens 224201, Peak mem 43.625 GB
Iter 1698: Train loss 0.996, Learning Rate 1.825e-06, It/sec 0.273, Tokens/sec 71.112, Trained Tokens 445725, Peak mem 43.625 GB
Iter 2547: Train loss 0.913, Learning Rate 2.733e-06, It/sec 0.275, Tokens/sec 72.344, Trained Tokens 668681, Peak mem 48.001 GB
Iter 3396: Train loss 0.875, Learning Rate 3.641e-06, It/sec 0.113, Tokens/sec 29.801, Trained Tokens 892774, Peak mem 48.001 GB
Iter 4245: Train loss 0.856, Learning Rate 4.550e-06, It/sec 0.135, Tokens/sec 35.567, Trained Tokens 1116519, Peak mem 48.001 GB
Iter 5094: Train loss 0.839, Learning Rate 5.458e-06, It/sec 0.135, Tokens/sec 35.386, Trained Tokens 1338887, Peak mem 48.001 GB
Does it crash if you don't set the sysctl limit?
From what I can recall when I initially started trying to narrow this down, yes. But once this run completes (or fails), I'll try again without the limit.
Thanks! So just to be sure:
The QLoRA run crashes on both the larger and smaller datasets. The regular LoRA was still running earlier today, many hours after the point at which the QLoRA runs were crashing. So, I'm not sure it will run to completion, but it looks very much like it will so far.
If the smaller one runs to completion, I was planning to try LoRA on the larger dataset to see whether it gets beyond the first validation/eval and at least two loss reports. But so far, quantization seems to be the main factor in when it consistently crashes.
wiring the memory occupied by the model and cache
What do you mean by that?
I saw the new section on large models and what it says about 'wiring' them if they "are large relative to the total RAM available on the machine", and wasn't sure, given the size of the model I'm working with (12.2B) and the memory available on my M1 Max (32GB), whether the suggestion was relevant to this situation or only appropriate for memory management during generation.
Got it. Are you using mx.metal.set_wired_limit or just setting the sysctl? Either way, it would be good to know if it still crashes if you don't use those features (don't set the wired limit in MLX and unset the sysctl).
I'm using sysctl and not mx.metal.set_wired_limit. I started the smaller QLoRA run without this setting, and it (like the large LoRA attempt) also made it beyond the point where the earlier attempts were crashing.
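For reference, the two mechanisms discussed in this exchange look roughly like the sketch below. The 18 GB figure mirrors what is described later in this report; the `iogpu.wired_limit_mb` sysctl name and the exact `set_wired_limit` signature are assumptions that should be checked against the MLX docs for your installed version.

```python
# Illustrative sketch of the two ways to raise the wired-memory limit
# discussed here; values are assumptions (18 GB, as in this report).

# Option 1: system-wide sysctl (what is being used in this report); run from a shell:
#   sudo sysctl iogpu.wired_limit_mb=18432

# Option 2: per-process, via the MLX API mentioned above:
import mlx.core as mx

gb = 18
mx.metal.set_wired_limit(gb * 1024**3)  # limit assumed to be given in bytes
```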
My computer is consistently hard-restarted by a LoRA run against 4-bit float32 quants (made via the current mlx_lm.convert) of older checkpoints (fused around August, prior to the #1062 PR) that are themselves LoRA fine-tunes of Mistral Nemo. I get the same problem with freshly quantized Mistral Nemo models (no adapters) on the same datasets, on an Apple M1 Max (32GB) running macOS Sequoia 15.1.
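For readers following along, the quants referred to above were made with mlx_lm.convert; a minimal sketch of such a conversion via the Python API is below. The Hugging Face path, output directory, and parameter names reflect recent mlx_lm releases and are illustrative assumptions, not the exact checkpoints or settings used in this report.

```python
# Minimal sketch of producing a 4-bit quantized copy of a model with
# mlx_lm.convert. Paths are placeholders, not the checkpoints used here.
from mlx_lm import convert

convert(
    "mistralai/Mistral-Nemo-Instruct-2407",  # or a local fused checkpoint dir
    mlx_path="mistral-nemo-4bit",            # output directory
    quantize=True,                           # enable quantization
    q_bits=4,                                # 4-bit weights, as in this report
)
```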
The process gets beyond the first evaluation, with the following validation loss and time:
Then, at some point much later, the machine restarts with the following error in the GUI:
Your computer was restarted because of a problem
This is the first part of the problem report for macOS:
The earlier checkpoint had the following relevant configuration parameters after quantization:
It was the result of a larger LoRA run that I wanted to amend with a relatively smaller dataset. The smaller dataset I was using for the amendment has 275K records totaling 44M tokens, 434.157 tokens per step, and an average of 162 tokens per record. Two of the records had more than 2048 tokens and were subject to truncation. There was also a validation set of 32K records.
The command line from the attempt above to QLoRA-train the checkpoint on the smaller dataset was:
The full dataset has 945K records, comprising the smaller public datasets and a larger (private) dataset, for a total of 85.6M tokens, at 242 tokens per step, an average of 91 tokens per record, and a validation set of 75K records.
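As a quick sanity check of the per-record averages quoted above (using the rounded record and token counts, so small discrepancies against the reported figures are expected):

```python
# Rough check of the per-record token averages quoted above,
# using the rounded counts from this report.
smaller = {"records": 275_000, "tokens": 44_000_000}
full = {"records": 945_000, "tokens": 85_600_000}

for name, d in (("smaller", smaller), ("full", full)):
    avg = d["tokens"] / d["records"]
    print(f"{name}: ~{avg:.0f} tokens per record")
# smaller: ~160 tokens per record (reported: 162)
# full:    ~91 tokens per record  (reported: 91)
```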
I tried a QLoRA run using the larger dataset against a newly downloaded quantized Nemo model (no adapters) but still got the hard restart after the initial evaluation and subsequent training/model evaluation:
In all cases, I have wired my system to 18GB, per the recent mlx_lm generation documentation regarding large models and wiring the memory occupied by the model and cache, since I only have 32GB and Nemo may be large relative to the total RAM available on the machine for MLX (if not for inference, then perhaps for training).