ml-explore / mlx-examples

Examples in the MLX framework

"Your computer was restarted because of a problem" after QLoRa on Mistral Nemo fused and (later) quantized checkpoint #1091

chimezie closed this issue 2 weeks ago

chimezie commented 3 weeks ago

My computer keeps getting consistently restarted by a LoRA run against 4-bit float32 quants (made via the current mlx_lm.convert) of older checkpoints (fused around August, prior to PR #1062) that are themselves LoRA fine-tunes of Mistral Nemo. I get the same problem with freshly quantized Mistral Nemo models (no adapters) on the same datasets, on an Apple M1 Max 32GB running macOS Sequoia 15.1.
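
For context, the conversion that produced these 4-bit float32 quants was along the lines of the sketch below. This is not the exact command; it uses the mlx_lm Python convert API with placeholder paths, and the keyword names are assumed to mirror the CLI flags (-q, --q-bits, --q-group-size, --dtype):

# Hedged sketch of the quantization step (paths are placeholders, keywords assumed).
from mlx_lm import convert

convert(
    hf_path="/path/to/Mistral-Nemo/fused-lora-checkpoint",   # fused LoRA checkpoint
    mlx_path="/path/to/Mistral-Nemo/fused-lora-checkpoint/4-bit-float32-quantized",
    quantize=True,     # -q
    q_bits=4,          # "bits": 4 in the resulting config
    q_group_size=64,   # "group_size": 64
    dtype="float32",   # keep the non-quantized parameters in float32
)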

mlx % git rev-parse HEAD
9bd3a7102fc1dccb10e2a93c320e77f4d32c01de
mlx-examples % git rev-parse HEAD
0f799947d0c73ff4901ce17188aceaa933b3c02e
model: "/path/to/Mistral-Nemo/fused-lora-checkpoint/4-bit-float32-quantized"
train: true

data: "[..]"

num_layers: 20
batch_size: 4

learning_rate: 6e-6

adapter_path: "[..]"

seed: 42

eos_token: "[/INST]"

lr_schedule:
  name: "cosine_decay"
  warmup: 5600
  warmup_init: 1e-8
  arguments: [6e-6, 100000, 1e-7]

lora_parameters:
  keys: ["self_attn.q_proj", "self_attn.v_proj", "self_attn.k_proj", "self_attn.o_proj"]
  rank: 32
  dropout: 0.3205 
  scale: 20.0
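
As I read it, the lr_schedule block above amounts to a linear warmup into a cosine decay, roughly equivalent to composing the mlx.optimizers schedule helpers like this (my reading of the config, not code taken from mlx_lm):

# Rough equivalent of the lr_schedule config above, assuming it is built
# from mlx.optimizers schedule helpers.
import mlx.optimizers as optim

warmup_steps = 5600
warmup = optim.linear_schedule(1e-8, 6e-6, warmup_steps)   # warmup_init -> learning_rate
decay = optim.cosine_decay(6e-6, 100000, 1e-7)             # arguments: [init, decay_steps, end]
lr_schedule = optim.join_schedules([warmup, decay], [warmup_steps])

# e.g., optimizer = optim.Adam(learning_rate=lr_schedule)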

The process gets beyond the first evaluation, with the following validation loss and time:

Iter 1: Val loss 2.366, Val took 8634.032s

Then, at some point much later, the machine restarts with the following error in the GUI:

Your computer was restarted because of a problem

This is the first part of the problem report for macOS:

panic(cpu 0 caller 0xfffffe0022646190): watchdog timeout: no checkins from watchdogd in 92 seconds (3893 total checkins since monitoring last enabled)
Debugger message: panic
Memory ID: 0x1
OS release type: User
OS version: 24B83
Kernel version: Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:15 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6000
Fileset Kernelcache UUID: A22C571B61C24E448B9EFFEEFF5F8CC7
Kernel UUID: 8FF94A3F-7153-35AD-8150-EC096C2596DE
Boot session UUID: 338E844A-A05C-4DB2-9298-502C4744352E
iBoot version: iBoot-11881.41.5
secure boot?: YES
roots installed: 0
Paniclog version: 14
KernelCache slide: 0x0000000019470000
KernelCache base:  0xfffffe0020474000
Kernel slide:      0x0000000019478000
Kernel text base:  0xfffffe002047c000
Kernel text exec slide: 0x000000001ab84000
Kernel text exec base:  0xfffffe0021b88000
mach_absolute_time: 0xda080a41a7

[..snip..]

The earlier checkpoint had the following relevant configuration parameters after quantization:

    "quantization": {
        "group_size": 64,
        "bits": 4
    },
    "quantization_config": {
        "group_size": 64,
        "bits": 4
    },
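
A quick way to confirm that a converted checkpoint carries these settings (a trivial sketch; the path is a placeholder):

# Read the quantization settings from the converted checkpoint's config.json
# (the path is a placeholder for the local model directory).
import json
from pathlib import Path

model_dir = Path("/path/to/Mistral-Nemo/fused-lora-checkpoint/4-bit-float32-quantized")
config = json.loads((model_dir / "config.json").read_text())
print(config.get("quantization"))         # expected: {"group_size": 64, "bits": 4}
print(config.get("quantization_config"))  # same settings, second copy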

It was the result of a larger LoRA run that I wanted to amend with a relatively smaller dataset. The smaller dataset I was using for the amendment has 275K records totaling 44M tokens, averaging 434.157 tokens per step and 162 tokens per record. Two of the records had more than 2048 tokens and were subject to truncation. There was also a validation set of 32K records.

The command line used to QLoRA-train the checkpoint on the smaller dataset (with the config above) was:

% mlx_lm.lora --val-batches 2711 \
              --steps-per-report 1032 \
              --steps-per-eval 17210 \
              --save-every 34421 \
              --iters 103264 -c path/to/config.yaml

The full dataset has 945K records, comprising the smaller public datasets and a larger (private) dataset, for a total of 85.6M tokens, averaging 242 tokens per step and 91 tokens per record, with a validation set of 75K records.

I tried a QLoRA run using the larger dataset against a newly downloaded quantized Nemo model (no adapters), but I still get the hard restart after the initial evaluation and the subsequent training/evaluation:

% mlx_lm.lora --val-batches 6239 \
              --steps-per-report 3543 \
              --steps-per-eval 59051 \
              --save-every 118103 \
              --iters 354309 -c /path/to/config.yaml
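
For scale, both iteration counts work out to roughly one and a half passes over their respective training sets at batch size 4 (my own back-of-the-envelope arithmetic):

# Iterations vs. dataset size at batch_size = 4.
batch_size = 4
print(103264 * batch_size / 275_000)   # smaller dataset: ~1.50 epochs
print(354309 * batch_size / 945_000)   # larger dataset:  ~1.50 epochs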

In all cases, I have wired my system memory to 18GB, per the recent mlx_lm generation documentation on large models and wiring the memory occupied by the model and cache, since I only have 32GB and Nemo may be large relative to the total RAM available to MLX on this machine (if not for inference, then perhaps for training):

sudo sysctl iogpu.wired_limit_mb=18000

chimezie commented 3 weeks ago

I tried a plain LoRA run (no quantization) using the smaller dataset against the older, fused checkpoint, and it is still running without a hard restart (the failed attempts had already restarted by this point):

Iter 1: Val loss 4.171, Val took 3415.301s
Iter 849: Train loss 1.753, Learning Rate 9.171e-07, It/sec 0.274, Tokens/sec 72.280, Trained Tokens 224201, Peak mem 43.625 GB
Iter 1698: Train loss 0.996, Learning Rate 1.825e-06, It/sec 0.273, Tokens/sec 71.112, Trained Tokens 445725, Peak mem 43.625 GB
Iter 2547: Train loss 0.913, Learning Rate 2.733e-06, It/sec 0.275, Tokens/sec 72.344, Trained Tokens 668681, Peak mem 48.001 GB
Iter 3396: Train loss 0.875, Learning Rate 3.641e-06, It/sec 0.113, Tokens/sec 29.801, Trained Tokens 892774, Peak mem 48.001 GB
Iter 4245: Train loss 0.856, Learning Rate 4.550e-06, It/sec 0.135, Tokens/sec 35.567, Trained Tokens 1116519, Peak mem 48.001 GB
Iter 5094: Train loss 0.839, Learning Rate 5.458e-06, It/sec 0.135, Tokens/sec 35.386, Trained Tokens 1338887, Peak mem 48.001 GB
awni commented 3 weeks ago

Does it crash if you don't set the sysctl limit?

chimezie commented 3 weeks ago

Does it crash if you don't set the sysctl limit?

From what I can recall when I initially started trying to narrow this down, yes. But once this run completes (or fails), I'll try again without the limit.

awni commented 3 weeks ago

Thanks! So just to be sure:

chimezie commented 3 weeks ago

The QLoRA run crashes on both the larger and smaller datasets. The regular LoRA was still running earlier today, many hours after the point at which the QLoRA runs were crashing. So, I'm not sure it will run to completion, but it looks very much like it will so far.

I was planning to try LoRA on the larger dataset if the smaller one runs to completion, to see if it gets beyond the first validation/eval and at least two loss reports. But so far, quantization seems to be the main factor in whether it consistently crashes.

awni commented 3 weeks ago

wiring the memory occupied by the model and cache

What do you mean by that?

chimezie commented 3 weeks ago

wiring the memory occupied by the model and cache

What do you mean by that?

I saw the new section on large models and what it says about 'wiring' them if they "are large relative to the total RAM available on the machine", and I wasn't sure, given the size of the model I'm working with (12.2B) and the memory available on my M1 Max (32GB), whether the suggestion was relevant to this situation or only appropriate for memory management during generation.
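
For what it's worth, my rough weights-only arithmetic (ignoring KV cache, activations, and optimizer state, and assuming an fp16 scale and bias per quantization group, which is only my guess at the layout) was:

# Weights-only memory estimates for a 12.2B-parameter model (rough, my own figures).
params = 12.2e9
print(params * 4 / 2**30)               # float32:              ~45.4 GiB
print(params * 2 / 2**30)               # bfloat16/float16:     ~22.7 GiB
print(params * (0.5 + 4 / 64) / 2**30)  # 4-bit, group size 64:  ~6.4 GiB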

awni commented 3 weeks ago

Got it. Are you using mx.metal.set_wired_limit or just setting the sysctl? Either way it would be good to know if it still crashes if you don't use those features (don't set the wired limit in MLX and unset the sysctl).
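
For reference, the in-process route would look roughly like this (a minimal sketch; it assumes the limit is given in bytes, and 18 GB just mirrors the sysctl value above):

import mlx.core as mx

# Assumes a byte-sized limit; 18 GB mirrors the earlier sysctl setting.
mx.metal.set_wired_limit(18 * 1024**3)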

chimezie commented 2 weeks ago

I'm using sysctl, not mx.metal.set_wired_limit. I started the smaller QLoRA run without this setting, and it (like the large LoRA attempt) has also made it beyond the point where the earlier attempts were crashing.