ml-explore / mlx-examples

Examples in the MLX framework

Peak mem 201 GB running on M2 Ultra 192 GB, how is this possible? #873

Closed · alphrc closed this issue 4 months ago

alphrc commented 4 months ago

I am fine-tuning a 72B model (Qwen/Qwen2-72B-Instruct) on ~50,000 examples. The peak memory reaches 201 GB, but my machine only has 192 GB of RAM. It is still running fine at the moment. How is this possible? Could it cause problems later in the run?

Loading pretrained model
Fetching 44 files: 100%|████████████████████████████████████████████████████████████████| 44/44 [00:00<00:00, 67575.75it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading datasets
Training
Trainable parameters: 0.005% (3.277M/72706.204M)
Starting training..., iters: 400
Iter 1: Val loss 2.866, Val took 840.432s
Iter 10: Train loss 2.959, Learning Rate 1.000e-05, It/sec 0.024, Tokens/sec 45.325, Trained Tokens 18995, Peak mem 182.359 GB
Iter 20: Train loss 2.752, Learning Rate 1.000e-05, It/sec 0.059, Tokens/sec 92.930, Trained Tokens 34731, Peak mem 182.359 GB
Iter 30: Train loss 2.407, Learning Rate 1.000e-05, It/sec 0.030, Tokens/sec 61.197, Trained Tokens 54922, Peak mem 182.359 GB
Iter 40: Train loss 1.979, Learning Rate 1.000e-05, It/sec 0.030, Tokens/sec 52.952, Trained Tokens 72432, Peak mem 182.359 GB
Iter 50: Train loss 1.810, Learning Rate 1.000e-05, It/sec 0.024, Tokens/sec 49.628, Trained Tokens 92852, Peak mem 200.919 GB
Iter 60: Train loss 1.515, Learning Rate 1.000e-05, It/sec 0.027, Tokens/sec 55.897, Trained Tokens 113732, Peak mem 200.919 GB
Iter 70: Train loss 1.472, Learning Rate 1.000e-05, It/sec 0.037, Tokens/sec 59.750, Trained Tokens 129951, Peak mem 200.919 GB
Iter 80: Train loss 1.330, Learning Rate 1.000e-05, It/sec 0.028, Tokens/sec 48.389, Trained Tokens 147179, Peak mem 200.919 GB
Iter 90: Train loss 1.158, Learning Rate 1.000e-05, It/sec 0.046, Tokens/sec 68.470, Trained Tokens 161968, Peak mem 200.919 GB
Iter 100: Train loss 1.160, Learning Rate 1.000e-05, It/sec 0.066, Tokens/sec 90.592, Trained Tokens 175789, Peak mem 200.919 GB
Iter 100: Saved adapter weights to adapters/adapters.safetensors and adapters/0000100_adapters.safetensors.
Iter 110: Train loss 1.277, Learning Rate 1.000e-05, It/sec 0.043, Tokens/sec 70.588, Trained Tokens 192067, Peak mem 200.919 GB
Iter 120: Train loss 1.249, Learning Rate 1.000e-05, It/sec 0.039, Tokens/sec 60.980, Trained Tokens 207868, Peak mem 200.919 GB
Iter 130: Train loss 1.222, Learning Rate 1.000e-05, It/sec 0.051, Tokens/sec 79.233, Trained Tokens 223549, Peak mem 200.919 GB
Iter 140: Train loss 1.293, Learning Rate 1.000e-05, It/sec 0.037, Tokens/sec 71.510, Trained Tokens 242630, Peak mem 200.919 GB
Iter 150: Train loss 1.369, Learning Rate 1.000e-05, It/sec 0.023, Tokens/sec 42.854, Trained Tokens 261454, Peak mem 200.919 GB
Iter 160: Train loss 1.227, Learning Rate 1.000e-05, It/sec 0.029, Tokens/sec 47.694, Trained Tokens 277891, Peak mem 200.919 GB
Iter 170: Train loss 1.209, Learning Rate 1.000e-05, It/sec 0.022, Tokens/sec 40.991, Trained Tokens 296917, Peak mem 200.919 GB
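
One way to see how close a run like this is to physical RAM is to log MLX's own memory counters every few iterations. Below is a minimal sketch, assuming the `mx.metal` memory queries available in recent MLX releases (newer versions expose the same queries directly under `mx`); the 192 GB figure is hard-coded for the M2 Ultra in this issue:

```python
import mlx.core as mx

PHYSICAL_RAM_GB = 192  # M2 Ultra in this issue; adjust for your machine


def log_memory(step: int) -> None:
    """Print MLX's active and peak memory and flag likely swapping."""
    active_gb = mx.metal.get_active_memory() / 1e9  # bytes -> GB
    peak_gb = mx.metal.get_peak_memory() / 1e9
    note = " (above physical RAM -> OS is likely swapping)" if peak_gb > PHYSICAL_RAM_GB else ""
    print(f"step {step}: active {active_gb:.1f} GB, peak {peak_gb:.1f} GB{note}")


# Example: call log_memory(it) every N iterations of the training loop,
# or from a training callback if your mlx_lm version exposes one.
```

Logging this alongside the trainer's own output makes it clear whether the 201 GB peak is a one-off spike or sustained pressure.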
awni commented 4 months ago

The OS will "swap", meaning some active memory gets stored on disk when a process uses more memory than the machine has. This can be very slow, so it's best to avoid it as much as possible. In your case, if it only happens once in a while, it's probably fine. If it happens regularly, your tokens/sec will likely start to get really slow.
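
If the swapping does become regular, the usual levers are a smaller batch size, fewer LoRA layers, a shorter max sequence length, or gradient checkpointing (exact flag names depend on the `mlx_lm` version; `python -m mlx_lm.lora --help` lists them). You can also ask MLX to hold on to less memory. Below is a minimal sketch, assuming the `mx.metal` limit APIs as they existed around the time of this issue (newer releases expose similar functions directly under `mx`, possibly without the `relaxed` argument):

```python
import mlx.core as mx

GB = 1024**3

# Keep the buffer cache small so freed memory is returned to the OS
# instead of being held for reuse (trades some speed for less pressure).
mx.metal.set_cache_limit(8 * GB)

# With relaxed=True, allocations above the limit still succeed;
# MLX just tries to stay under it, which can help avoid swap.
mx.metal.set_memory_limit(170 * GB, relaxed=True)
```

These limits don't change what the model fundamentally needs, so the surest fix is still bringing the true peak below the machine's 192 GB.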