ml-explore / mlx-examples

Examples in the MLX framework
MIT License

[Question] about creating the 'adapters.npz' file #812

Closed. Daniel-Lee closed this issue 3 weeks ago.

Daniel-Lee commented 4 weeks ago

I am currently following the fine-tuning example at https://heidloff.net/article/apple-mlx-fine-tuning/ with mlx_lm, but the 'adapters.npz' file is not created even after iter=600. I am not sure whether the tutorial I am referring to is incorrect. Could you provide some guidance?

Part of the fine-tuning code being referenced:

!python lora.py \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --train \
    --batch-size 1 \
    --lora-layers 4 \
    --data my-data-text

ls -la
-rw-r--r--  1 niklasheidloff  staff  1708214 May 12 13:12 adapters.npz

From the tutorial: "The fine-tuning produces an 'adapters.npz' file which can be converted into the safetensors format." <-- In my case, the adapters.npz file is not created!
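For reference, this is how I understood the adapter output path could be set explicitly in the older standalone lora.py example; this is only a sketch, and it assumes that script accepts an --adapter-file flag defaulting to adapters.npz, which I have not verified against the tutorial's version:

```
python lora.py \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --train \
    --batch-size 1 \
    --lora-layers 4 \
    --data my-data-text \
    --adapter-file adapters.npz   # assumed flag; where the old example writes its adapters
```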

awni commented 4 weeks ago

Can you show the log of the command you ran? It should save the adapters by default every 100 iterations to a file adapters.npz. Most likely either your command did not run for 100 iterations or you are looking in the wrong place.

Also, I'd encourage you to use MLX LM for a more fully featured fine-tuning package. Here is a complete guide.
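If it helps, here is a minimal sketch of that workflow; the data path is a placeholder, and the flags and the adapters/ default output directory reflect my understanding of recent mlx_lm.lora versions rather than the exact tutorial you followed:

```
# Fine-tune with LoRA; recent versions write adapters under ./adapters/ by default
python -m mlx_lm.lora \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --train \
    --data ./my-data \
    --iters 600

# Check what was written (adapters.safetensors plus periodic checkpoints)
ls -la adapters/
```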

Daniel-Lee commented 4 weeks ago

Below I share the fine-tuning command I ran and the output recorded during training.

Run: Jupyter Notebook with Python 3.11, macOS 14.5 (Mac Pro M1 Max, 64 GB)

*** Code:

!python -m mlx_lm.lora \
    --model "google/gemma-2b-it" \
    --train \
    --iters 600 \
    --data data \
    --steps-per-eval 100 \
    --max-seq-length 2400 \
    --learning-rate 2e-4 \
    --resume-adapter-file checkpoints/600_adapters.npz

(Due to the nature of the training data I had, --max-seq-length was set to 2.4k.)

*** Output:

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Loading pretrained model
Fetching 9 files: 100%|████████████████████████| 9/9 [00:00<00:00, 85792.58it/s]
Loading datasets
Training
Trainable parameters: 0.033% (0.819M/2506.172M)
Starting training..., iters: 600
Iter 1: Val loss 3.936, Val took 215.383s
Iter 10: Train loss 2.987, Learning Rate 2.000e-04, It/sec 0.028, Tokens/sec 186.170, Trained Tokens 65620, Peak mem 69.060 GB
Iter 20: Train loss 1.504, Learning Rate 2.000e-04, It/sec 0.023, Tokens/sec 178.319, Trained Tokens 142273, Peak mem 69.060 GB
Iter 30: Train loss 0.448, Learning Rate 2.000e-04, It/sec 0.022, Tokens/sec 172.839, Trained Tokens 220139, Peak mem 69.060 GB
Iter 40: Train loss 0.280, Learning Rate 2.000e-04, It/sec 0.022, Tokens/sec 173.244, Trained Tokens 297254, Peak mem 69.060 GB
Iter 50: Train loss 0.237, Learning Rate 2.000e-04, It/sec 0.022, Tokens/sec 172.502, Trained Tokens 374350, Peak mem 69.060 GB
Iter 60: Train loss 0.252, Learning Rate 2.000e-04, It/sec 0.025, Tokens/sec 175.211, Trained Tokens 445673, Peak mem 69.060 GB
Iter 70: Train loss 0.238, Learning Rate 2.000e-04, It/sec 0.025, Tokens/sec 176.555, Trained Tokens 517723, Peak mem 69.060 GB
Iter 80: Train loss 0.139, Learning Rate 2.000e-04, It/sec 0.021, Tokens/sec 176.730, Trained Tokens 600930, Peak mem 69.060 GB
Iter 90: Train loss 0.132, Learning Rate 2.000e-04, It/sec 0.020, Tokens/sec 169.215, Trained Tokens 683801, Peak mem 69.060 GB
Iter 100: Val loss 0.148, Val took 200.288s
Iter 100: Train loss 0.163, Learning Rate 2.000e-04, It/sec 0.218, Tokens/sec 1688.886, Trained Tokens 761390, Peak mem 69.060 GB
Iter 100: Saved adapter weights to adapters/adapters.safetensors and adapters/0000100_adapters.safetensors.
Iter 110: Train loss 0.174, Learning Rate 2.000e-04, It/sec 0.021, Tokens/sec 165.624, Trained Tokens 839339, Peak mem 69.060 GB
Iter 120: Train loss 0.143, Learning Rate 2.000e-04, It/sec 0.020, Tokens/sec 166.529, Trained Tokens 922932, Peak mem 69.060 GB
Iter 130: Train loss 0.214, Learning Rate 2.000e-04, It/sec 0.023, Tokens/sec 165.572, Trained Tokens 994731, Peak mem 69.060 GB
Iter 140: Train loss 0.162, Learning Rate 2.000e-04, It/sec 0.022, Tokens/sec 170.473, Trained Tokens 1071710, Peak mem 69.060 GB
Iter 150: Train loss 0.165, Learning Rate 2.000e-04, It/sec 0.020, Tokens/sec 159.002, Trained Tokens 1149892, Peak mem 69.060 GB
Iter 160: Train loss 0.160, Learning Rate 2.000e-04, It/sec 0.021, Tokens/sec 166.218, Trained Tokens 1228023, Peak mem 69.060 GB
Iter 170: Train loss 0.181, Learning Rate 2.000e-04, It/sec 0.024, Tokens/sec 172.606, Trained Tokens 1299179, Peak mem 69.060 GB
Iter 180: Train loss 0.191, Learning Rate 2.000e-04, It/sec 0.024, Tokens/sec 174.946, Trained Tokens 1370992, Peak mem 69.060 GB
Iter 190: Train loss 0.175, Learning Rate 2.000e-04, It/sec 0.025, Tokens/sec 177.959, Trained Tokens 1441732, Peak mem 69.060 GB
Iter 200: Val loss 0.123, Val took 214.330s
Iter 200: Train loss 0.145, Learning Rate 2.000e-04, It/sec 0.174, Tokens/sec 1354.205, Trained Tokens 1519404, Peak mem 69.060 GB
Iter 200: Saved adapter weights to adapters/adapters.safetensors and adapters/0000200_adapters.safetensors.
Iter 210: Train loss 0.171, Learning Rate 2.000e-04, It/sec 0.024, Tokens/sec 169.563, Trained Tokens 1590632, Peak mem 69.060 GB
Iter 220: Train loss 0.168, Learning Rate 2.000e-04, It/sec 0.023, Tokens/sec 166.420, Trained Tokens 1662813, Peak mem 69.060 GB
Iter 230: Train loss 0.147, Learning Rate 2.000e-04, It/sec 0.021, Tokens/sec 163.872, Trained Tokens 1741135, Peak mem 69.178 GB
Iter 240: Train loss 0.138, Learning Rate 2.000e-04, It/sec 0.022, Tokens/sec 168.465, Trained Tokens 1818984, Peak mem 69.178 GB
Iter 250: Train loss 0.152, Learning Rate 2.000e-04, It/sec 0.023, Tokens/sec 166.229, Trained Tokens 1890048, Peak mem 69.178 GB
Iter 260: Train loss 0.156, Learning Rate 2.000e-04, It/sec 0.025, Tokens/sec 179.677, Trained Tokens 1960832, Peak mem 69.178 GB
Iter 270: Train loss 0.156, Learning Rate 2.000e-04, It/sec 0.025, Tokens/sec 177.873, Trained Tokens 2031622, Peak mem 69.178 GB
Iter 280: Train loss 0.135, Learning Rate 2.000e-04, It/sec 0.022, Tokens/sec 166.591, Trained Tokens 2108826, Peak mem 69.178 GB
Iter 290: Train loss 0.104, Learning Rate 2.000e-04, It/sec 0.020, Tokens/sec 165.676, Trained Tokens 2192247, Peak mem 69.178 GB
Iter 300: Val loss 0.124, Val took 191.932s
Iter 300: Train loss 0.128, Learning Rate 2.000e-04, It/sec 0.218, Tokens/sec 1687.373, Trained Tokens 2269755, Peak mem 69.178 GB
Iter 300: Saved adapter weights to adapters/adapters.safetensors and adapters/0000300_adapters.safetensors.
Iter 310: Train loss 0.101, Learning Rate 2.000e-04, It/sec 0.019, Tokens/sec 161.071, Trained Tokens 2353171, Peak mem 69.178 GB
Iter 320: Train loss 0.135, Learning Rate 2.000e-04, It/sec 0.022, Tokens/sec 168.836, Trained Tokens 2431368, Peak mem 69.178 GB
Iter 330: Train loss 0.183, Learning Rate 2.000e-04, It/sec 0.025, Tokens/sec 169.026, Trained Tokens 2497720, Peak mem 69.178 GB
Iter 340: Train loss 0.137, Learning Rate 2.000e-04, It/sec 0.022, Tokens/sec 169.529, Trained Tokens 2575376, Peak mem 69.178 GB
Iter 350: Train loss 0.130, Learning Rate 2.000e-04, It/sec 0.022, Tokens/sec 170.451, Trained Tokens 2652845, Peak mem 69.178 GB
Iter 360: Train loss 0.132, Learning Rate 2.000e-04, It/sec 0.020, Tokens/sec 159.830, Trained Tokens 2730910, Peak mem 69.178 GB
Iter 370: Train loss 0.123, Learning Rate 2.000e-04, It/sec 0.021, Tokens/sec 165.865, Trained Tokens 2808359, Peak mem 69.178 GB
Iter 380: Train loss 0.131, Learning Rate 2.000e-04, It/sec 0.021, Tokens/sec 160.952, Trained Tokens 2886630, Peak mem 69.178 GB
Iter 390: Train loss 0.147, Learning Rate 2.000e-04, It/sec 0.025, Tokens/sec 177.275, Trained Tokens 2957472, Peak mem 69.178 GB
Iter 400: Val loss 0.107, Val took 203.717s
Iter 400: Train loss 0.125, Learning Rate 2.000e-04, It/sec 0.202, Tokens/sec 1573.826, Trained Tokens 3035261, Peak mem 69.178 GB
Iter 400: Saved adapter weights to adapters/adapters.safetensors and adapters/0000400_adapters.safetensors.
Iter 410: Train loss 0.095, Learning Rate 2.000e-04, It/sec 0.020, Tokens/sec 169.285, Trained Tokens 3118145, Peak mem 69.178 GB
Iter 420: Train loss 0.146, Learning Rate 2.000e-04, It/sec 0.024, Tokens/sec 172.076, Trained Tokens 3189678, Peak mem 69.178 GB
Iter 430: Train loss 0.154, Learning Rate 2.000e-04, It/sec 0.024, Tokens/sec 170.905, Trained Tokens 3261046, Peak mem 69.178 GB
Iter 440: Train loss 0.119, Learning Rate 2.000e-04, It/sec 0.023, Tokens/sec 177.243, Trained Tokens 3337667, Peak mem 69.178 GB
Iter 450: Train loss 0.151, Learning Rate 2.000e-04, It/sec 0.024, Tokens/sec 168.843, Trained Tokens 3409339, Peak mem 69.178 GB
Iter 460: Train loss 0.202, Learning Rate 2.000e-04, It/sec 0.034, Tokens/sec 202.335, Trained Tokens 3468201, Peak mem 69.178 GB
Iter 470: Train loss 0.118, Learning Rate 2.000e-04, It/sec 0.022, Tokens/sec 170.467, Trained Tokens 3545649, Peak mem 69.178 GB
Iter 480: Train loss 0.096, Learning Rate 2.000e-04, It/sec 0.020, Tokens/sec 165.786, Trained Tokens 3628853, Peak mem 69.178 GB
Iter 490: Train loss 0.148, Learning Rate 2.000e-04, It/sec 0.025, Tokens/sec 179.053, Trained Tokens 3700354, Peak mem 69.178 GB
Iter 500: Val loss 0.110, Val took 201.131s
Iter 500: Train loss 0.091, Learning Rate 2.000e-04, It/sec 0.189, Tokens/sec 1565.547, Trained Tokens 3783274, Peak mem 69.178 GB
Iter 500: Saved adapter weights to adapters/adapters.safetensors and adapters/0000500_adapters.safetensors.
Iter 510: Train loss 0.099, Learning Rate 2.000e-04, It/sec 0.019, Tokens/sec 156.952, Trained Tokens 3867084, Peak mem 69.178 GB
Iter 520: Train loss 0.095, Learning Rate 2.000e-04, It/sec 0.020, Tokens/sec 165.927, Trained Tokens 3950769, Peak mem 69.178 GB
Iter 530: Train loss 0.100, Learning Rate 2.000e-04, It/sec 0.019, Tokens/sec 163.905, Trained Tokens 4034831, Peak mem 69.178 GB
Iter 540: Train loss 0.095, Learning Rate 2.000e-04, It/sec 0.021, Tokens/sec 174.305, Trained Tokens 4118042, Peak mem 69.178 GB
Iter 550: Train loss 0.161, Learning Rate 2.000e-04, It/sec 0.027, Tokens/sec 176.647, Trained Tokens 4183819, Peak mem 69.178 GB
Iter 560: Train loss 0.148, Learning Rate 2.000e-04, It/sec 0.023, Tokens/sec 168.409, Trained Tokens 4255725, Peak mem 69.178 GB
Iter 570: Train loss 0.094, Learning Rate 2.000e-04, It/sec 0.020, Tokens/sec 167.652, Trained Tokens 4339301, Peak mem 69.178 GB
Iter 580: Train loss 0.138, Learning Rate 2.000e-04, It/sec 0.024, Tokens/sec 169.197, Trained Tokens 4410586, Peak mem 69.178 GB
Iter 590: Train loss 0.094, Learning Rate 2.000e-04, It/sec 0.020, Tokens/sec 167.730, Trained Tokens 4493852, Peak mem 69.178 GB
Iter 600: Val loss 0.102, Val took 215.875s
Iter 600: Train loss 0.126, Learning Rate 2.000e-04, It/sec 0.214, Tokens/sec 1653.713, Trained Tokens 4570994, Peak mem 69.178 GB
Iter 600: Saved adapter weights to adapters/adapters.safetensors and adapters/0000600_adapters.safetensors.
Saved final adapter weights to adapters/adapters.safetensors.

awni commented 3 weeks ago

Ah, I think the problem is it's saving adapters/adapters.safetensors, not adapters.npz. Maybe the tutorial you are referring to is somewhat out of date. But either way, you should have access to the adapters in adapters/adapters.safetensors.
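For completeness, here is a sketch of how those saved adapters can be used; it assumes the current mlx_lm CLI, where both generate and fuse accept an --adapter-path flag, and the prompt and save path are placeholders:

```
# Generate with the trained adapters applied on top of the base model
python -m mlx_lm.generate \
    --model google/gemma-2b-it \
    --adapter-path adapters \
    --prompt "..."

# Or fuse the adapters into the base weights and save a standalone model
python -m mlx_lm.fuse \
    --model google/gemma-2b-it \
    --adapter-path adapters \
    --save-path fused-gemma-2b-it
```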