instructlab / instructlab

InstructLab Command-Line Interface. Use this to chat with a model and execute the InstructLab workflow to train a model using custom taxonomy data.
https://instructlab.ai
Apache License 2.0

Not able to train the model on Mac M1 because of Adapter file does not exist #2110

Open ahmed-azraq opened 3 months ago

ahmed-azraq commented 3 months ago

Describe the bug

I ran ilab model train, but ilab model test failed with:

NOTE: Adapter file does not exist. Testing behavior before training only. - /Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/adapters.npz
'/Users/ahmedazraq/Library/Application Support/instructlab/internal/test.jsonl' not such file or directory. Did you run 'ilab model train'?

To Reproduce

Steps to reproduce the behavior:

  1. ilab data generate
  2. ilab model train
  3. ilab model test

Expected behavior

ilab model test should run against the newly trained adapter without errors.

Screenshots

Full logs are below

ilab data generate
INFO 2024-08-20 12:32:23,398 numexpr.utils:161: NumExpr defaulting to 10 threads.
INFO 2024-08-20 12:32:26,987 datasets:59: PyTorch version 2.3.1 available.
INFO 2024-08-20 12:32:32,240 instructlab.model.backends.llama_cpp:104: Trying to connect to model server at http://127.0.0.1:8000/v1
WARNING 2024-08-20 12:32:51,457 instructlab.data.generate:291: Disabling SDG batching - unsupported with llama.cpp serving
Generating synthetic data using 'simple' pipeline, '/Users/ahmedazraq/Library/Caches/instructlab/models/merlinite-7b-lab-Q4_K_M.gguf' model, '/Users/ahmedazraq/Documents/watsonx/instructlabv3/taxonomy' taxonomy, against http://127.0.0.1:59807/v1 server
ERROR 2024-08-20 12:32:51,634 instructlab.sdg.utils.taxonomy:218: Version 1 is not supported for knowledge taxonomy. Minimum supported version is 3.
1 taxonomy files with errors! Exiting.
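The first ilab data generate run above fails because the knowledge qna.yaml still declares schema version 1, while the CLI requires at least version 3 for knowledge taxonomies. A minimal sketch of the fix (the keys besides version are illustrative placeholders, not the full schema) is to bump the version field in the affected taxonomy file:

```yaml
# In the failing knowledge taxonomy file (e.g. knowledge/.../qna.yaml).
# Only "version" is the point here; the other keys are placeholders.
version: 3        # minimum supported schema version for knowledge taxonomies
domain: history
seed_examples:
  # ... existing examples unchanged ...
```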
(venv) ahmedazraq@Ahmeds-MBP-2 taxonomy % ls
CODE_OF_CONDUCT.md  CONTRIBUTOR_ROLES.md    MAINTAINERS.md      README.md       compositional_skills    foundational_skills knowledge
CONTRIBUTING.md     LICENSE         Makefile        SECURITY.md     docs            governance.md       scripts
(venv) ahmedazraq@Ahmeds-MBP-2 taxonomy % ilab data generate
INFO 2024-08-20 12:35:54,755 numexpr.utils:161: NumExpr defaulting to 10 threads.
INFO 2024-08-20 12:35:55,027 datasets:59: PyTorch version 2.3.1 available.
INFO 2024-08-20 12:35:55,766 instructlab.model.backends.llama_cpp:104: Trying to connect to model server at http://127.0.0.1:8000/v1
WARNING 2024-08-20 12:36:03,900 instructlab.data.generate:291: Disabling SDG batching - unsupported with llama.cpp serving
Generating synthetic data using 'simple' pipeline, '/Users/ahmedazraq/Library/Caches/instructlab/models/merlinite-7b-lab-Q4_K_M.gguf' model, '/Users/ahmedazraq/Documents/watsonx/instructlabv3/taxonomy' taxonomy, against http://127.0.0.1:59867/v1 server
INFO 2024-08-20 12:36:05,837 instructlab.sdg:375: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-08-20 12:36:05,846 instructlab.sdg.pipeline:153: Running pipeline single-threaded
INFO 2024-08-20 12:36:08,342 instructlab.sdg.llmblock:51: LLM server supports batched inputs: False
INFO 2024-08-20 12:36:08,342 instructlab.sdg.pipeline:197: Running block: gen_knowledge
INFO 2024-08-20 12:36:08,342 instructlab.sdg.pipeline:198: Dataset({
    features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3'],
    num_rows: 10
})
INFO 2024-08-20 12:42:51,163 instructlab.sdg:411: Generated 1 samples
INFO 2024-08-20 12:42:51,163 instructlab.sdg.pipeline:153: Running pipeline single-threaded
INFO 2024-08-20 12:42:51,165 instructlab.sdg.pipeline:197: Running block: gen_mmlu_knowledge
INFO 2024-08-20 12:42:51,165 instructlab.sdg.pipeline:198: Dataset({
    features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3'],
    num_rows: 10
})
INFO 2024-08-20 12:43:17,121 instructlab.sdg.eval_data:126: Saving MMLU Dataset /Users/ahmedazraq/Library/Application Support/instructlab/datasets/node_datasets_2024-08-20T12_36_05/mmlubench_knowledge_history_biography_egypt_hikmat_abu_zayd.jsonl
Creating json from Arrow format: 0ba [00:00, ?ba/s]
INFO 2024-08-20 12:43:17,125 instructlab.sdg.eval_data:130: Saving MMLU Task yaml /Users/ahmedazraq/Library/Application Support/instructlab/datasets/node_datasets_2024-08-20T12_36_05/knowledge_history_biography_egypt_hikmat_abu_zayd_task.yaml
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 282/282 [00:00<00:00, 14503.20 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 282/282 [00:00<00:00, 29093.44 examples/s]
Creating json from Arrow format: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 83.31ba/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 282/282 [00:00<00:00, 4795.43 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 282/282 [00:00<00:00, 11238.69 examples/s]
Creating json from Arrow format: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 47.53ba/s]
INFO 2024-08-20 12:43:17,302 instructlab.sdg.datamixing:123: Loading dataset from /Users/ahmedazraq/Library/Application Support/instructlab/datasets/node_datasets_2024-08-20T12_36_05/knowledge_history_biography_egypt_hikmat_abu_zayd_p07.jsonl ...
Generating train split: 282 examples [00:00, 40483.07 examples/s]
INFO 2024-08-20 12:43:23,939 instructlab.sdg.datamixing:125: Dataset columns: ['messages', 'metadata', 'id']
INFO 2024-08-20 12:43:23,939 instructlab.sdg.datamixing:126: Dataset loaded with 282 samples
Map (num_proc=8): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 282/282 [00:00<00:00, 2205.61 examples/s]
Map (num_proc=8): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 282/282 [00:00<00:00, 3101.38 examples/s]
Creating json from Arrow format: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 62.03ba/s]
INFO 2024-08-20 12:43:24,261 instructlab.sdg.datamixing:200: Mixed Dataset saved to /Users/ahmedazraq/Library/Application Support/instructlab/datasets/knowledge_train_msgs_2024-08-20T12_36_05.jsonl
INFO 2024-08-20 12:43:24,262 instructlab.sdg.datamixing:123: Loading dataset from /Users/ahmedazraq/Library/Application Support/instructlab/datasets/node_datasets_2024-08-20T12_36_05/knowledge_history_biography_egypt_hikmat_abu_zayd_p10.jsonl ...
Generating train split: 282 examples [00:00, 22838.26 examples/s]
INFO 2024-08-20 12:43:24,981 instructlab.sdg.datamixing:125: Dataset columns: ['messages', 'metadata', 'id']
INFO 2024-08-20 12:43:24,981 instructlab.sdg.datamixing:126: Dataset loaded with 282 samples
Map (num_proc=8): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 282/282 [00:00<00:00, 2968.46 examples/s]
Map (num_proc=8): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 282/282 [00:00<00:00, 3046.94 examples/s]
Creating json from Arrow format: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 41.55ba/s]
INFO 2024-08-20 12:43:25,274 instructlab.sdg.datamixing:200: Mixed Dataset saved to /Users/ahmedazraq/Library/Application Support/instructlab/datasets/skills_train_msgs_2024-08-20T12_36_05.jsonl
INFO 2024-08-20 12:43:25,274 instructlab.sdg:438: Generation took 441.37s

% ilab model train
[INFO] Loading
Fetching 11 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 5780.17it/s]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
dtype=mlx.core.float16
[INFO] Quantizing
Using model_type='llama'
Loading pretrained model
Using model_type='llama'
Total parameters 1165.829M
Trainable parameters 2.097M
Loading datasets
*********

ᕙ(•̀‸•́‶)ᕗ  Training has started! ᕙ(•̀‸•́‶)ᕗ 

*********
Epoch 1: Iter 1: Val loss 3.989, Val took 21.893s
Iter 010: Train loss 3.166, It/sec 0.353, Tokens/sec 165.830
Epoch 1: Iter 10: Val loss 2.442, Val took 21.691s
Iter 10: Saved adapter weights to /Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-010.npz.
Iter 020: Train loss 1.699, It/sec 0.293, Tokens/sec 144.834
Epoch 1: Iter 20: Val loss 1.263, Val took 21.712s
Iter 20: Saved adapter weights to /Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-020.npz.
Iter 030: Train loss 0.954, It/sec 0.328, Tokens/sec 151.129
Epoch 1: Iter 30: Val loss 0.938, Val took 21.726s
Iter 30: Saved adapter weights to /Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-030.npz.
Iter 040: Train loss 0.612, It/sec 0.408, Tokens/sec 170.164
Epoch 1: Iter 40: Val loss 0.776, Val took 21.678s
Iter 40: Saved adapter weights to /Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-040.npz.
Iter 050: Train loss 0.577, It/sec 0.352, Tokens/sec 153.652
Epoch 1: Iter 50: Val loss 0.705, Val took 21.779s
Iter 50: Saved adapter weights to /Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-050.npz.
Iter 060: Train loss 0.520, It/sec 0.356, Tokens/sec 154.561
Epoch 2: Iter 60: Val loss 0.654, Val took 22.002s
Iter 60: Saved adapter weights to /Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-060.npz.
Iter 070: Train loss 0.414, It/sec 0.346, Tokens/sec 149.091
Epoch 2: Iter 70: Val loss 0.610, Val took 22.210s
Iter 70: Saved adapter weights to /Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-070.npz.
Iter 080: Train loss 0.535, It/sec 0.235, Tokens/sec 116.641
Epoch 2: Iter 80: Val loss 0.583, Val took 22.060s
Iter 80: Saved adapter weights to /Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-080.npz.
Iter 090: Train loss 0.370, It/sec 0.362, Tokens/sec 164.072
Epoch 2: Iter 90: Val loss 0.573, Val took 21.792s
Iter 90: Saved adapter weights to /Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-090.npz.
Iter 100: Train loss 0.422, It/sec 0.335, Tokens/sec 158.425
Epoch 2: Iter 100: Val loss 0.547, Val took 22.283s
Iter 100: Saved adapter weights to /Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-100.npz.
(venv) ahmedazraq@Ahmeds-MBP-2 taxonomy % ilab model test 
NOTE: Adapter file does not exist. Testing behavior before training only. - /Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/adapters.npz
'/Users/ahmedazraq/Library/Application Support/instructlab/internal/test.jsonl' not such file or directory. Did you run 'ilab model train'?

Device Info: Mac M1 (Apple Silicon), per the issue title.

Additional context

It might be related to #1883, but all I want for now is a workaround so that I can proceed.

ahmed-azraq commented 3 months ago

@jaideepr97 and @cdoern have been very supportive in a long Slack discussion and provided this workaround, which works fine:

ilab model test --adapter-file "/Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-100.npz" --data-dir "/Users/ahmedazraq/Library/Application Support/instructlab/datasets" --model-dir "/Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q"

ilab model convert --adapter-file "/Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-100.npz" --model-dir "/Users/ahmedazraq/Library/Application Support/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q"
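If you'd rather not hard-code the adapters-100.npz filename in the workaround, the most recent checkpoint can be picked automatically. This is a sketch, not part of ilab itself: it simulates the checkpoint directory in a temp dir so the selection logic runs standalone, and the final ilab invocation is only echoed rather than executed.

```shell
# Simulate the macOS checkpoint layout seen in the logs above
# (in real use, point CKPT_DIR at the actual checkpoints directory).
CKPT_DIR="$(mktemp -d)/instructlab-granite-7b-lab-mlx-q"
mkdir -p "$CKPT_DIR"
touch "$CKPT_DIR"/adapters-010.npz "$CKPT_DIR"/adapters-090.npz "$CKPT_DIR"/adapters-100.npz

# sort -V orders the NNN suffixes numerically, so the last entry
# is the latest saved adapter checkpoint.
LATEST_ADAPTER=$(ls "$CKPT_DIR"/adapters-*.npz | sort -V | tail -n 1)

# Echo the command you would actually run (hypothetical paths):
echo "would run: ilab model test --adapter-file \"$LATEST_ADAPTER\" --model-dir \"$CKPT_DIR\""
```

The same $LATEST_ADAPTER variable can be passed to ilab model convert, so both workaround commands stay in sync with whatever iteration training stopped at.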

They mentioned that in an upcoming CLI release they plan to submit a PR adjusting the default behavior so this works without the extra arguments. I will keep this issue open in case you'd like to submit a PR for it; feel free to close it if you prefer, since the workaround works fine.

Thanks a lot @jaideepr97 and @cdoern for your support, guidance, patience, and for being so responsive. This is highly appreciated! 🙇‍♂️ 🙏