instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0

No feedback from ilab data generate #259

Open jjasghar opened 1 month ago

jjasghar commented 1 month ago

Describe the bug: When I run ilab data generate, there is no progress update or output like there was in 0.17.1.

(venv-instructlab-3.11) ➜  instructlab ilab data generate
INFO 2024-08-08 16:00:04,437 numexpr.utils:161: NumExpr defaulting to 16 threads.
INFO 2024-08-08 16:00:06,691 datasets:58: PyTorch version 2.3.1 available.
INFO 2024-08-08 16:00:09,439 instructlab.model.backends.llama_cpp:103: Trying to connect to model server at http://127.0.0.1:8000/v1
WARNING 2024-08-08 16:00:20,922 instructlab.data.generate:287: Disabling SDG batching - unsupported with llama.cpp serving
Generating synthetic data using 'simple' pipeline, '/Users/jjasghar/Library/Caches/instructlab/models/merlinite-7b-lab-Q4_K_M.gguf' model, '/Users/jjasghar/Library/Application Support/instructlab/taxonomy' taxonomy, against http://127.0.0.1:64282/v1 server
INFO 2024-08-08 16:00:21,711 instructlab.sdg:375: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-08-08 16:00:21,718 instructlab.sdg.pipeline:153: Running pipeline single-threaded
INFO 2024-08-08 16:00:24,444 instructlab.sdg.llmblock:51: LLM server supports batched inputs: False
INFO 2024-08-08 16:00:24,444 instructlab.sdg.pipeline:197: Running block: gen_knowledge
INFO 2024-08-08 16:00:24,444 instructlab.sdg.pipeline:198: Dataset({
    features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3'],
    num_rows: 25
})

This is after 20+ minutes on a Mac M3. Activity Monitor says "Python" is running, but I don't see any further output.

jjasghar commented 1 month ago

Ah, it seems that after 26 minutes this appeared:

INFO 2024-08-08 16:26:38,293 instructlab.sdg:411: Generated 1 samples
INFO 2024-08-08 16:26:38,293 instructlab.sdg.pipeline:153: Running pipeline single-threaded
INFO 2024-08-08 16:26:38,294 instructlab.sdg.pipeline:197: Running block: gen_mmlu_knowledge
INFO 2024-08-08 16:26:38,294 instructlab.sdg.pipeline:198: Dataset({
    features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3'],
    num_rows: 25
})

There needs to be some feedback saying it's running, so people don't Ctrl-C out of it thinking it's broken when it first starts.
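One way to get that kind of "still running" feedback, sketched here with only the Python standard library, is a background heartbeat that logs a line at a fixed interval while a slow call is in progress. This is an illustration, not how ilab or the SDG library works: the `heartbeat` helper, the logger name, and the interval are all hypothetical.

```python
import logging
import threading
import time
from contextlib import contextmanager

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("heartbeat_sketch")


@contextmanager
def heartbeat(label, interval=30.0):
    """Log a 'still running' line every `interval` seconds until the block exits."""
    stop = threading.Event()
    start = time.monotonic()

    def _beat():
        # Event.wait returns False on timeout and True once `stop` is set,
        # so the loop ends promptly when the wrapped block finishes.
        while not stop.wait(interval):
            elapsed = time.monotonic() - start
            logger.info("%s still running (%.0fs elapsed)", label, elapsed)

    worker = threading.Thread(target=_beat, daemon=True)
    worker.start()
    try:
        yield
    finally:
        stop.set()
        worker.join()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format="%(levelname)s %(asctime)s %(name)s: %(message)s")
    with heartbeat("gen_knowledge", interval=0.5):
        time.sleep(1.2)  # stand-in for a slow, silent pipeline block
```

Because the heartbeat runs in its own thread, it keeps reporting even when the wrapped call blocks for many minutes without returning control.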

jjasghar commented 1 month ago

Could this actually be using only my CPU, per instructlab/instructlab#2028, and not my GPU at all? Even though I have run:

pip cache remove llama_cpp_python
pip install --force-reinstall llama_cpp_python==0.2.75 -C cmake.args="-DLLAMA_METAL=on"

to make sure my llama_cpp_python has Apple Metal enabled?

bjhargrave commented 1 month ago

@jjasghar, you can use asitop (brew install asitop) to confirm your GPU usage on your Mac.

nathan-weinberg commented 1 month ago

Train profiles don't have any bearing on SDG; they are separate components.

jjasghar commented 1 month ago

Well, it did finish, and per @bjhargrave's suggestion, it looks like my GPU is being used.

INFO 2024-08-08 20:02:00,978 instructlab.sdg:438: Generation took 12617.11s

I would like to say it took more than 3 hours, and some "yep, I'm running" feedback would have been nice to have.
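For reference, the durations in the logs above can be worked out with a few lines of standard-library Python. The helper names below are made up for this sketch; the timestamps and the 12617.11s figure are copied from the log lines in this thread.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the log lines above, e.g. "2024-08-08 16:00:24,444".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def format_duration(seconds):
    """Render a number of seconds as H:MM:SS, rounded to the nearest second."""
    return str(timedelta(seconds=round(seconds)))


def seconds_between(start, end):
    """Seconds elapsed between two log timestamps."""
    delta = datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)
    return delta.total_seconds()


# Total run time reported by "Generation took 12617.11s".
print(format_duration(12617.11))  # 3:30:17

# Silent gap between "Running block: gen_knowledge" and "Generated 1 samples".
gap = seconds_between("2024-08-08 16:00:24,444", "2024-08-08 16:26:38,293")
print(format_duration(gap))  # 0:26:14
```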

nathan-weinberg commented 1 month ago

Something like a progress bar would be a good indicator - not sure if that change would be for the CLI or SDG lib (I assume the former)
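As a rough idea of what such an indicator could look like, here is a minimal standard-library sketch that wraps any iterable and logs a count, the elapsed time, and a naive remaining-time estimate. It is not taken from the CLI or the SDG library; `log_progress`, its parameters, and the logger name are hypothetical, and the 25 rows simply mirror the `num_rows: 25` shown in the logs above.

```python
import logging
import time

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("progress_sketch")


def log_progress(items, total, label, every=1):
    """Yield items unchanged, logging progress after every `every` items."""
    start = time.monotonic()
    for done, item in enumerate(items, start=1):
        yield item
        # Execution resumes here once the consumer has finished with `item`.
        if done % every == 0 or done == total:
            elapsed = time.monotonic() - start
            # Naive estimate: assume the remaining items take the average time so far.
            remaining = elapsed / done * (total - done)
            logger.info("%s: %d/%d rows (%.0fs elapsed, ~%.0fs remaining)",
                        label, done, total, elapsed, remaining)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
    rows = range(25)
    for row in log_progress(rows, total=len(rows), label="gen_knowledge", every=5):
        time.sleep(0.01)  # stand-in for one slow generation request
```

Plain log lines are used here rather than an in-place bar so the output stays readable when it is redirected to a file; a library such as tqdm would be the usual choice for an interactive terminal bar.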

bjhargrave commented 2 weeks ago

Using ilab -v data generate can provide more output from DEBUG logging just to let you know things are happening.