instructlab / sdg

Python library for Synthetic Data Generation
https://pypi.org/project/instructlab-sdg/
Apache License 2.0

Support more than 3 qna per context chunk #232

Open markmc opened 3 months ago

markmc commented 3 months ago

From @mairin

In the current implementation of SDG, the knowledge YAML format requires context chunks, each linked to qna (question and answer) samples, in the file that SDG uses to generate data.

Currently, this requires exactly 3 qna samples per context chunk. No more, no less. If you provide fewer, I believe it won't work, and if you provide more, any qna beyond the first 3 will be ignored.

This should be configurable and more robust in future releases if possible.

In the schema, we have minItems=3

And all of the knowledge prompts in the simple, full, and agentic pipelines only handle icl_query_{1,2,3}
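To make the `minItems=3` point concrete, here is a minimal sketch of what a configurable minimum could look like as a plain validation step. It is not SDG's actual code: the function name `check_qna_counts`, the `min_qna` parameter, and the field names `context` and `questions_and_answers` are assumptions made for illustration.

```python
from typing import Any


def check_qna_counts(seed_examples: list[dict[str, Any]], min_qna: int = 3) -> list[str]:
    """Return one error message per context chunk with fewer than min_qna pairs.

    Hypothetical helper: the field name "questions_and_answers" is an assumption.
    """
    errors = []
    for index, example in enumerate(seed_examples):
        pairs = example.get("questions_and_answers", [])
        # Only a lower bound is enforced; extra pairs are accepted, not dropped.
        if len(pairs) < min_qna:
            errors.append(
                f"seed example {index}: expected at least {min_qna} qna pairs, "
                f"found {len(pairs)}"
            )
    return errors


# Made-up data: one chunk with four pairs, one with only two.
examples = [
    {
        "context": "chunk A",
        "questions_and_answers": [
            {"question": f"q{i}", "answer": f"a{i}"} for i in range(4)
        ],
    },
    {
        "context": "chunk B",
        "questions_and_answers": [
            {"question": f"q{i}", "answer": f"a{i}"} for i in range(2)
        ],
    },
]
errors = check_qna_counts(examples, min_qna=3)
```

With `min_qna` exposed as a parameter, the chunk with four pairs passes and only the chunk with two is reported, instead of the count being fixed at exactly 3.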

shivchander commented 3 months ago

Good feature to have in the future: convert the prompt templates we have into Jinja templates and build them dynamically based on the number of ICL examples.
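A rough sketch of the dynamic-building idea, using plain Python string formatting as a stand-in for Jinja (a Jinja `{% for %}` loop over the pairs would express the same thing). The function `build_icl_block`, the `icl_response` prefix, and the "Question N / Answer N" layout are assumptions for illustration, not the existing prompt templates.

```python
def build_icl_block(qna_pairs, query_prefix="icl_query", response_prefix="icl_response"):
    """Build numbered ICL fields and a prompt fragment for any number of qna pairs.

    Hypothetical helper: the response prefix and the text layout are assumptions.
    """
    fields = {}
    lines = []
    for n, pair in enumerate(qna_pairs, start=1):
        # Numbered keys extend the icl_query_{1,2,3} pattern to N entries.
        fields[f"{query_prefix}_{n}"] = pair["question"]
        fields[f"{response_prefix}_{n}"] = pair["answer"]
        lines.append(f"Question {n}: {pair['question']}")
        lines.append(f"Answer {n}: {pair['answer']}")
    return fields, "\n".join(lines)


# Made-up data: five pairs for a single context chunk.
pairs = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(1, 6)]
fields, prompt_fragment = build_icl_block(pairs)
```

Because the loop runs over however many pairs are supplied, a fourth and fifth qna end up in the prompt rather than being ignored.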