instructlab / sdg

Python library for Synthetic Data Generation
https://pypi.org/project/instructlab-sdg/
Apache License 2.0

feat: expose max_num_tokens as configurable #340

Closed cdoern closed 2 weeks ago

cdoern commented 2 weeks ago

max_num_tokens is a nice way to run a shorter or longer SDG run. Locally I have been modifying the pipeline yaml from 2048 to 512, which ends up just generating less data. Exposing this to the CLI could allow power users to run different types of SDG runs!
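For context, the per-block setting being edited looks roughly like this in a pipeline yaml. The `gen_kwargs`/`max_tokens` names come from the discussion in this thread; the block name and surrounding structure are illustrative, not copied from the repo:

```yaml
blocks:
  - name: gen_knowledge        # illustrative block name
    type: LLMBlock
    config:
      gen_kwargs:
        max_tokens: 2048       # lowering this (e.g. to 512) yields a shorter run
```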

cdoern commented 2 weeks ago

The issue here was the following:

_gen_kwargs sets a default of 4096 for max_tokens unless one is explicitly set in the pipeline yaml for the block.

This means I had to set the top-level max_num_tokens default to 4096 rather than None. We then use that 4096 value if no other was specified in the block, and we also use it if the user didn't pass an override.
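The resolution order described above (explicit override first, then the block's yaml value, then the 4096 default) can be sketched as a small helper. This is a minimal illustration of the precedence logic, not the actual implementation in instructlab-sdg:

```python
DEFAULT_MAX_TOKENS = 4096  # default applied by _gen_kwargs per the thread

def resolve_max_tokens(block_max_tokens=None, override=None):
    """Pick the max_tokens value for one LLMBlock.

    Precedence: user-supplied override > per-block pipeline yaml value
    > library default of 4096.
    """
    if override is not None:
        return override
    if block_max_tokens is not None:
        return block_max_tokens
    return DEFAULT_MAX_TOKENS
```

With no arguments this returns 4096, matching the behavior the PR preserves when no override is passed.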

bbrowning commented 2 weeks ago

@cdoern I may not quite be following the reasoning behind this change. Today, we set a max number of tokens for each step in the pipeline yamls. This allows the user to override what we specified in the yaml? Or this allows them to specify a value that gets used for any pipeline steps that don't have a max tokens specified?

cdoern commented 2 weeks ago

@bbrowning it's more that this allows the user to override what we specified in the yaml.

@aakankshaduggal indicated that this is a good path forward for allowing CLI users to experiment with different wait times/types of results from SDG and that she has used this toggle locally to achieve different types of results.

max_tokens is already applied to each LLMBlock via gen_kwargs and is set to 4096; this change maintains that unless max_num_tokens is passed into generate_data.

bbrowning commented 2 weeks ago

So, my only concern here is that we have downstream pipelines that apply a different max tokens for every step in the pipeline, and the general API surface area of the pipeline config is that each step can specify its own max tokens. This gives a single max tokens for the entire pipeline if passed in, right? As opposed to letting me just tweak the step(s) in the pipeline where I actually want to change the token count?
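The concern above is that a single pipeline-wide override flattens per-step values that were tuned deliberately. A minimal sketch of that semantic difference, with hypothetical step names and values (not taken from any real pipeline):

```python
# Hypothetical per-step max_tokens values from a downstream pipeline yaml
steps = {"gen_questions": 512, "gen_responses": 4096}

def apply_override(steps, override=None):
    """A single pipeline-wide override replaces every step's value.

    With override=None the carefully tuned per-step values survive;
    with an override, all steps collapse to one number.
    """
    return {name: override if override is not None else value
            for name, value in steps.items()}
```

Calling `apply_override(steps, 1024)` sets both steps to 1024, losing the distinction between the short and long generation steps, which is exactly the trade-off being questioned.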

mergify[bot] commented 2 weeks ago

This pull request has merge conflicts that must be resolved before it can be merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork