instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0

Set a default `seed` value for gen_kwargs #169

Open · russellb opened 1 month ago

russellb commented 1 month ago

PR #137 set a seed in one case, but it turns out we could just set a default for all cases instead.

from @markmc

How would you feel about reverting https://github.com/instructlab/sdg/pull/137 and just adding seed=42 here:

        self.defaults = {
            "model": self.ctx.model_id,
            "temperature": 0,
            "max_tokens": 4096,
        }
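That is, the defaults would become something like this (a sketch; only the `seed` entry is new):

        self.defaults = {
            "model": self.ctx.model_id,
            "temperature": 0,
            "max_tokens": 4096,
            "seed": 42,  # proposed default; a no-op under greedy (temperature=0) sampling
        }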

question:

Would there be a downside to always specifying a seed, so the pipeline author never needs to think about it?

answer from @shivchander

We can default the seed to some specific value to make things simpler. It has no effect when the temperature is set to 0, so it shouldn't be an issue.
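To illustrate that point, here is a minimal sketch using an OpenAI-compatible client (the `client`, model name, and prompt are placeholders; whether `seed` is honored depends on the backend):

    # temperature=0 means greedy sampling: repeated calls return the same
    # completion, so the seed is effectively ignored
    greedy = client.completions.create(
        model="example-model", prompt="...", temperature=0, seed=42
    )

    # temperature>0 means stochastic sampling: the seed now matters, and two
    # calls with the same seed should return the same completion
    sampled = client.completions.create(
        model="example-model", prompt="...", temperature=1.0, seed=42
    )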

derekhiggins commented 1 month ago

Wouldn't this make subsequent runs of synthetic data generation deterministic (given the same input)? Is this the desired behaviour?

markmc commented 1 month ago

> Wouldn't this make subsequent runs of synthetic data generation deterministic (given the same input)? Is this the desired behaviour?

Great question, @derekhiggins !

Some additional context from @shivchander that came before what is quoted above:

> Because all other LLMBlocks have temperature set to 0
>
> When you set temperature=0, this is what we call greedy sampling: the language model generates the same response on repeated calls.
>
> But when we set a non-zero temperature, we introduce stochasticity (which we want for gen_contexts because we are asking the model to generate 10 responses and we want these to be unique, so we set the temperature to 1.0).
>
> We are using a seed so that we can reproduce the results in case we need to debug.

I think the above does indeed miss a problem with using a seed with temperature > 0 on backends that do not support batching:

With batching: given a seed, the server will generate a sequence of responses in a single call, and that sequence will be repeatable.

Without batching: given a seed, the server will generate a single, repeatable response to every call, meaning we would generate a sequence of identical samples. Instead, we need to derive a sequence of random seeds (one for each request) from the top-level seed!

In other words, something like this:

        if not self.server_supports_batched:
            # Derive a fresh seed per request from the top-level seed, so that
            # repeated sampling calls are distinct but the run is reproducible
            seedinator = random.Random(self.gen_kwargs.get("seed", 42))
            results = []
            for prompt in prompts:
                for _ in range(self.gen_kwargs.get("n", 1)):
                    if self.gen_kwargs.get("temperature", 0) > 0:
                        self.gen_kwargs["seed"] = seedinator.randint(0, 1000)
                    response = self.ctx.client.completions.create(
                        prompt=prompt, **self.gen_kwargs
                    )
                    results.append(response.choices[0].text.strip())
        return results
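For contrast, the batched path would need no per-request reseeding. A minimal sketch, assuming the server accepts a list of prompts in one call and reads `n` and `seed` from `gen_kwargs`:

        # Batched path: one request carries all prompts; given a fixed seed,
        # the server generates a repeatable sequence of distinct samples
        response = self.ctx.client.completions.create(
            prompt=prompts, **self.gen_kwargs
        )
        return [choice.text.strip() for choice in response.choices]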