Data has been generated with temperature 1 for all datasets using a select set of prompts. Generation is relatively fast and can be extended to other prompts as well. The hyperparameters and prompts still need closer examination.
Hacky fix to min tokens
A hacky fix has been introduced to work around the lack of a min-tokens option:
The n parameter of the SamplingParams object controls how many outputs are generated per prompt. For every row whose generation fails to meet the minimum token length, n is incremented by 1 and generation is repeated, until all rows meet the minimum token length or the cap of n=30 is reached.
This solution was chosen instead of continuously changing seeds: the seed has to be set when the model is initialised, so varying it would require re-initialising the model many times. (An alternative would be to generate a fixed list of numbers to act as seeds, which would also ensure reproducibility.)
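The retry loop described above can be sketched roughly as follows. The helper name `generate_with_min_tokens` and the `generate_fn` callback are hypothetical, not from the actual codebase; with vLLM, `generate_fn` would wrap something like `llm.generate(prompt, SamplingParams(n=n, temperature=1.0))`, but the loop itself is model-agnostic:

```python
def generate_with_min_tokens(generate_fn, prompts, min_tokens, max_n=30):
    """For each prompt, raise n until one output meets the min token length.

    generate_fn(prompt, n) -> list of n token-id sequences (a hypothetical
    wrapper around e.g. vLLM's llm.generate with SamplingParams(n=n)).
    Returns the first sufficiently long output per prompt, or the longest
    candidate found if the n=max_n cap is reached without success.
    """
    results = {}
    for prompt in prompts:
        n = 1
        best = []
        while n <= max_n:
            outputs = generate_fn(prompt, n)
            # keep the longest generation seen so far for this prompt
            best = max(outputs + [best], key=len)
            if len(best) >= min_tokens:
                break
            n += 1  # extend n by 1 and regenerate, as described above
        results[prompt] = best
    return results
```

Note that regenerating with a larger n repeats all n samples per prompt, so the cost grows with each retry; in practice this is cheap because most rows succeed at small n.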
Generating data with vLLM