sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

[Feature] Per-request random seed #1335

Open laoconeth opened 2 months ago

laoconeth commented 2 months ago


Motivation

I believe there is an option for fixing the random seed for the backend, but as far as I can tell there is no feature for per-request random seeds.
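For illustration, per-request seeding could look like the following on the native `/generate` HTTP endpoint. The `seed` field inside `sampling_params` is hypothetical and does not exist today; it only sketches the requested feature:

```python
import requests

# Hypothetical illustration of the requested feature. The "seed" key inside
# sampling_params does NOT exist in SGLang today; the rest mirrors the
# existing /generate request format.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Write a one-line poem about the sea.",
        "sampling_params": {
            "temperature": 0.8,
            "max_new_tokens": 32,
            "seed": 1234,  # hypothetical per-request seed
        },
    },
)
print(response.json())
```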

Related resources

No response

merrymercy commented 2 months ago

Contributions are welcome; the related code is https://github.com/sgl-project/sglang/blob/dff2860a690757966e408b598a8f0b47a29a4713/python/sglang/srt/layers/sampler.py#L83-L85
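A rough sketch of the idea, assuming each request carries an optional seed down to the sampler (the per-row loop and the `seeds` argument are illustrative, not the current SGLang implementation):

```python
from typing import List, Optional

import torch


def sample_with_per_request_seeds(
    probs: torch.Tensor, seeds: List[Optional[int]]
) -> torch.Tensor:
    """Illustrative sketch only, not the actual SGLang sampler code.

    probs: [batch_size, vocab_size] softmax probabilities.
    seeds: one optional seed per request; None falls back to the global RNG.
    """
    token_ids = []
    for row, seed in zip(probs, seeds):
        generator = None
        if seed is not None:
            # A dedicated generator makes this request's draw independent of
            # the global RNG state and of the other requests in the batch.
            generator = torch.Generator(device=row.device)
            generator.manual_seed(seed)
        token_ids.append(torch.multinomial(row, num_samples=1, generator=generator))
    return torch.cat(token_ids)
```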

aflah02 commented 2 months ago

Hi @merrymercy, I just spotted this issue while looking for something else. Is there a way to set a universal seed to make things reproducible, i.e., the same requests producing the same output? I am not concerned with a per-request seed, just something you set once. I couldn't find it.

merrymercy commented 1 month ago

@aflah02 This is the global random seed. https://github.com/sgl-project/sglang/blob/39bb49d156f2319d2aec67c458c2db980bb0f4c3/python/sglang/srt/server_args.py#L362

However, depending on your workload (e.g., batch size), determinism is still very hard to achieve.
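For reference, a minimal sketch of setting that global seed when embedding the server via the Python Runtime; this assumes Runtime forwards keyword args to the server args (the CLI counterpart is the `--random-seed` flag of `sglang.launch_server`):

```python
import sglang as sgl

# Sketch: set the global random seed when embedding the server in Python.
# Assumes sgl.Runtime forwards keyword args (such as random_seed) to the
# server args; the model path is only a placeholder.
runtime = sgl.Runtime(
    model_path="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    random_seed=42,
)
# ... run your requests against it ...
runtime.shutdown()
```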

FredericOdermatt commented 1 month ago

I run inference with run_batch for an SglSelect choice on different datasets and have taken the following steps to make the generation deterministic, but have not been successful yet:

I know that many of these optimization-disabling flags won't be necessary in the end; I wanted to find a fully deterministic setting first and then re-enable options one by one.

Server side:

```bash
python3 -m sglang.launch_server \
  --model-path mistralai/Mixtral-8x22B-Instruct-v0.1 \
  --attention-backend triton \
  --sampling-backend pytorch \
  --disable-radix-cache \
  --disable-regex-jump-forward \
  --disable-cuda-graph \
  --disable-cuda-graph-padding \
  --disable-disk-cache \
  --disable-custom-all-reduce \
  --disable-mla \
  --random-seed 42 \
  --tp-size 8 \
  --dp-size 1 \
  --host 0.0.0.0 \
  --port 30001
```
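Client side, the setup looks roughly like this (a sketch; the prompt and choices are placeholders, the real datasets are larger):

```python
import sglang as sgl


# Sketch of the client side; the prompt and choices are placeholders.
@sgl.function
def judge(s, question):
    s += "Question: " + question + "\nAnswer: "
    s += sgl.select("verdict", choices=["yes", "no"])


sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30001"))

states = judge.run_batch(
    [{"question": q} for q in ["Is the sky blue?", "Is fire cold?"]],
    progress_bar=True,
)
print([s["verdict"] for s in states])
```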

Any idea what could still be missing, @merrymercy? If we figure it out, I can work on a PR.

merrymercy commented 1 month ago

Can you try --max-running-request 1?

FredericOdermatt commented 1 month ago

Thanks, adding --max-running-request 1 made it reproducible.

Consecutive runs of run_batch on hundreds of sample datapoints return the same logits per sample, both across multiple run_batch commands and, of course, across server restarts. Great!

One thing I've tried is re-adding the radix cache; however, when removing the --disable-radix-cache flag, the logits are no longer reproducible. I haven't looked at removing the other flags yet.

The current setup of course comes with a considerable slowdown, but for "offline" use cases such as synthetic data creation or LLM-as-a-judge (in various structured generation / CoT setups), slower generation/evaluation would be acceptable and total reproducibility highly valued.
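For completeness, a minimal sketch of the consecutive-run check described above, reusing the placeholder `judge` program from the earlier snippet (`questions` stands in for the actual dataset); here I only compare the selected choices, not the raw logits:

```python
# Reproducibility check sketch: run the same batch twice against the server
# launched with the flags above and compare the selected choices.
# `judge` is the placeholder program from the earlier sketch; `questions`
# stands in for the actual dataset.
questions = ["Is the sky blue?", "Is fire cold?"]
arguments = [{"question": q} for q in questions]

first = judge.run_batch(arguments)
second = judge.run_batch(arguments)

mismatches = sum(s1["verdict"] != s2["verdict"] for s1, s2 in zip(first, second))
print(f"{mismatches} mismatches out of {len(arguments)} samples")
```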

FredericOdermatt commented 1 month ago

Quick observation: with the setup described above, the output is currently seed-invariant.

I.e., there is no sampling involved anymore: independent of the random_seed value, the output is the same.