Open laoconeth opened 2 months ago
Contributions are welcome. The related code is https://github.com/sgl-project/sglang/blob/dff2860a690757966e408b598a8f0b47a29a4713/python/sglang/srt/layers/sampler.py#L83-L85
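For orientation, here is a minimal sketch of how a per-request seed could be threaded into a batched PyTorch sampling step using per-request torch.Generator objects. The function name and signature are illustrative assumptions, not sglang's actual sampler code.

```python
# Minimal sketch (not sglang's actual sampler code): threading a per-request
# seed into a batched PyTorch sampling step. Requests that carry a seed get
# their own torch.Generator; unseeded requests fall back to the global RNG.
from typing import List, Optional

import torch


def sample_with_per_request_seeds(
    probs: torch.Tensor,          # [batch_size, vocab_size], already softmaxed
    seeds: List[Optional[int]],   # one entry per request, None = no seed
) -> torch.Tensor:
    next_tokens = torch.empty(probs.size(0), dtype=torch.long, device=probs.device)
    for i, seed in enumerate(seeds):
        generator = None
        if seed is not None:
            generator = torch.Generator(device=probs.device)
            generator.manual_seed(seed)
        # torch.multinomial accepts an optional generator for reproducible draws
        next_tokens[i] = torch.multinomial(probs[i], num_samples=1, generator=generator)
    return next_tokens
```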
Hi @merrymercy Just spotted this issue when I was looking for another thing. Is there a way to set a universal seed to make things reproducible? Same requests getting same output? I am not concerned with a per-request seed just something you set once. I couldn't find it
@aflah02 This is the global random seed. https://github.com/sgl-project/sglang/blob/39bb49d156f2319d2aec67c458c2db980bb0f4c3/python/sglang/srt/server_args.py#L362
However, depending on your workloads (e.g., batch size), determinism is still very hard to achieve.
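For reference, a global seed helper of this kind typically seeds the Python, NumPy, and PyTorch RNGs, roughly as in the sketch below (a generic sketch, not necessarily sglang's exact implementation). Note that seeding alone does not remove non-determinism introduced by dynamic batching or non-deterministic CUDA kernels.

```python
# Generic sketch of a global seed helper (not necessarily sglang's exact code):
# seeds the Python, NumPy, and PyTorch (CPU + CUDA) RNGs.
import random

import numpy as np
import torch


def set_random_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```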
I run inference with run_batch for an SglSelect choice on different datasets and have taken the following steps to make the generation deterministic, but have not been successful yet (a client-side sketch of this kind of workload follows the server command below):
--random-seed
set_random_seed(self.random_seed) at the beginning of https://github.com/sgl-project/sglang/blob/100f5b8bc976773b595923665715eb13d3bfcab6/python/sglang/srt/managers/scheduler.py#L358
-e NCCL_SOCKET_IFNAME=eth
-e CUDA_DEVICE_ORDER=PCI_BUS_ID
--attention-backend triton
--sampling-backend pytorch (same as setting --disable-flashinfer-sampling)
--disable-radix-cache
--disable-regex-jump-forward
--disable-cuda-graph
--disable-cuda-graph-padding
--disable-disk-cache
--disable-custom-all-reduce
--disable-mla
I know that many of these optimization-disabling flags won't be necessary in the end; I wanted to reach a fully deterministic setup first before re-enabling them one by one.
Server side:
python3 -m sglang.launch_server --model-path mistralai/Mixtral-8x22B-Instruct-v0.1 --attention-backend triton --sampling-backend pytorch --disable-radix-cache --disable-regex-jump-forward --disable-cuda-graph --disable-cuda-graph-padding --disable-disk-cache --disable-custom-all-reduce --disable-mla --random-seed 42 --tp-size 8 --dp-size 1 --host 0.0.0.0 --port 30001
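Client side, the workload looks roughly like this hedged sketch (the prompt, choice labels, and program name are illustrative assumptions, not the exact code from this setup):

```python
# Hedged sketch of a run_batch + select workload against the server above.
# The prompt and choices are illustrative placeholders.
import sglang as sgl


@sgl.function
def judge(s, question):
    s += "Question: " + question + "\n"
    s += "Answer: " + sgl.select("answer", choices=["yes", "no"])


sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30001"))

batch = [{"question": q} for q in ["Is the sky blue?", "Is 2 + 2 = 5?"]]
states = judge.run_batch(batch, progress_bar=True)
for state in states:
    print(state["answer"])
```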
Any idea what could still be missing @merrymercy ? If we figure it out, I can work on a PR
Can you try --max-running-request 1?
Thanks, adding --max-running-request 1 made it reproducible.
Running consecutive run_batch calls on hundreds of sample datapoints returns the same logits per sample, both across multiple run_batch invocations and, of course, across server restarts. Great!
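For concreteness, the reproducibility check looks roughly like the sketch below (reusing the illustrative judge program and batch from the earlier sketch; the exact keys inside the meta info depend on the sglang version):

```python
# Hedged sketch of a reproducibility check: run the same batch twice and
# compare both the selected answers and the per-choice scores exposed via
# each state's meta info. Reuses the illustrative `judge` and `batch` above.
first = judge.run_batch(batch)
second = judge.run_batch(batch)
for a, b in zip(first, second):
    assert a["answer"] == b["answer"], "selected choice differs between runs"
    # meta info carries the scores used by select; exact reproducibility means
    # these match bit for bit (key names vary across sglang versions)
    assert a.get_meta_info("answer") == b.get_meta_info("answer")
```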
One thing I've tried is re-adding the radix cache; however, when removing the --disable-radix-cache flag, the logits are no longer reproducible. I haven't looked at reducing the other flags yet.
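One way to probe whether cached prefixes are what breaks reproducibility is to flush the server's cache between otherwise identical runs and compare results. A hedged sketch, assuming the server from the launch command above and that this sglang version exposes the /flush_cache endpoint (the HTTP method may be GET or POST depending on version):

```python
# Hedged sketch: flush the radix/prefix cache between otherwise identical runs,
# then re-run the batch and compare, to check whether cache state explains the
# non-reproducibility. Host/port match the launch command above.
import requests

resp = requests.get("http://localhost:30001/flush_cache")
print(resp.status_code, resp.text)
```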
The current setup comes with quite a speed decrease, of course, but for "offline" use cases such as synthetic data creation or LLM-as-a-judge (in various structured generation / CoT setups), slower generation/evaluation would be acceptable and total reproducibility highly valued.
Quick observation: with the setup described above, the output is currently seed-invariant, i.e. there is no sampling involved anymore: regardless of the random_seed value, the output is the same.
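This is consistent with how select works: it scores the fixed set of choices by their logprobs and picks the best one, so no stochastic sampling is involved and the seed cannot influence the result. Seed-dependence would only appear with stochastic decoding, e.g. free-form generation at temperature > 0, as in this illustrative sketch (reusing the style of the earlier example, not code from this issue):

```python
# Illustrative sketch: seed-dependence only appears once stochastic sampling is
# involved, e.g. sgl.gen with temperature > 0. With select-style choice scoring
# or greedy decoding, the random seed has no effect on the output.
import sglang as sgl


@sgl.function
def free_form(s, question):
    s += "Question: " + question + "\nAnswer: "
    s += sgl.gen("answer", max_tokens=32, temperature=0.7)
```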
Checklist
Motivation
I believe there is an option for fixing the random seed for the backend, but I think there isn't a feature for per-request random seeds.
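For illustration only, a per-request seed could look like the hypothetical request below. The "seed" field does not exist in sglang at the time of this issue (it is the feature being requested); the rest of the payload follows the native /generate API.

```python
# Hypothetical illustration of the requested feature: a per-request seed.
# The "seed" field does NOT exist in sglang at the time of this issue; the
# rest of the payload follows the native /generate API.
import requests

payload = {
    "text": "Question: Is the sky blue?\nAnswer:",
    "sampling_params": {
        "temperature": 0.7,
        "max_new_tokens": 32,
        "seed": 42,  # hypothetical per-request seed (the feature requested here)
    },
}
resp = requests.post("http://localhost:30000/generate", json=payload)  # adjust host/port
print(resp.json())
```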
Related resources
No response