Open laoconeth opened 2 months ago
Contributions are welcome. The related code is https://github.com/sgl-project/sglang/blob/dff2860a690757966e408b598a8f0b47a29a4713/python/sglang/srt/layers/sampler.py#L83-L85
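For orientation, here is a minimal sketch of how a per-request seed could be threaded into a batched PyTorch sampling step using per-request torch.Generator objects. The function name and signature are illustrative assumptions, not sglang's actual sampler code.

```python
# Minimal sketch (not sglang's actual sampler code): threading a per-request
# seed into a batched PyTorch sampling step. Requests that carry a seed get
# their own torch.Generator; unseeded requests fall back to the global RNG.
from typing import List, Optional

import torch


def sample_with_per_request_seeds(
    probs: torch.Tensor,          # [batch_size, vocab_size], already softmaxed
    seeds: List[Optional[int]],   # one entry per request, None = no seed
) -> torch.Tensor:
    next_tokens = torch.empty(probs.size(0), dtype=torch.long, device=probs.device)
    for i, seed in enumerate(seeds):
        generator = None
        if seed is not None:
            generator = torch.Generator(device=probs.device)
            generator.manual_seed(seed)
        # torch.multinomial accepts an optional generator for reproducible draws
        next_tokens[i] = torch.multinomial(probs[i], num_samples=1, generator=generator)
    return next_tokens
```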
Hi @merrymercy Just spotted this issue when I was looking for another thing. Is there a way to set a universal seed to make things reproducible? Same requests getting same output? I am not concerned with a per-request seed just something you set once. I couldn't find it
@aflah02 This is the global random seed. https://github.com/sgl-project/sglang/blob/39bb49d156f2319d2aec67c458c2db980bb0f4c3/python/sglang/srt/server_args.py#L362
However, depending on your workloads (e.g., batch size), determinism is still very hard to achieve.
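For reference, a global seed helper of this kind typically seeds the Python, NumPy, and PyTorch RNGs, roughly as in the sketch below (a generic sketch, not necessarily sglang's exact implementation). Note that seeding alone does not remove non-determinism introduced by dynamic batching or non-deterministic CUDA kernels.

```python
# Generic sketch of a global seed helper (not necessarily sglang's exact code):
# seeds the Python, NumPy, and PyTorch (CPU + CUDA) RNGs.
import random

import numpy as np
import torch


def set_random_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```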
I run inference with run_batch for an SglSelect choice on different datasets and have taken the following steps to make the generation deterministic, but have not been successful yet (a client-side sketch of this kind of workload follows the server command below):
--random-seed
set_random_seed(self.random_seed) at the beginning of https://github.com/sgl-project/sglang/blob/100f5b8bc976773b595923665715eb13d3bfcab6/python/sglang/srt/managers/scheduler.py#L358
-e NCCL_SOCKET_IFNAME=eth
-e CUDA_DEVICE_ORDER=PCI_BUS_ID
--attention-backend triton
--sampling-backend pytorch (same as setting --disable-flashinfer-sampling)
--disable-radix-cache
--disable-regex-jump-forward
--disable-cuda-graph
--disable-cuda-graph-padding
--disable-disk-cache
--disable-custom-all-reduce
--disable-mla
I know that many of these optimization-disabling flags won't be necessary in the end; I wanted to reach a fully deterministic setup first before re-enabling them one by one.
Server side:
python3 -m sglang.launch_server --model-path mistralai/Mixtral-8x22B-Instruct-v0.1 --attention-backend triton --sampling-backend pytorch --disable-radix-cache --disable-regex-jump-forward --disable-cuda-graph --disable-cuda-graph-padding --disable-disk-cache --disable-custom-all-reduce --disable-mla --random-seed 42 --tp-size 8 --dp-size 1 --host 0.0.0.0 --port 30001
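Client side, the workload looks roughly like this hedged sketch (the prompt, choice labels, and program name are illustrative assumptions, not the exact code from this setup):

```python
# Hedged sketch of a run_batch + select workload against the server above.
# The prompt and choices are illustrative placeholders.
import sglang as sgl


@sgl.function
def judge(s, question):
    s += "Question: " + question + "\n"
    s += "Answer: " + sgl.select("answer", choices=["yes", "no"])


sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30001"))

batch = [{"question": q} for q in ["Is the sky blue?", "Is 2 + 2 = 5?"]]
states = judge.run_batch(batch, progress_bar=True)
for state in states:
    print(state["answer"])
```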
Any idea what could still be missing @merrymercy ? If we figure it out, I can work on a PR
Can you try --max-running-request 1?
Thanks, adding --max-running-request 1 made it reproducible.
Running consecutive run_batch calls on hundreds of sample datapoints returns the same logits per sample, both across multiple run_batch invocations and, of course, across server restarts. Great!
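For concreteness, the reproducibility check looks roughly like the sketch below (reusing the illustrative judge program and batch from the earlier sketch; the exact keys inside the meta info depend on the sglang version):

```python
# Hedged sketch of a reproducibility check: run the same batch twice and
# compare both the selected answers and the per-choice scores exposed via
# each state's meta info. Reuses the illustrative `judge` and `batch` above.
first = judge.run_batch(batch)
second = judge.run_batch(batch)
for a, b in zip(first, second):
    assert a["answer"] == b["answer"], "selected choice differs between runs"
    # meta info carries the scores used by select; exact reproducibility means
    # these match bit for bit (key names vary across sglang versions)
    assert a.get_meta_info("answer") == b.get_meta_info("answer")
```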
One thing I've tried is re-adding the radix cache; however, when removing the --disable-radix-cache flag, the logits are no longer reproducible. I haven't looked at reducing the other flags yet.
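One way to probe whether cached prefixes are what breaks reproducibility is to flush the server's cache between otherwise identical runs and compare results. A hedged sketch, assuming the server from the launch command above and that this sglang version exposes the /flush_cache endpoint (the HTTP method may be GET or POST depending on version):

```python
# Hedged sketch: flush the radix/prefix cache between otherwise identical runs,
# then re-run the batch and compare, to check whether cache state explains the
# non-reproducibility. Host/port match the launch command above.
import requests

resp = requests.get("http://localhost:30001/flush_cache")
print(resp.status_code, resp.text)
```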
The current setup comes with quite a speed decrease, of course, but for "offline" use cases such as synthetic data creation or LLM-as-a-judge (in various structured generation / CoT setups), slower generation/evaluation would be acceptable and total reproducibility highly valued.
Quick observation: with the setup described above, the output is currently seed-invariant, i.e. there is no sampling involved anymore: regardless of the random_seed value, the output is the same.
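This is consistent with how select works: it scores the fixed set of choices by their logprobs and picks the best one, so no stochastic sampling is involved and the seed cannot influence the result. Seed-dependence would only appear with stochastic decoding, e.g. free-form generation at temperature > 0, as in this illustrative sketch (reusing the style of the earlier example, not code from this issue):

```python
# Illustrative sketch: seed-dependence only appears once stochastic sampling is
# involved, e.g. sgl.gen with temperature > 0. With select-style choice scoring
# or greedy decoding, the random seed has no effect on the output.
import sglang as sgl


@sgl.function
def free_form(s, question):
    s += "Question: " + question + "\nAnswer: "
    s += sgl.gen("answer", max_tokens=32, temperature=0.7)
```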
Checklist
Motivation
I believe there is an option for fixing the random seed for the backend, but I think there isn't a feature for per-request random seeds.
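For illustration only, a per-request seed could look like the hypothetical request below. The "seed" field does not exist in sglang at the time of this issue (it is the feature being requested); the rest of the payload follows the native /generate API.

```python
# Hypothetical illustration of the requested feature: a per-request seed.
# The "seed" field does NOT exist in sglang at the time of this issue; the
# rest of the payload follows the native /generate API.
import requests

payload = {
    "text": "Question: Is the sky blue?\nAnswer:",
    "sampling_params": {
        "temperature": 0.7,
        "max_new_tokens": 32,
        "seed": 42,  # hypothetical per-request seed (the feature requested here)
    },
}
resp = requests.post("http://localhost:30000/generate", json=payload)  # adjust host/port
print(resp.json())
```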
Related resources
No response