Quickfix: repeated vllm model cleanup when data_parallel_size>1

huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

MIT License


Quickfix: repeated vllm model cleanup when data_parallel_size>1 #399

Closed: anton-l closed this 6 days ago

anton-l commented 6 days ago

The cleanup seems to be called from multiple processes in data parallel mode, so this just ensures there's no error due to the already deleted model object.
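A minimal sketch of the kind of guard described above. It assumes a hypothetical wrapper class `DummyBackend` with a `model` attribute and a `cleanup` method; these names are illustrative only and are not taken from lighteval or vLLM. The idea is simply that a second cleanup call should become a no-op once the model has been released.

```python
import gc


class DummyBackend:
    """Hypothetical stand-in for a model wrapper (not lighteval's actual class)."""

    def __init__(self):
        self.model = object()  # placeholder for the real engine object
        self.cleanup_calls = 0

    def cleanup(self):
        self.cleanup_calls += 1
        # Guard: another caller may already have released the model.
        if self.model is None:
            return
        self.model = None  # drop the reference so it can be collected
        gc.collect()


backend = DummyBackend()
backend.cleanup()
backend.cleanup()  # second call returns early instead of raising
```

With the guard in place, it does not matter how many times cleanup is triggered; only the first call does any work.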

NathanHB commented 6 days ago

Thanks for the PR! I'm pretty sure that in data parallel mode each process has its own model, so it should not be an issue. Did you run into an issue where the model was already deleted? Also, I think the cleanup is only run on the first process.

anton-l commented 6 days ago

@NathanHB yes, I get an error there because the model is already None. It only happens with vllm,data_parallel_size=2 and up; there are no issues if I disable data parallelism.

NathanHB commented 6 days ago

Ohh, I did not take vllm parallelism into account. It indeed works differently, good catch!