lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Is there a way to combine data parallel and model parallel? #1524

Open sunyuhan19981208 opened 1 year ago

sunyuhan19981208 commented 1 year ago

I would like to ask whether it is possible to combine data parallelism and model parallelism when training LLMs. As far as I can tell, model parallelism only supports one batch at a time, while data parallelism cannot distribute one model across many cards. If I have 1000 1080 Ti cards and want to train a 65B model with a large batch size, what should I do?
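For intuition, here is a hedged sketch (plain Python, not FastChat code) of how frameworks such as Megatron-LM and DeepSpeed combine the two: the GPU pool is arranged into a 2D grid, where consecutive ranks form a model-parallel group holding one sharded model replica, and ranks at the same shard position form a data-parallel group that each sees a different slice of the global batch. The function name `parallel_groups` is illustrative, not a real API.

```python
# Sketch of how 2D (data x model) parallelism partitions a GPU pool.
# Hypothetical illustration: each inner model-parallel group holds one
# sharded copy of the model; gradients are averaged across the
# data-parallel groups, so the global batch can scale with replicas.

def parallel_groups(world_size: int, model_parallel_size: int):
    """Split `world_size` ranks into model-parallel and data-parallel groups."""
    assert world_size % model_parallel_size == 0
    data_parallel_size = world_size // model_parallel_size

    # Consecutive ranks form one model-parallel group (one model replica).
    model_groups = [
        list(range(i * model_parallel_size, (i + 1) * model_parallel_size))
        for i in range(data_parallel_size)
    ]
    # Ranks holding the same model shard form one data-parallel group.
    data_groups = [
        list(range(j, world_size, model_parallel_size))
        for j in range(model_parallel_size)
    ]
    return model_groups, data_groups

# Example: 8 GPUs with the model sharded over 4 -> 2 model replicas,
# each trained on half of the global batch.
mp, dp = parallel_groups(8, 4)
print(mp)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(dp)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

With 1000 cards one would pick a model-parallel width large enough to fit the 65B weights, and the remaining factor becomes the data-parallel replica count.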

vinvcn commented 1 year ago

Same question. On an A100 80G machine, I set up the vicuna-13b model running as the openai_api_server. I then wrote a script to send 10 requests in a batch to the API server, but it seems the server processes the requests one by one. I also submitted a curl request from another terminal, and it blocked until all previous requests had been processed.
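On the client side, requests can at least be issued concurrently with a thread pool rather than sequentially; whether they are actually served in parallel still depends on the server's own concurrency. This is a runnable sketch where `call_api` is a placeholder stub, not the real HTTP call:

```python
# Client-side sketch: issue requests concurrently instead of one by one.
# `call_api` is a placeholder for a real POST to the OpenAI-compatible
# endpoint (e.g. /v1/chat/completions via `requests` or the `openai` client).
from concurrent.futures import ThreadPoolExecutor

def call_api(prompt: str) -> str:
    # Placeholder: a real client would send an HTTP request here.
    return f"echo: {prompt}"

prompts = [f"request {i}" for i in range(10)]
with ThreadPoolExecutor(max_workers=10) as pool:
    # pool.map preserves input order while running calls concurrently.
    replies = list(pool.map(call_api, prompts))

print(replies[0])  # echo: request 0
```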

zenetio commented 1 year ago

You can use Fully Sharded Data Parallel (FSDP for short).
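Since FastChat's training script builds on the Hugging Face Trainer, FSDP can be enabled through its `--fsdp` flag. A sketch of a launch command, with placeholder paths and abbreviated hyperparameters (your exact flags may differ by transformers version):

```shell
# Sketch (paths and hyperparameters are placeholders): FastChat
# fine-tuning with FSDP sharding across 8 GPUs on one node.
torchrun --nproc_per_node=8 fastchat/train/train_mem.py \
    --model_name_or_path /path/to/base-model \
    --data_path /path/to/data.json \
    --output_dir ./checkpoints \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap LlamaDecoderLayer \
    --bf16 True
```

FSDP shards parameters, gradients, and optimizer state across all ranks while still feeding each rank its own data slice, so it addresses the memory problem through sharded data parallelism rather than explicit tensor/model parallelism.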

surak commented 11 months ago

@zenetio but that's not Model Parallel. Do you have any hints on that one?