vLLM
vLLM has been added as the default implementation due to its much higher inference speed compared to the default Hugging Face implementation (30x faster). Thanks to @KennethEnevoldsen for the suggestion! The models are still downloaded from HF, although quantized models are not recommended with vLLM (not a big issue considering the speed).
We're still investigating whether the two implementations produce substantially different outputs, but so far no systematic differences between HF and vLLM have been identified.
vLLM has been run on both one and two A100s. Adding a second GPU gives a noticeable speedup, but it isn't really needed: a single A100 currently generates 5000 stories in around 19 minutes, compared to 5-10 hours with the HF implementation.
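For reference, a minimal sketch of what generation through vLLM might look like; the model name, prompt, and sampling values are placeholders rather than the project's actual settings, and `tensor_parallel_size` is the knob that shards the model across one or two GPUs:

```python
from vllm import LLM, SamplingParams

# Placeholder prompt; the project generates stories from its own prompt set.
prompts = ["Write a short story about a lighthouse keeper."]

# Hypothetical sampling settings; the final parameters are still being tuned.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)

# tensor_parallel_size=1 runs on a single GPU; set it to 2 to shard across two A100s.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", tensor_parallel_size=1)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    # Each RequestOutput holds the prompt and one or more completions.
    print(output.outputs[0].text)
```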
Using Hugging Face?
The project will, for now, continue to support both implementations (HF and vLLM). Ideally, the user should be able to decide for themselves; if they have more compute to burn, they may use it as they wish. Instructions on how to use the generation pipeline (with or without HF) can be found in src/generate.
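For comparison, a rough sketch of what the plain HF path looks like (again with a placeholder model name and generation settings, not the project's actual configuration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer(
    "Write a short story about a lighthouse keeper.", return_tensors="pt"
).to(model.device)

# Sampling values mirror the illustrative vLLM sketch above.
output_ids = model.generate(
    **inputs, max_new_tokens=512, do_sample=True, temperature=0.8, top_p=0.95
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```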
Moving forward
We will soon begin generating data with fixed sampling parameters, though we still need to experiment a little with the various options. Ideally, the sampling parameters should also be added to the dataframes as columns or metadata, so that we keep track of them for later analysis.
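One way this could be done (column names and values below are purely illustrative):

```python
import pandas as pd

# Placeholder sampling settings and generations, just to show the shape of the data.
sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_tokens": 512}
stories = ["Once upon a time...", "The lighthouse keeper woke before dawn..."]

df = pd.DataFrame({"story": stories})

# Option 1: repeat the parameters as columns so each row is self-describing.
for key, value in sampling_params.items():
    df[key] = value

# Option 2: attach them as dataframe-level metadata (note: df.attrs is still
# experimental in pandas and is not preserved by every operation or file format).
df.attrs["sampling_params"] = sampling_params
```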
Still need to fix dailydialog also :)