@trianxy Thanks for your interest in our work. We plan to release the code soon in another repo: https://github.com/microsoft/LMOps. For now, can you specify what details you are missing for reproducing the results? Maybe I can help clarify.
Great @nyanyanya - I will be looking for updates in the above repo.
I was looking for hints on how to adjust model code loaded via Hugging Face's transformers library (e.g. Bloom/GPT-Neo/Pythia) to implement the inference speed-up described in the paper.
I see the details in sections 2+3 of the paper, but wondered if there was more to it than that.
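For concreteness, this is roughly how I currently read the copy-and-verify step from sections 2+3. It is a minimal sketch assuming greedy decoding and a single reference document; the helper find_copy_candidate, the match_len/copy_len values, and the Pythia checkpoint are my own placeholders, not anything from the paper or your code:

```python
# Rough sketch of reference-based copy-and-verify decoding with transformers.
# Assumes greedy decoding and one reference document; EOS handling omitted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-1.4b"  # any causal LM should work
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def find_copy_candidate(output_ids, ref_ids, match_len=4, copy_len=8):
    """If the last `match_len` generated tokens occur in the reference,
    return the `copy_len` tokens that follow them there, else None."""
    if len(output_ids) < match_len:
        return None
    suffix = output_ids[-match_len:]
    for i in range(len(ref_ids) - match_len):
        if ref_ids[i:i + match_len] == suffix:
            return ref_ids[i + match_len:i + match_len + copy_len]
    return None

@torch.no_grad()
def generate_with_reference(prompt, reference, max_new_tokens=64):
    ref_ids = tokenizer(reference, add_special_tokens=False).input_ids
    output_ids = tokenizer(prompt, return_tensors="pt").input_ids[0].tolist()
    prompt_len = len(output_ids)
    while len(output_ids) - prompt_len < max_new_tokens:
        draft = find_copy_candidate(output_ids, ref_ids) or []
        # One forward pass scores the current sequence plus the copied draft.
        input_ids = torch.tensor([output_ids + draft])
        preds = model(input_ids).logits[0].argmax(dim=-1).tolist()
        # Accept copied tokens as long as they agree with the greedy
        # predictions, then append the model's own next token.
        pos = len(output_ids) - 1
        for tok in draft:
            if preds[pos] != tok:
                break
            output_ids.append(tok)
            pos += 1
        output_ids.append(preds[pos])
    return tokenizer.decode(output_ids[prompt_len:])
```

If I understand correctly, the single forward pass over the copied tokens plays the role of the parallel verification step, so the output stays identical to plain greedy decoding and the gain comes from accepting several copied tokens per pass. Please correct me if I'm misreading the method.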
@trianxy I don't think there are any tricks beyond what's in the paper. One possible issue is that the efficiency of our method depends on the overlap between the references and the outputs. If your model's decoding outputs on your task don't have much overlap with the references, you might not get the same speed-up.
I uploaded 10 examples of the input-output pairs used in our retrieval-augmented generation setting here: https://github.com/nyanyanya/LMOps/blob/main/llma/example.jsonl.
You can check whether your model generates similar outputs on these examples. If your outputs have significantly less overlap with the documents, you probably won't get the same speed-up from our method.
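For a quick sanity check, something like the following estimates how much of each output could be copied verbatim from the reference. It is only a rough sketch: adjust the field names to whatever keys the jsonl actually uses, and note that matching-block coverage is just one possible proxy for the overlap that matters here:

```python
# Rough overlap check: what fraction of each output also appears verbatim in
# the reference document. The "doc"/"output" keys are placeholders; rename
# them to match the actual fields in example.jsonl.
import json
from difflib import SequenceMatcher

def overlap_ratio(reference: str, output: str) -> float:
    """Fraction of the output covered by blocks that also occur in the reference."""
    matcher = SequenceMatcher(None, reference, output, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(output), 1)

with open("example.jsonl") as f:
    for i, line in enumerate(f):
        example = json.loads(line)
        ratio = overlap_ratio(example["doc"], example["output"])
        print(f"example {i}: {ratio:.0%} of the output overlaps the reference")
```

If that ratio is low for your own model's outputs, the copying mechanism will rarely fire and decoding falls back to the usual one-token-per-step pace.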
I was reading this arXiv article, which points to the current repo (https://github.com/microsoft/unilm).
The article describes a method to speed up inference for large language models by ~2x. This is extremely exciting!
Are you planning to open-source the corresponding code in this repo? Or can you give any other details that would help us implement/reproduce the results?