microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

LLMA code of https://arxiv.org/pdf/2304.04487.pdf ? #1065

Closed trianxy closed 1 year ago

trianxy commented 1 year ago

I was reading this arXiv article, which points to the current repo (https://github.com/microsoft/unilm).

The article describes a method to speed up inference for large language models by ~2x. This is extremely exciting!

Are you planning to open-source the respective code in this repo? Or can you give any other details that would help us implement/reproduce the results?

nyanyanya commented 1 year ago

@trianxy Thanks for your interest in our work. We plan to release the code soon in another repo: https://github.com/microsoft/LMOps. For now, can you specify which details you are missing for reproducing the results? Maybe I can help clarify.

trianxy commented 1 year ago

Great @nyanyanya - I will be looking for updates in the above repo.

I was looking for hints on how to adjust model code loaded via Hugging Face's transformers library (e.g. BLOOM/GPT-Neo/Pythia) to implement the inference speed-up described in the paper.

I see the details in Sections 2 and 3 of the paper, but wondered whether there is more to it than that.
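
For anyone else looking at this, here is a minimal sketch of how I currently read the copy-then-verify idea from Sections 2-3: look up the suffix of the current output in the reference text, copy a short span of candidate tokens, score the context plus the candidates in a single forward pass, and keep the longest prefix the model itself would have chosen greedily. The model name, helper names, and the span lengths (`match_len`, `copy_len`) are placeholders I picked myself, not from the paper or its code:

```python
# Sketch of reference-based copy-then-verify decoding (greedy, batch size 1).
# Helper names and hyperparameters are my own placeholders, not the authors' code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-160m"  # any HF causal LM should work; small one for a quick test
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()


def find_copy_candidate(output_ids, ref_ids, match_len=4, copy_len=8):
    """If the last `match_len` output tokens appear in the reference,
    return the `copy_len` reference tokens that follow the match."""
    if len(output_ids) < match_len:
        return None
    suffix = output_ids[-match_len:]
    for i in range(len(ref_ids) - match_len):
        if ref_ids[i:i + match_len] == suffix:
            return ref_ids[i + match_len:i + match_len + copy_len]
    return None


@torch.no_grad()
def generate_with_reference(prompt, reference, max_new_tokens=128):
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids[0].tolist()
    ref_ids = tokenizer(reference).input_ids
    out_ids = list(prompt_ids)

    while len(out_ids) - len(prompt_ids) < max_new_tokens:
        candidate = find_copy_candidate(out_ids, ref_ids) or []
        # Score the current context plus the copied candidate in ONE forward pass.
        input_ids = torch.tensor([out_ids + candidate])
        logits = model(input_ids).logits[0]
        greedy = logits.argmax(dim=-1).tolist()  # model's greedy choice at every position

        # greedy[len(out_ids)-1] is the model's next token for the current context.
        # Accept copied tokens only as long as they agree with the greedy choices.
        next_tokens = [greedy[len(out_ids) - 1]]
        for j, tok in enumerate(candidate):
            if tok != next_tokens[-1]:
                break
            next_tokens.append(greedy[len(out_ids) + j])
        out_ids += next_tokens
        if tokenizer.eos_token_id in next_tokens:
            break

    return tokenizer.decode(out_ids[len(prompt_ids):])
```

If I read the verification step correctly, the accepted tokens should be identical to plain greedy decoding; the copy step only reduces the number of forward passes. Please correct me if I'm misunderstanding something.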

nyanyanya commented 1 year ago

@trianxy I don't think there are more tricks beyond what's in the paper. One possible issue is that the efficiency of our method depends on the overlap between the references and the outputs. If the decoding outputs of your model on your task don't overlap much with the references, you might not get the same speed-up.

I uploaded 10 examples of input-output pairs used in our retrieval-augmented generation setting here: https://github.com/nyanyanya/LMOps/blob/main/llma/example.jsonl.

You can check whether your model generates similar outputs on these examples. If your outputs have significantly less overlap with the documents, you probably won't get the same speed-up using our method.
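
A rough way to quantify the overlap is something like the sketch below (not part of our released code; the JSON keys `docs` and `output` are placeholders for whatever the file actually uses, and whitespace tokenization is only an approximation of the model's tokenizer):

```python
# Rough overlap check: what fraction of the output's token n-grams
# also appear in the reference documents? Field names are placeholders.
import json


def ngram_overlap(output_tokens, ref_tokens, n=4):
    """Fraction of the output's n-grams that also occur in the reference."""
    out_ngrams = [tuple(output_tokens[i:i + n]) for i in range(len(output_tokens) - n + 1)]
    ref_ngrams = {tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)}
    if not out_ngrams:
        return 0.0
    return sum(g in ref_ngrams for g in out_ngrams) / len(out_ngrams)


with open("example.jsonl") as f:
    for i, line in enumerate(f):
        ex = json.loads(line)
        docs = ex["docs"]                # placeholder key: the retrieved reference documents
        output = ex["output"]            # placeholder key: replace with your model's own output
        reference = " ".join(docs) if isinstance(docs, list) else docs
        score = ngram_overlap(output.split(), reference.split())
        print(f"example {i}: 4-gram overlap = {score:.2f}")
```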