Hi, the easiest way is to use optimum for the conversion to ONNX, and for 4-bit quantization use the default ONNX Runtime method they recently added.
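For reference, the whole conversion looks roughly like this (just a sketch: I'm assuming the 600M distilled NLLB checkpoint and a recent onnxruntime version that ships the int4 MatMul quantizer; adapt the file names to whatever the export actually produces):

```python
# 1) Export NLLB to ONNX with optimum (the "-with-past" task also exports the kv-cache variant)
#    optimum-cli export onnx --model facebook/nllb-200-distilled-600M --task text2text-generation-with-past nllb_onnx/

# 2) 4-bit (int4) weight-only quantization with ONNX Runtime
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("nllb_onnx/decoder_model.onnx")   # repeat for each exported .onnx file
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quantizer.process()
quantizer.model.save_model_to_file("nllb_onnx/decoder_model_int4.onnx",
                                   use_external_data_format=True)
```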
If you also want to know how I made certain optimizations to reduce RAM consumption and use the kv-cache, let me know; I can explain it to you in broad terms (a complete tutorial with details would take too long, maybe one day I'll write an article about it).
Yes, I am very interested in knowing how to perform those optimizations to reduce RAM consumption and use the kv-cache, as well as how to speed up inference. Also, what excellent translation models do you know of, and is there a translation performance ranking? Thank you very much, God bless you
Sorry if I'm only replying now (I forgot 😅). I'll start by saying that to fully understand this explanation you should know how transformer models work in some detail.
As for the optimizations I made to reduce RAM consumption for NLLB, after exporting the model with optimum (with the option to use the kv-cache) I extracted the embedding layer and the lm-head and put them in a separate model (NLLB_embed_and_lm_head.onnx).
These two components use the same weight matrix, so if they are not separated, when we load the encoder and the decoder into memory we load that matrix twice, increasing RAM consumption.
Also, normally, to use the kv-cache with a model exported with optimum, you use (and load into RAM) two slightly different copies of the decoder (one called "with past" and one without). To avoid loading two copies of the decoder into RAM, I extracted from the decoder without past only the component that generates the encoder part of the kv-cache (NLLB_cache_initializer.onnx), and I use only the decoder with past (NLLB_decoder.onnx) for inference.
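So in the end there are four ONNX files, each loaded once as an onnxruntime session, more or less like this (the file names are the ones above, the rest is just a sketch):

```python
import onnxruntime as ort

# one InferenceSession per file; the shared embedding/lm-head weights are now in RAM only once
embed_lm_head = ort.InferenceSession("NLLB_embed_and_lm_head.onnx")
encoder       = ort.InferenceSession("NLLB_encoder.onnx")
cache_init    = ort.InferenceSession("NLLB_cache_initializer.onnx")  # builds the encoder part of the kv-cache
decoder       = ort.InferenceSession("NLLB_decoder.onnx")            # the "with past" decoder, used for every step
```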
Inference, without going into detail, works like this (some things are simplified; this explanation just gives the principles of how it works):
Given an input sentence to translate, it is first converted to input_ids with the tokenizer (integrated in the app code, I use sentencepiece). Then these input_ids are embedded using NLLB_embed_and_lm_head.onnx, the result is used as input to the encoder (NLLB_encoder.onnx), and the encoder output is used as input to NLLB_cache_initializer.onnx.
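In rough Python pseudocode, this first part looks like this (using the sessions from the sketch above; the tensor names are only illustrative, they depend on how you exported and split the graphs):

```python
import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")   # NLLB's sentencepiece model
input_ids = np.array([sp.encode("How are you?")], dtype=np.int64)
attention_mask = np.ones_like(input_ids)

# 1) token ids -> embeddings (shared embedding matrix)
input_embeds = embed_lm_head.run(None, {"input_ids": input_ids})[0]

# 2) embeddings -> encoder hidden states
encoder_out = encoder.run(None, {"inputs_embeds": input_embeds,
                                 "attention_mask": attention_mask})[0]

# 3) encoder output -> encoder part of the kv-cache (one key/value pair per decoder layer)
encoder_cache = cache_init.run(None, {"encoder_hidden_states": encoder_out})
```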
At this point we pass to the decoder, as kv-cache, the encoder part of the cache just generated by NLLB_cache_initializer.onnx, and, as the other input, the embedding of a special token (now I don't remember which one), obtained with NLLB_embed_and_lm_head.onnx.
After that it will generate as output a series of matrices that are the new decoder kv-cache, and a matrix that we ignore for now.
Now we re-run the decoder, but with the new decoder kv-cache (we always use the same encoder kv-cache) and with the embedding of a different special token (which this time indicates the input language, so it depends on the selected input language).
Then we re-run the decoder again, with the new decoder kv-cache and with the embedding of a special token that indicates the output language.
At this point we have, as said before, some output matrices that are the new decoder kv-cache, and a matrix that this time we do not ignore but pass through NLLB_embed_and_lm_head.onnx to obtain (using the lm-head) a series of values (logits), from which we obtain the first output word generated by our model.
Now we repeatedly execute the decoder, passing it each time its new kv-cache (and the same encoder kv-cache) and the embedding of the word obtained in the previous iteration. We continue like this until the generated word is a special token that indicates the end of generation (called eos).
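Putting it together, the generation loop looks more or less like the sketch below. It continues from the previous sketches, and everything in it is an assumption: the tensor names, the cache layout and ordering, the language-token lookups, and the idea that the embed/lm-head model can be called separately for the embedding and for the logits all depend on how you split and exported the graphs.

```python
import numpy as np

# assumption: the decoder's cache inputs come back as outputs in the same order
DEC_CACHE_NAMES = [i.name for i in decoder.get_inputs() if "past" in i.name and ".decoder." in i.name]
ENC_CACHE_NAMES = [i.name for i in decoder.get_inputs() if "past" in i.name and ".encoder." in i.name]
NUM_HEADS, HEAD_DIM = 16, 64                                   # NLLB-600M attention shape (illustrative)

encoder_kv = dict(zip(ENC_CACHE_NAMES, encoder_cache))          # from NLLB_cache_initializer.onnx
decoder_kv = {n: np.zeros((1, NUM_HEADS, 0, HEAD_DIM), np.float32)  # empty decoder cache on the first step
              for n in DEC_CACHE_NAMES}

def embed(token_id):
    # token id -> embedding, via the shared matrix in NLLB_embed_and_lm_head.onnx
    return embed_lm_head.run(None, {"input_ids": np.array([[token_id]], np.int64)})[0]

def lm_head(hidden):
    # decoder hidden state -> vocabulary logits, via the same shared matrix
    return embed_lm_head.run(None, {"hidden_state": hidden})[0]

def decoder_step(token_embed, decoder_kv):
    # one NLLB_decoder.onnx call: token embedding + decoder cache + (fixed) encoder cache in,
    # hidden state of the new position + updated decoder cache out
    out = decoder.run(None, {"inputs_embeds": token_embed, **decoder_kv, **encoder_kv})
    return out[0], dict(zip(DEC_CACHE_NAMES, out[1:]))

# first feed the special tokens described above (start token, then the language tokens);
# how you look up their ids depends on your tokenizer setup, these are just placeholders
src_lang_id = sp.piece_to_id("eng_Latn")
tgt_lang_id = sp.piece_to_id("ita_Latn")
hidden = None
for special_id in [sp.eos_id(), src_lang_id, tgt_lang_id]:
    hidden, decoder_kv = decoder_step(embed(special_id), decoder_kv)

# then: logits -> argmax -> embed the chosen token -> next decoder step, until eos
generated, next_id = [], int(np.argmax(lm_head(hidden)[0, -1]))
while next_id != sp.eos_id() and len(generated) < 200:
    generated.append(next_id)
    hidden, decoder_kv = decoder_step(embed(next_id), decoder_kv)
    next_id = int(np.argmax(lm_head(hidden)[0, -1]))

print(sp.decode(generated))
```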
To optimize the RAM consumption and speed of Whisper I used the same principles.
Also, what excellent translation models do you know of, and is there a translation performance ranking?
I know another model with very good quality, even higher than NLLB: Madlad. I have not integrated it into RTranslator yet because it is quite a bit bigger than NLLB (3.3B parameters vs 600M), but I will probably add it as an option in the future.
As far as I know, there is no simple ranking of the quality (or performance) of translation models; the only way to know is to read the papers of each translation model. However, from my research, the best for now are NLLB and Madlad.
I hope I have clarified your doubts.
I exported three models of the NLLB model using ONNX, which are decoder_with_past_model_quantized.onnx, decoder_model_quantized.onnx and encoder_model_quantized.onnx. I'm not sure how to use ONNX Runtime for inference, and which variables make up the inputs of the encoder and decoder, respectively.
As I said in issue #84, a full tutorial would be too long (I don't have much free time), but I will convert this issue to a discussion, so more people will see it and maybe someone can help you.
How can I convert NLLB models to ONNX format and quantize them to INT4? Are there any recommended methods?