I would say it can speed up decoding since the library is optimized for inference. The GPU benchmarks for float16 are interesting. It's 2x+ faster than Marian with only a slight decrease in BLEU. We do use half-precision decoding, but on 4-8 GPUs per machine.
How much time are we currently spending on translations in the pipeline?
Translation is split across several tasks. For Hungarian, the total time spent was around 443 hours. If we can speed that up by 2x, it'll be a nice cost saving (and the pipeline will also finish faster, even though we could achieve the same by splitting across even more tasks).
The command to convert a Marian model to CTranslate2 is `ct2-marian-converter` (`ct2-opus-mt-converter` is also interesting): https://opennmt.net/CTranslate2/guides/marian.html. As input, we need the `model.npz` file (we can use a teacher one from a `train-teacher` or `finetune-teacher` task, a student one from a `train-student` or `finetune-student` task, or a quantized one from a `quantize` task) and vocabulary files (we have an SPM vocab, but not in the format CTranslate2 is expecting).
To generate the vocab, we can use the `marian-vocab` command. Perhaps we can add a `marian-vocab` execution to the `train-vocab` task, so we have the necessary vocab files readily available.
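For reference, `marian-vocab` builds a vocabulary from text given on stdin and writes YAML on stdout, so the invocation would presumably look something like `marian-vocab < corpus.en.txt > vocab.en.yml` (the file names here are just placeholders).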
For now, we can only convert the teacher model. The student model uses an RNN-based decoder, which isn't supported by CTranslate2 yet.
I managed to convert our teacher model (the Hungarian one from our latest en->hu run) by changing the `load_vocab` function in https://github.com/OpenNMT/CTranslate2/blob/c6f7f3bcc61964ca787cadf796e237fa0025f483/python/ctranslate2/converters/marian.py#L118 to:
```python
def load_vocab(path):
    # Load the SentencePiece model and return its pieces in id order,
    # which is the token list the converter expects.
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(path)
    return [sp.id_to_piece(i) for i in range(sp.vocab_size())]
```
(NOTE: presumably we can convert the `vocab.spm` to a YAML file and load it without having to patch CTranslate2.)
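A minimal sketch of that idea, assuming the Marian-style vocab format is a YAML mapping of token to id (the output file name is just an example):

```python
import sentencepiece as spm
import yaml

# Dump the SentencePiece pieces, in id order, as a token -> id YAML mapping.
sp = spm.SentencePieceProcessor("vocab.spm")
vocab = {sp.id_to_piece(i): i for i in range(sp.vocab_size())}
with open("vocab.yml", "w", encoding="utf-8") as f:
    yaml.safe_dump(vocab, f, allow_unicode=True)
```

We could then pass `vocab.yml` to `--vocab_paths` instead of `vocab.spm`.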
Then running:

```
ct2-marian-converter --model_path model.npz.best-chrf.npz --vocab_paths vocab.spm --output_dir ct2_teacher_model
```
Then, to run the model:
```python
import ctranslate2
import sentencepiece as spm

# Load the converted model and the SentencePiece tokenizer.
translator = ctranslate2.Translator("ct2_teacher_model", device="cpu")
sp = spm.SentencePieceProcessor("vocab.spm")

# Tokenize into SentencePiece pieces, translate, then detokenize the
# best hypothesis.
input_text = "Hello, world!"
input_tokens = sp.encode(input_text, out_type=str)
results = translator.translate_batch([input_tokens])
output_tokens = results[0].hypotheses[0]
print(sp.decode(output_tokens))
```
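For the float16 GPU numbers mentioned above, the equivalent would presumably be constructing the translator with `ctranslate2.Translator("ct2_teacher_model", device="cuda", compute_type="float16")`.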
CTranslate2 does not support ensemble translation out of the box. I'm investigating further.
So it looks like Marian is doing something that CTranslate2 doesn't support. I believe we'd have to fork CTranslate2 and add support ourselves in order to use ensembles.
Boiled down, in Marian it does:
```cpp
// The log probability scores for each word in the vocab for the next token prediction.
Expr stepScores;
// Here scorers_ is our vector of teacher models.
for (size_t i = 0; i < scorers_.size(); ++i) {
  // The log probabilities over the vocab from the current scorer.
  Expr logProbs = states[i]->getLogProbs().getLogits();
  // Combine the scores from all scorers.
  if (i == 0)
    // The first model sets stepScores and applies its weight. In our case the
    // weight is 1.0.
    stepScores = scorers_[i]->getWeight() * logProbs;
  else
    // Each successive model adds its weighted log probabilities to the step to create
    // a combined prediction. CTranslate2 can't do this unless the source code is modified.
    stepScores = stepScores + scorers_[i]->getWeight() * logProbs;
}
```
Source: https://github.com/marian-nmt/marian-dev/blob/master/src/translator/beam_search.cpp#L456
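To make the combination concrete, here is a small self-contained sketch (plain numpy, with hard-coded toy distributions standing in for the teachers' decoder outputs) of the same weighted log-probability sum:

```python
import numpy as np

def ensemble_step_scores(per_model_log_probs, weights):
    # The ensemble score for each candidate token is the weighted sum of the
    # models' log probabilities (a weighted product in probability space).
    step_scores = np.zeros_like(per_model_log_probs[0])
    for log_probs, weight in zip(per_model_log_probs, weights):
        step_scores += weight * log_probs
    return step_scores

# Toy next-token distributions over a 3-word vocab for two teacher models.
log_probs_a = np.log([0.7, 0.2, 0.1])
log_probs_b = np.log([0.5, 0.4, 0.1])
combined = ensemble_step_scores([log_probs_a, log_probs_b], weights=[1.0, 1.0])
print(combined.argmax())  # beam search would instead keep the top-k candidates
```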
We could also investigate removing the teacher ensemble (#778), which would unblock us here.
If we test a single model with CTranslate2, we should also test the quality hit of using a quantized model for additional performance gains.
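For reference, the CTranslate2 converters accept a `--quantization` option that applies quantization at conversion time, e.g. `ct2-marian-converter --model_path model.npz.best-chrf.npz --vocab_paths vocab.spm --output_dir ct2_teacher_model --quantization int8`.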
Overall, this should speed up the training pipeline, since translation is one of its most expensive steps.