telekom / mltb2

Machine Learning Toolbox 2
https://telekom.github.io/mltb2/
MIT License

Add translation tool #7

Open · PhilipMay opened this issue 1 year ago

PhilipMay commented 1 year ago

Like this:

```python
import torch

# Load an ensemble of four WMT'19 English→German models from the fairseq hub.
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de',
                       checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
                       tokenizer='moses', bpe='fastbpe')
_ = en2de.eval()  # disable dropout
_ = en2de.cuda()  # use GPU
```
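
For a quick single-sentence check, `translate()` is the same method used in the batching example below (the example sentence is illustrative):

```python
# One-off translation with the loaded ensemble.
print(en2de.translate("Machine learning is great!"))
```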
PhilipMay commented 1 year ago

See Philip's code examples in GerAlpacaDataCleaned.

PhilipMay commented 1 year ago
```python
import torch
import more_itertools
from tqdm import tqdm

en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de',
                       checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
                       tokenizer='moses', bpe='fastbpe')

_ = en2de.cuda()  # use GPU

# Translate the "en" column of a pandas DataFrame (df) in batches of 10.
en_de_texts = []
chunks = list(more_itertools.chunked(df["en"].tolist(), 10))
for chunk in tqdm(chunks):
    en_de_texts.extend(en2de.translate(chunk))
```
PhilipMay commented 1 year ago

We could also add facebook/nllb-200-distilled-600M.
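
A minimal sketch of how loading it could look, assuming the Hugging Face transformers API; the language codes (eng_Latn, deu_Latn) and the example sentence are illustrative and just match the en→de direction used above:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Machine learning is great!", return_tensors="pt")
generated = model.generate(
    **inputs,
    # NLLB expects the target language code as the forced first token.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"),
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```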

PhilipMay commented 1 year ago

The fairseq model dependencies are:

- hydra-core
- omegaconf
- bitarray
- sacrebleu
- sacremoses
- Cython
- fastBPE

Problem with fastBPE: https://github.com/glample/fastBPE/issues/27#issuecomment-531544543
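
For reference, they could be pulled in with something like `pip install hydra-core omegaconf bitarray sacrebleu sacremoses Cython fastBPE` (untested; given the fastBPE issue linked above, that package may need special handling during installation).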