
Code for E5 #1107

Closed · yclzju closed this issue 1 year ago

yclzju commented 1 year ago

Hi, would you release the code for pre-training and fine-tuning E5?

intfloat commented 1 year ago

Thanks for your interest.

The fine-tuning code is based on https://github.com/microsoft/unilm/tree/master/simlm, with minor differences in the input format.
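For reference, the released E5 model cards document that every input carries a `query: ` or `passage: ` prefix; a hypothetical helper a simlm-style data loader might apply (the function name and signature are illustrative, not from the actual fine-tuning code):

```python
# E5 inputs carry "query: " / "passage: " prefixes (per the released
# model cards). The helper below is a sketch, not the released code.
from typing import List, Tuple

def add_e5_prefixes(query: str, passages: List[str]) -> Tuple[str, List[str]]:
    return "query: " + query, ["passage: " + p for p in passages]
```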

For the pre-training part, we currently have no plans to release the collected data, but the implementation of the contrastive loss is fairly straightforward.
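For anyone reproducing it, here is a minimal sketch of an in-batch-negatives InfoNCE loss of the kind the E5 paper describes; the function name, shapes, and temperature value are illustrative, not the released implementation:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q_emb: torch.Tensor, p_emb: torch.Tensor,
                  temperature: float = 0.01) -> torch.Tensor:
    """q_emb, p_emb: [batch, hidden] query/passage embeddings.
    Each query's positive is the passage at the same batch index;
    all other in-batch passages act as negatives."""
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    # Cosine similarity matrix scaled by temperature: [batch, batch]
    logits = q @ p.t() / temperature
    # Diagonal entries are the positive pairs
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```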

Liang

yclzju commented 1 year ago

Hi, thanks for the reply. If I have my own dataset, e.g. a Chinese dataset, can I use intfloat/multilingual-e5-base as the initial checkpoint and then fine-tune it with the simlm code?

intfloat commented 1 year ago

Sure, you can certainly do that.
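For example, the checkpoint can be loaded as the starting point (a sketch with Hugging Face transformers; a simlm-style trainer would typically just take the model name as its model path argument rather than loading it manually):

```python
# Sketch: start fine-tuning from the public multilingual E5 checkpoint.
from transformers import AutoModel, AutoTokenizer

name = "intfloat/multilingual-e5-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)  # then fine-tune on your Chinese data
```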