HAOChuzhan opened 1 year ago
For your questions:
Do both models have the same two-stage training? Yes, the techniques are the same, but the data is different. The first stage is contrastive pre-training, and the second stage is supervised fine-tuning.
What are the specific differences between the training data of the two stages for both models? Multilingual-e5 models use multilingual data for both stages, while e5-base only uses English data.
If I want to fine-tune on a larger model (chinese-roberta-large), how can I achieve the effect of your multilingual-e5-base model here? You need to collect many Chinese text pairs, and then follow our paper to do two-stage training. This is generally a time-consuming process.
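To make the two-stage recipe above concrete: both stages rely on a contrastive objective with in-batch negatives. The following is a minimal InfoNCE-style sketch in PyTorch, not the authors' actual training code; the temperature value and the use of batch-internal negatives are assumptions based on common practice for this family of models.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negatives contrastive loss: each query should score
    highest against its own passage; the other passages in the batch
    act as negatives."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    # (batch, batch) cosine-similarity matrix, scaled by temperature
    logits = q @ p.T / temperature
    # the correct passage for query i sits on the diagonal
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

Collecting the Chinese text pairs is the hard part; once you have them, a loss like this is what ties the query and passage encoders' outputs together during both stages.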
Thanks for your reply! I found that only the Multilingual-E5-base model is provided on Hugging Face. Has the Multilingual-E5-large version been released? If so, could you please provide the Multilingual-E5-large checkpoints?
We'll release multilingual-e5-large checkpoint, but it will take some time, perhaps weeks.
I am eager to test the new release.
I am using the multilingual-e5-base model, which performs well on Chinese datasets. Thank you very much for your approach!
Therefore, I'd like to ask you some questions.
I would be very grateful if the author could answer my questions! 😊
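For reference, this is roughly how I encode Chinese text with multilingual-e5-base. It is a sketch using the `transformers` library: the `query: `/`passage: ` prefixes and masked mean pooling follow the model card's recommended usage, but treat the exact details as assumptions rather than the authors' official code.

```python
import torch

def average_pool(last_hidden_states: torch.Tensor,
                 attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings, ignoring padding positions."""
    masked = last_hidden_states.masked_fill(
        ~attention_mask[..., None].bool(), 0.0)
    return masked.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def embed(texts: list[str]) -> torch.Tensor:
    """Encode prefixed texts into L2-normalized sentence embeddings.
    Downloads the checkpoint on first use."""
    # imported here so the pooling helper above can be exercised
    # without fetching the model
    from transformers import AutoModel, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")
    model = AutoModel.from_pretrained("intfloat/multilingual-e5-base")
    batch = tokenizer(texts, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    embeddings = average_pool(outputs.last_hidden_state,
                              batch["attention_mask"])
    return torch.nn.functional.normalize(embeddings, dim=-1)

# E5 expects a "query: " or "passage: " prefix on every input text:
# embed(["query: 天气怎么样", "passage: 今天天气晴朗"])
```

Cosine similarity between the resulting vectors (a plain dot product, since they are normalized) is what I use for retrieval on the Chinese datasets.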