naver / splade

SPLADE: sparse neural search (SIGIR21, SIGIR22)

Inquiry about Configuration Details for "ecir23-scratch-tydi-japanese-splade" Model #55

Closed kuro96al closed 2 months ago

kuro96al commented 5 months ago

Hello, I am currently developing a Japanese model and have been referencing the "ecir23-scratch-tydi-japanese-splade" model on Hugging Face for guidance. I would greatly appreciate it if you could share the specific settings used to create this model, including the base model and datasets. This information will be incredibly helpful for my project. Thank you in advance for your assistance.

URL: https://huggingface.co/naver/ecir23-scratch-tydi-japanese-splade

carlos-lassance commented 5 months ago

Hi @kuro96al,

We pretrained the model from scratch using the Japanese Mr.TyDi corpus (https://github.com/castorini/mr.tydi). We then trained it with a contrastive loss using Japanese mMARCO (https://github.com/unicamp-dl/mMARCO), and finally fine-tuned it on the Japanese Mr.TyDi train query set.
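For readers who want to see what the contrastive stage looks like in practice, below is a minimal sketch of a standard in-batch-negatives loss of the kind commonly used for SPLADE training; the exact loss, negative mining, and hyperparameters used for this checkpoint are not specified in this thread.

```python
# Minimal sketch of a contrastive (in-batch negatives) loss for the mMARCO stage.
# This is an illustration under stated assumptions, not the authors' training code.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_reps: torch.Tensor, d_reps: torch.Tensor) -> torch.Tensor:
    """q_reps, d_reps: (batch, vocab) SPLADE vectors; row i of d_reps is the positive for query i."""
    scores = q_reps @ d_reps.T                                     # (batch, batch) dot-product scores
    labels = torch.arange(scores.size(0), device=scores.device)   # positives lie on the diagonal
    return F.cross_entropy(scores, labels)
```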

The model is based on a DistilBERT architecture (6 layers, 768 hidden dims), but as said above, it is initialized randomly and then trained as described in the previous paragraph.
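As a concrete illustration of that "randomly initialized DistilBERT" setup, here is a hedged sketch: the tokenizer path is hypothetical (a Japanese tokenizer would have to be prepared beforehand), the sizing simply follows the 6-layer / 768-dim description above, and the SPLADE activation shown is the one from the papers, not necessarily the exact code used for this checkpoint.

```python
# Hedged sketch: build a SPLADE-style model from a randomly initialized 6-layer,
# 768-dim DistilBERT with an MLM head. The tokenizer path is hypothetical.
import torch
from transformers import AutoTokenizer, DistilBertConfig, DistilBertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("path/to/japanese-tokenizer")  # hypothetical path
config = DistilBertConfig(n_layers=6, dim=768, hidden_dim=3072, vocab_size=len(tokenizer))
model = DistilBertForMaskedLM(config)  # random weights, no pretrained checkpoint loaded

def splade_encode(texts):
    """SPLADE term weights: max over positions of log(1 + ReLU(MLM logits))."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    logits = model(input_ids=batch["input_ids"],
                   attention_mask=batch["attention_mask"]).logits   # (batch, seq_len, vocab)
    weights = torch.log1p(torch.relu(logits))                       # sparsifying activation
    weights = weights * batch["attention_mask"].unsqueeze(-1)       # zero out padding positions
    return weights.max(dim=1).values                                # (batch, vocab) sparse vectors
```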

For more information, here is a paper describing the strategies we used to develop that model and what we were aiming for: https://arxiv.org/pdf/2301.10444.pdf

kuro96al commented 4 months ago

Thank you for your response. Is the pre-trained model uploaded on platforms like Hugging Face?

kuro96al commented 4 months ago

We attempted to train SPLADE based on the model found at https://huggingface.co/line-corporation/line-distilbert-base-japanese/tree/main, but it seems that there were issues with the vocabulary that prevented successful training.

carlos-lassance commented 4 months ago

> Thank you for your response. Is the pre-trained model uploaded on platforms like Hugging Face?

Unfortunately it is not; I'm not sure if we still have it...

> We attempted to train SPLADE based on the model found at https://huggingface.co/line-corporation/line-distilbert-base-japanese/tree/main, but it seems that there were issues with the vocabulary that prevented successful training.

Yeah, we found similar problems with a lot of models; that's one of the reasons we went with training a model from scratch.
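For anyone hitting the same issue, one quick sanity check is whether the tokenizer vocabulary and the MLM head dimension line up, since SPLADE produces one term weight per MLM vocabulary entry. The sketch below is only a starting point, not a confirmed diagnosis of the LINE model's problem; the model name comes from the thread, and `trust_remote_code=True` is an assumption because the tokenizer appears to rely on custom code.

```python
# Hedged sanity-check sketch for vocabulary / MLM-head compatibility before SPLADE training.
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "line-corporation/line-distilbert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)  # custom tokenizer code assumed
model = AutoModelForMaskedLM.from_pretrained(name)

print("tokenizer vocab:", len(tokenizer))
print("MLM head vocab :", model.config.vocab_size)
if len(tokenizer) != model.config.vocab_size:
    print("Mismatch: MLM logits would not map one-to-one onto tokenizer terms.")
```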