naver / splade

SPLADE: sparse neural search (SIGIR21, SIGIR22)

Multilingual version of SPLADE #28

Closed: nickchomey closed this issue 1 year ago

nickchomey commented 1 year ago

I'm very impressed by SPLADE, particularly the newest efficient versions. However, it is only trained on English texts.

There's an mMARCO dataset that has 14 languages, which is already in use by SBERT and other projects. Importantly, there's a doc2query mT5 model that uses this dataset. It seems to me that anyone using non-English (or multiple) languages would have no choice but to use this. A SPLADE version would be fantastic, especially if compared to the mT5 version of doc2query on BEIR zero-shot data!

Even better would be if you could somehow use the FLORES-200 dataset, which is used by the cutting edge NLLB-200 translation model!

Would you consider implementing a multilingual version in a future iteration of SPLADE? I think this would provide immense value to the global community!

Also, it's not clear to me whether the SPLADE++ methods were used as part of your efficient version. So, it would be great if you could use and compare it with the other methods.

cadurosar commented 1 year ago

> I'm very impressed by SPLADE, particularly the newest efficient versions. However, it is only trained on English texts.

Hi @nickchomey, thanks for your kind words and your interest in SPLADE! Actually, we have some initial results that should hit arxiv this week, as we used a multilingual version of SPLADE in the WSDM CUP 23 - MIRACL competition. But it is still early work, and it should take us a while to have a model that we can share on huggingface.

> There's an mMARCO dataset that has 14 languages, which is already in use by SBERT and other projects. Importantly, there's a doc2query mT5 model that uses this dataset. It seems to me that anyone using non-English (or multiple) languages would have no choice but to use this.

So, this is something that we have been looking into for a while and, as I spoiled, have done a little bit of for a competition, though not on the languages of mMARCO but on around the same number of languages (the 15 of MIRACL). There are two main problems:

1) Getting a good model initialization in a multilingual fashion that works with SPLADE. Some people have tried to use off-the-shelf pretrained multilingual models with mixed results (for example https://desires.dei.unipd.it/2022/papers/paper-06.pdf, most notably Figure 2). As we have seen in the efficiency study for SPLADE, a good initialization is paramount, but pretraining multilingual models is quite costly.

2) SPLADE training and inference costs increase a lot with vocabulary size. For a single language, a tokenizer of around 30k entries may suffice (which is the size we use for English), but going multilingual we need to increase the vocabulary size. Going for a 120k vocabulary (mBERT) makes it a lot more costly, reducing the max sequence length we can use and/or requiring larger GPUs. If we try a 240k vocabulary (infoxlm-roberta) then it is even harder (see the sketch below).
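To make the vocabulary point concrete, here is a rough sketch (not our training code, just an illustration using public Hugging Face tokenizers): a SPLADE representation has one dimension per vocabulary entry, so the MLM head, the FLOPS regularization, and the index all grow with the tokenizer size.

```python
# Minimal illustration: SPLADE's sparse vector has one dimension per
# vocabulary entry, so everything downstream (MLM head, FLOPS regularizer,
# inverted index) scales with the tokenizer size.
from transformers import AutoTokenizer

for name in [
    "bert-base-uncased",             # ~30k entries, English-only
    "bert-base-multilingual-cased",  # ~120k entries (mBERT)
    "xlm-roberta-base",              # ~250k entries (XLM-R family)
]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: |V| = {tok.vocab_size} -> sparse rep of dimension {tok.vocab_size}")
```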

> A SPLADE version would be fantastic, especially if compared to the mT5 version of doc2query on BEIR zero-shot data!

I'm sorry, but I'm not sure I understand here. I can't see how to compare multilingual models on BEIR, as that would only judge the models' capabilities in English, which could skew the results.

> Even better would be if you could somehow use the FLORES-200 dataset, which is used by the cutting edge NLLB-200 translation model!

On the subject of FLORES and NLLB, while we think this is still far in the future, using NLLB to translate from any language to English and then using the SPLADE++ models is a proxy that we know works well. For example, in TREC-neuCLIR this is what worked best out of the different approaches we tested.
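For concreteness, here is a rough sketch of that translate-then-retrieve proxy (not the exact pipeline we used for TREC-neuCLIR, just an illustration; the checkpoint names are the public Hugging Face ones, so double-check them): translate the document to English with NLLB-200, then feed the translation to an English SPLADE++ model as usual.

```python
# Rough sketch of the proxy: translate any language to English with NLLB-200,
# then index/search the translation with an English SPLADE++ checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

nllb_name = "facebook/nllb-200-distilled-600M"  # public distilled NLLB checkpoint
nllb_tok = AutoTokenizer.from_pretrained(nllb_name, src_lang="fra_Latn")
nllb = AutoModelForSeq2SeqLM.from_pretrained(nllb_name)

doc_fr = "La recherche neuronale parcimonieuse combine efficacité et interprétabilité."
out = nllb.generate(
    **nllb_tok(doc_fr, return_tensors="pt"),
    # force the decoder to generate English
    forced_bos_token_id=nllb_tok.convert_tokens_to_ids("eng_Latn"),
    max_new_tokens=128,
)
doc_en = nllb_tok.batch_decode(out, skip_special_tokens=True)[0]
print(doc_en)
# doc_en can now go through the usual English SPLADE++ pipeline
# (e.g. naver/splade-cocondenser-ensembledistil) for indexing and search.
```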

> Would you consider implementing a multilingual version in a future iteration of SPLADE? I think this would provide immense value to the global community!

> Also, it's not clear to me whether the SPLADE++ methods were used as part of your efficient version. So, it would be great if you could use and compare it with the other methods.

Yes, it is kinda unclear because we submitted both at the same time, so we had to write them as if the other one did not exist. I will take this chance to write an explanation, and I can maybe point people here when I see this question. What we found is that there is a trade-off between them: SPLADE++ was the best we could do (effectiveness-wise) with off-the-shelf LMs at the moment we submitted, while the efficient version trades some of that effectiveness (mostly out-of-domain effectiveness) for efficiency. We still haven't found a way to have the best of both worlds.

The efficiency paper kinda starts from SPLADE++. Actually, the baseline we use there ("improvements i) and ii)") is basically SPLADE++ ensemble distil, but taking the models with fewer FLOPS from Figure 1 in https://arxiv.org/pdf/2205.04733.pdf.

The efficiency paper then introduces 4 actually novel improvements to SPLADE. Out of those 4, 3 actually make it harder to work on out-of-domain data, especially BEIR. Each of the improvements has a different reason for worsening the BEIR score:

1) Improvement 3) suggests separating the document and query encoders (sketched after this list). However, some of the datasets in BEIR, such as Quora, have no separation between what is a document and what is a query. Thus, separating the concepts and using the "wrong" encoder lessens the effectiveness on those datasets.

2) Improvement 5) suggests doing MLM+FLOPS middle-training on MSMARCO. While the MLM+FLOPS training helps a lot in preparing the model for SPLADE, it further specializes the model on MSMARCO, which makes it harder for the model to generalize to BEIR.

3) Improvement 6) suggests changing the query encoder to a more efficient one. This not only worsens the problem we had seen with Improvement 3), it also creates a difference in concepts between the document and query encoders. While this does not affect MSMARCO much, it makes it so that less-used words/unseen concepts point to different parts of the latent space.
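To illustrate what the separate document/query encoders from improvements 3) and 6) look like at inference time, here is a minimal scoring sketch (my own illustration, not the repo's inference code; the Efficient SPLADE checkpoint names are the ones on huggingface as I recall them, so double-check before use):

```python
# Minimal sketch of inference with separate document/query encoders:
# each text is encoded by max-pooling log(1 + ReLU(MLM logits)) over tokens,
# and relevance is the dot product of the two sparse, vocabulary-sized vectors.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def splade_encode(model_name: str, text: str) -> torch.Tensor:
    tok = AutoTokenizer.from_pretrained(model_name)
    mlm = AutoModelForMaskedLM.from_pretrained(model_name)
    with torch.no_grad():
        logits = mlm(**tok(text, return_tensors="pt")).logits  # (1, seq_len, |V|)
    return torch.max(torch.log1p(torch.relu(logits)), dim=1).values.squeeze(0)

doc_rep = splade_encode("naver/efficient-splade-V-large-doc",
                        "SPLADE is a sparse neural retrieval model.")
qry_rep = splade_encode("naver/efficient-splade-V-large-query",
                        "what is splade")

score = torch.dot(doc_rep, qry_rep)  # sparse dot product = relevance score
print(f"score={score.item():.2f}, non-zero doc terms={(doc_rep > 0).sum().item()}")
```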

Note that we would have loved to have included this in the papers, but the main problems were: 1) they were submitted at the same time and thus could not cite each other; 2) at SIGIR22 full-paper submission time, it was not clear how to make them a single study (and they were not ready by then; the extra time of the short paper really helped); 3) as we were constrained by the 4 pages of the short paper, it was hard to find the place to add a detailed explanation such as this one.

nickchomey commented 1 year ago

WOW, what a great response! Thanks so much, not just for having such patience with the outdated and sometimes erroneous (e.g. BEIR for multilingual) thoughts of a relatively novice "practitioner", but for all the explanations and insights!

I'm thrilled to find out about MIRACL and the WSDM Cup 2023 - evidently considerably bigger brains and wallets than mine are keenly working on this very problem! As you explained in wonderful detail re: SPLADE++ vs Efficient SPLADE, it is quite difficult to produce monolingual models that perform well on accuracy, latency and cost (FLOPS and index size), and it's surely all the more difficult with multilingual. I look forward to seeing what comes out!

Given that I've got maybe 12 months until I plan to go to production with my multilingual non-profit educational platform, it seems like my best course of action is to wait for results, papers, scripts and models from the Cup to be released, and to see what comes out. I really hope that there will be a focus on implementations that sacrifice some out-of-domain accuracy in favour of low cost (low latency on a single CPU with a small index size). Not just for the sake of my project and budgetary constraints, but because the bulk of the world (especially the non-English-speaking areas) doesn't really have access to/budgets for GPUs, vector databases, etc...

In the meantime, given that I do plan to use NLLB for providing auto-translation between languages (e.g. Facebook-like click to translate this post/page/document), perhaps I'll experiment with translating all documents to a common English version upon ingestion, and then use something like Efficient SPLADE to enrich the doc. At search time, I can auto-translate all search queries to English to identify the doc and then return it either in the original language or translated into the user's language. When new tools/models/approaches from WSDM Cup 2023 become available (probably in the coming few months?), I can swap them in for Efficient SPLADE.

Would you be able to point me towards any literature on that NLLB + SPLADE++ approach? And perhaps even comparing to other such multilingual approaches (e.g. the doc2query mT5 approach that I suggested, which could then perhaps be further refined at index-time with the mMARCO cross-encoder as suggested by the approach in this paper)?

Thanks again!

cadurosar commented 1 year ago

Sorry, I did not have the time to look into this today, but the MIRACL paper is now online here. It should have been up on arxiv, but I'm having some trouble with it (mostly my fault though)...

nickchomey commented 1 year ago

Thanks so much! It looks like my ideas/plan were already pretty similar to what you ended up doing (though I wasn't going to train any new models).

I'll read it more carefully later, but I notice that it is very... preliminary... I actually like how it is not in the standard research paper format and style, but there are various typos, and the style and grammar could use a lot of improvement.

I'm a native English speaker, so if you like, I would be happy to annotate it with corrections, comments and questions and share a link. Or, if you have a Word document that you can share, that would be even easier.

And congrats on 1st place! I clearly found the right team and model to pay attention to ;) But it will be interesting to see all the other papers when they're available, and I'm sure that some real progress will be made when everyone has had time to review and synthesize it all.

cadurosar commented 1 year ago

Oh, it would be of great help if you can send me an annotated version with corrections/comments/questions. Writing up the results from MIRACL came at the same time as the SIGIR deadline and a long trip to Singapore, so in the end the version of the paper I have now is not that good, as my writing skills (which are not great) were already depleted.

Also, I completely agree with your last paragraph. I still have a lot of questions about how we got 1st place, which is actually more interesting than getting first place itself. There is certainly some advantage linked to the huge brute-forcing we did with the rerankers and to not excluding the less-performing first-stage retrievers. I hope that further analysis of the final results leads the community to better insights into how to deal with this kind of problem. Honestly, I feel like our work is just a hard baseline, but it is not that interesting in the sense of treating this as a multilingual problem.

If you are interested in the other teams' submissions, which I think are much better written and more to the point, I've seen them on arxiv here and here.

nickchomey commented 1 year ago

No need to excuse yourself! You guys did fantastic work!

I'm particularly pleased with how genuinely curious you seem to be - you've been quite open about the fact that it's entirely possible that you overfitted the models or brute-forced the results in some way, or made other errors. I suspect that some of the other teams were not as honest with their papers/approaches.

In the end, again, I'm just thrilled that there's clearly many very bright people with a decent amount of resources (I just have a laptop right now...) who are focused on this topic. I'm sure that we'll see some wonderful advances in the coming year as you all get time to review and synthesize the various approaches, and maybe even combine computing resources to train some robust SOTA models.

But, again, I really hope that you and others will keep a focus on eventually making things as low-compute as possible, such that people anywhere will be able to do all of this on a single (or a few) CPU threads. Perhaps future efforts could produce FULL and LITE versions that require GPU vs CPU.

Anyway, I'll be happy to contribute some editing and feedback to the endeavour (my technical skills are surely not of much help here). I'll send a link for an annotated copy of your paper sometime today.

nickchomey commented 1 year ago

Here's a version that I converted to a Word doc, made edits to, and added some comments to. It's not perfect, but it should be a meaningful improvement. I hope this helps!

NaverMIRACL.docx

cadurosar commented 1 year ago

Hey @nickchomey, thanks a lot for taking the time to help with this. WSDM just finished and I'm going back home, so I will try to put a corrected version up sometime next week. I would like to add you to the acknowledgments; can I add you just as Nick Chomey?

nickchomey commented 1 year ago

Don't worry about it - I don't need any acknowledgement. I hope the conference went well!

thibault-formal commented 1 year ago

hi @nickchomey FYI, we plan to write a long paper version summarizing SPLADE++ and EfficientSPLADE -- hopefully out in a few months. We will let you know!

nickchomey commented 1 year ago

I look forward to it! One question - will it be a more formal summary of what you produced for the competition, or will it build upon what you've already done and incorporate things you've learned from the competition, make it more efficient, etc...?

thibault-formal commented 1 year ago

It will not be related to the competition, but rather a formal and extended paper merging our work on SPLADE++ and EfficientSPLADE (so no multilingual).

nickchomey commented 1 year ago

Ah yes of course - I mistook splade++ for the multilingual models. I look forward to seeing the new paper!

Will there be a more formal multilingual SPLADE paper and model coming out as well, as was mentioned in the initial paper? And will you be looking to apply your merged SPLADE++ and EfficientSPLADE findings to multilingual at some point?

thibault-formal commented 1 year ago

hi @nickchomey, at the moment it's not planned to apply the efficient extensions to multilingual. I'll close the issue; feel free to re-open it!

nickchomey commented 1 year ago

That's unfortunate. I hope you'll change your plans at some point!