nickchomey closed this issue 1 year ago
Thanks, we'll look into it!
We perform stemming (PISA's implementation of Porter2). We found that keeping stopwords improves performance for these models, and that doing so was necessary to replicate the results from the docT5query paper.
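The pipeline described above (stem every token, but do not filter stopwords) can be sketched as follows. This is a minimal illustration, not PISA's actual code: the toy suffix-stripper stands in for a real Porter2 stemmer, and the stopword set is a made-up example.

```python
STOPWORDS = {"the", "is", "of", "a"}  # example list; not filtered out below

def toy_stem(token):
    # Crude suffix stripping as a stand-in for Porter2
    # (a real stemmer handles far more cases).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Stem every token; stopwords are kept rather than removed,
    # matching the choice described above.
    return [toy_stem(t) for t in text.lower().split()]

print(preprocess("the dog is running"))
```

Note that `"the"` and `"is"` survive preprocessing; only the inflectional suffix on `"running"` is stripped.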
I'm familiar with SPLADE -- it builds on the EPIC architecture.
Ultimately, the experiments were carefully designed to fit the limits of the venue's page count.
By the way, since you're interested in multilingual IR, I recommend checking out the TREC NeuCLIR track!
I corrected the two faulty % reductions (good catch!) and submitted the corrected version to arXiv. I checked the other %s in the paper (i.e., those in Table 2), just to be sure.
I also checked and we do already contrast the work with MLM-style expansion. In fact, in the conclusions, we already mention that MLM expansion could be explored in future work.
Thanks for the reply and the tip about TREC NeuCLIR! As it turns out, the WSDM '23 conference starts today, and on Friday they'll be discussing the MIRACL competition, which focuses on multilingual information retrieval.
P.S. It seems that Jimmy Lin, one of the original doc2query authors, is a leading figure in this space and an organizer of that competition; over the past four years he has released all sorts of models and approaches that improve upon docT5query.
So, it looks like we'll have a lot of wonderful innovation in the coming months/year!
I just came across your fantastic paper and have some feedback and suggestions for future work.
First, some nitpicking: you miscalculated the reductions in query execution time and index size:
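For reference, a percent reduction is computed against the original value. The sketch below uses hypothetical numbers, not the paper's actual figures:

```python
def pct_reduction(old, new):
    # Percent reduction going from `old` down to `new`,
    # measured relative to the original value.
    return 100.0 * (old - new) / old

# Hypothetical example: query time drops from 50 ms to 40 ms.
print(pct_reduction(50.0, 40.0))  # 20.0, i.e. a 20% reduction
```

A common mistake is dividing by the new value instead of the old one, which overstates the reduction.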
Suggestions for future work:
Further cleaning the data
More comprehensive benchmarking
I'd suggest using the BEIR benchmark, particularly for out-of-sample/out-of-domain datasets. It seems to be the most comprehensive way to evaluate all of this, and it is what the SPLADE team uses to show how their method improves on docT5query. SPLADE got a lot of attention when Pinecone published an article about using it. Relevant papers about SPLADE:

- https://arxiv.org/pdf/2107.05720.pdf
- https://arxiv.org/pdf/2109.10086.pdf
- https://arxiv.org/pdf/2110.11540.pdf
- https://arxiv.org/pdf/2205.04733.pdf
- https://arxiv.org/pdf/2207.03834v1.pdf
Multilingual
And, more generally, I think there would be a lot of value in exploring the tenets of a data-centric approach to all of this, which advocates for the sort of data cleaning you're doing rather than chasing minor improvements from ever more complex models.
I hope this helps! I really think this approach has enormous potential for providing great IR results at low cost. I'd be happy to chat further about any of it!