nickchomey closed this issue 1 year ago
Thanks, we'll look into it!
We perform stemming (PISA's implementation of Porter2). We found that keeping stopwords improves performance for these models, and that doing so was necessary to replicate the results from the docT5query paper.
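The pipeline described above (stem every token, but do not filter stopwords) can be sketched as follows. This is a minimal illustration, not PISA's actual code: the toy suffix-stripper stands in for a real Porter2 stemmer, and the stopword set is a made-up example.

```python
STOPWORDS = {"the", "is", "of", "a"}  # example list; not filtered out below

def toy_stem(token):
    # Crude suffix stripping as a stand-in for Porter2
    # (a real stemmer handles far more cases).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Stem every token; stopwords are kept rather than removed,
    # matching the choice described above.
    return [toy_stem(t) for t in text.lower().split()]

print(preprocess("the dog is running"))
```

Note that `"the"` and `"is"` survive preprocessing; only the inflectional suffix on `"running"` is stripped.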
I'm familiar with SPLADE -- it builds on the EPIC architecture.
Ultimately, the experiments were carefully designed to fit the limits of the venue's page count.
By the way, since you're interested in multilingual IR, I recommend checking out the TREC NeuCLIR track!
I corrected the two faulty % reductions (good catch!) and submitted the corrected version to arXiv. I checked the other %s in the paper (i.e., those in Table 2), just to be sure.
I also checked and we do already contrast the work with MLM-style expansion. In fact, in the conclusions, we already mention that MLM expansion could be explored in future work.
Thanks for the reply and the tip about TREC NeuCLIR! As it turns out, the WSDM '23 conference starts today, and on Friday they'll be discussing the MIRACL competition, which focuses on multilingual information retrieval.
P.S. It seems that Jimmy Lin, one of the original doc2query authors, is a leading figure in this space and an organizer of that competition; over the past four years he has released all sorts of models and approaches that improve upon docT5query.
So, it looks like we'll have a lot of wonderful innovation in the coming months/year!
I just came across your fantastic paper and have some feedback and suggestions for future work.
First, some nitpicking: you miscalculated the reductions in query execution time and index size:
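For reference, a percent reduction is computed against the original value. The sketch below uses hypothetical numbers, not the paper's actual figures:

```python
def pct_reduction(old, new):
    # Percent reduction going from `old` down to `new`,
    # measured relative to the original value.
    return 100.0 * (old - new) / old

# Hypothetical example: query time drops from 50 ms to 40 ms.
print(pct_reduction(50.0, 40.0))  # 20.0, i.e. a 20% reduction
```

A common mistake is dividing by the new value instead of the old one, which overstates the reduction.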
Suggestions for future work:
Further cleaning the data
More comprehensive benchmarking
I'd suggest using the BEIR benchmark, particularly for out-of-sample/out-of-domain datasets. It seems to be the most comprehensive way to evaluate all of this, and it is what the SPLADE team uses to show how their method improves on docT5query. SPLADE got a lot of attention when Pinecone published an article about using it. Relevant papers about SPLADE:

- https://arxiv.org/pdf/2107.05720.pdf
- https://arxiv.org/pdf/2109.10086.pdf
- https://arxiv.org/pdf/2110.11540.pdf
- https://arxiv.org/pdf/2205.04733.pdf
- https://arxiv.org/pdf/2207.03834v1.pdf
Multilingual
And, more generally, I think there would be a lot of value in exploring the tenets of a data-centric approach to all of this, which advocates for the sort of data cleaning you're doing rather than chasing minor improvements from ever more complex models.
I hope this helps! I really think this approach has enormous potential for providing great IR results at low cost. I'd be happy to chat further about any of it!