terrierteam / pyterrier_doc2query


Some feedback and suggestions about your paper #7

Closed nickchomey closed 1 year ago

nickchomey commented 1 year ago

I just came across your fantastic paper and have some feedback and suggestions for future work.

First, some nitpicking: you miscalculated the percentage reductions in query execution time and index size.
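To be concrete about the arithmetic (the numbers below are made up for illustration, not the paper's figures), a percent reduction should be computed relative to the original value:

```python
def pct_reduction(old: float, new: float) -> float:
    """Percent reduction relative to the ORIGINAL value: (old - new) / old."""
    return 100 * (old - new) / old

# e.g., latency dropping from 100 ms to 64 ms is a 36% reduction,
# not the ~56% you get from the inverted ratio (old - new) / new.
print(pct_reduction(100, 64))  # 36.0
```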

Suggestions for future work:

Further cleaning the data

More comprehensive benchmarking

I'd suggest using the BEIR benchmark, particularly for out-of-sample/out-of-domain datasets. It seems to be the most comprehensive way to evaluate all of this, and it's what the SPLADE team uses to show how their method improves on docT5query. SPLADE got a lot of attention recently when Pinecone published an article about using it. Relevant papers about SPLADE:

- https://arxiv.org/pdf/2107.05720.pdf
- https://arxiv.org/pdf/2109.10086.pdf
- https://arxiv.org/pdf/2110.11540.pdf
- https://arxiv.org/pdf/2205.04733.pdf
- https://arxiv.org/pdf/2207.03834v1.pdf
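To make the suggestion concrete, here's roughly what running one BEIR dataset through PyTerrier could look like (just a sketch; I'm assuming PyTerrier's ir_datasets integration, and the dataset id, index path, and metrics are placeholders):

```python
import pyterrier as pt
pt.init()

# BEIR corpora are exposed through PyTerrier's ir_datasets integration;
# 'beir/scifact' is just one example collection.
corpus = pt.get_dataset('irds:beir/scifact')
test = pt.get_dataset('irds:beir/scifact/test')

# Build a plain inverted index over the corpus (a doc2query expansion
# step could be chained in front of the indexer here).
indexer = pt.IterDictIndexer('./scifact-index')
index_ref = indexer.index(corpus.get_corpus_iter(), fields=['text'])

bm25 = pt.BatchRetrieve(index_ref, wmodel='BM25')

# Evaluate on the dataset's test queries/qrels.
pt.Experiment(
    [bm25],
    test.get_topics(),
    test.get_qrels(),
    eval_metrics=['ndcg_cut_10', 'recip_rank'],
)
```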

Multilingual

And, more generally, I think there would be a lot of value in exploring the tenets of a data-centric approach to all of this, which advocates for the sort of data cleaning you're doing rather than chasing minor improvements from ever more complex models.

I hope this helps! I really think this approach has enormous potential for providing great IR results at low cost. I'd be happy to chat further about any of it!

seanmacavaney commented 1 year ago

Thanks, we'll look into it!

We perform stemming (PISA's implementation of Porter2). We found that keeping stopwords improves performance for these models, and that doing so was necessary to replicate the results from the docT5query paper.
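For reference, the indexing setup is roughly along these lines (a minimal sketch, assuming this repo's Doc2Query transformer and pyterrier_pisa's PisaIndex; the index path and num_samples value are placeholders):

```python
import pyterrier as pt
pt.init()
from pyterrier_doc2query import Doc2Query
from pyterrier_pisa import PisaIndex

# Append generated queries to each document's text before indexing.
doc2query = Doc2Query(append=True, num_samples=5)

# PISA indexing with Porter2 stemming; stopwords are left in the text
# rather than being removed.
index = PisaIndex('./msmarco-passage-d2q.pisa', stemmer='porter2')

index_pipeline = doc2query >> index
# index_pipeline.index(dataset.get_corpus_iter())  # e.g., an MS MARCO corpus iterator
```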

I'm familiar with SPLADE -- it builds on the EPIC architecture.

Ultimately, the experiments were carefully designed to fit the limits of the venue's page count.

By the way, since you're interested in multilingual IR, I recommend checking out the TREC NeuCLIR track!

seanmacavaney commented 1 year ago

I corrected the two faulty % reductions (good catch!) and submitted the corrected version to arXiv. I also checked the other percentages in the paper (i.e., those in Table 2), just to be sure.

I also checked, and we do already contrast the work with MLM-style expansion; in fact, the conclusions mention that MLM expansion could be explored in future work.

nickchomey commented 1 year ago

Thanks for the reply and the tip about TREC NeuCLIR! As it turns out, the WSDM '23 conference starts today, and on Friday they'll be discussing the MIRACL competition, which focuses on multilingual information retrieval.

P.S. It seems that Jimmy Lin, one of the original authors of doc2query, is among the primary people in this space and an organizer of that competition. He's come out with all sorts of models and approaches over the past 4 years that improve upon docT5query.

So, it looks like we'll have a lot of wonderful innovation in the coming months/year!