terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

get_corpus_iter() from a Terrier index #425

Closed: cmacdonald closed this issue 7 months ago

cmacdonald commented 7 months ago

pyterrier_pisa has a get_corpus_iter(), courtesy of @seanmacavaney

It would be useful to have such a function for a Terrier index.

A few options that I can foresee:

  1. Terrier's Index class could expose a single get_corpus_iter() method with a pretokenized option. If pretokenized is true, it returns an iterator based on the direct index; if pretokenized is false, it returns an iterator based on the meta index.

  2. Expose these in the relevant classes:
     a. the direct index exposes a get_corpus_iter(), which is pretokenized
     b. the meta index exposes a get_corpus_iter(), which yields metadata

  3. Both!
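
To make the options concrete, here is a minimal sketch in plain Python. It does not use the real Terrier or PyTerrier API: `ToyMetaIndex`, `ToyDirectIndex` and `ToyIndex` are hypothetical stand-ins, and the `docno`/`text`/`toks` field names are assumptions chosen for illustration. The two component classes each expose their own `get_corpus_iter()` (option 2), and the wrapper dispatches on a `pretokenized` flag (option 1), so together they correspond to option 3.

```python
from typing import Dict, Iterator, List


class ToyMetaIndex:
    """Hypothetical stand-in for a meta index: one metadata record per document."""

    def __init__(self, records: List[Dict[str, str]]):
        self._records = records

    def get_corpus_iter(self) -> Iterator[Dict[str, str]]:
        # Option 2b: yield the stored metadata for each document.
        for record in self._records:
            yield dict(record)


class ToyDirectIndex:
    """Hypothetical stand-in for a direct index: term frequencies per document."""

    def __init__(self, docnos: List[str], term_freqs: List[Dict[str, int]]):
        self._docnos = docnos
        self._term_freqs = term_freqs

    def get_corpus_iter(self) -> Iterator[Dict[str, object]]:
        # Option 2a: yield pretokenized documents (term -> frequency).
        for docno, toks in zip(self._docnos, self._term_freqs):
            yield {"docno": docno, "toks": dict(toks)}


class ToyIndex:
    """Hypothetical wrapper holding both structures (option 1)."""

    def __init__(self, meta: ToyMetaIndex, direct: ToyDirectIndex):
        self.meta = meta
        self.direct = direct

    def get_corpus_iter(self, pretokenized: bool = False) -> Iterator[dict]:
        # Dispatch to the direct index when pretokenized, else to the meta index.
        source = self.direct if pretokenized else self.meta
        return source.get_corpus_iter()


index = ToyIndex(
    ToyMetaIndex([{"docno": "d1", "text": "hello world"},
                  {"docno": "d2", "text": "hello hello"}]),
    ToyDirectIndex(["d1", "d2"], [{"hello": 1, "world": 1}, {"hello": 2}]),
)

for doc in index.get_corpus_iter(pretokenized=True):
    print(doc)
```

With this shape, callers who only hold the top-level index use the flag, while callers who already hold one of the component structures can iterate it directly.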

seanmacavaney commented 7 months ago

Sounds great! The same method exists elsewhere, too, e.g., FlexIndex.get_corpus_iter() in pyterrier_dr.

cmacdonald commented 7 months ago

So option 3?

seanmacavaney commented 7 months ago

I'd like the default to be to return everything that's available. If there's a case where that isn't needed, we can add options later.
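
One way to read "return everything that's available" is to merge whatever per-document sources exist into a single record. The sketch below is an illustration only, not what was merged: `corpus_iter_all` is a hypothetical helper, the inputs are plain Python lists rather than Terrier index structures, and the `toks` field name is an assumption.

```python
from typing import Dict, Iterable, Iterator, Optional


def corpus_iter_all(
    meta_records: Iterable[Dict[str, str]],
    term_freqs: Optional[Iterable[Dict[str, int]]] = None,
) -> Iterator[Dict[str, object]]:
    """Yield one dict per document containing every field that is available.

    Hypothetical helper: metadata is always included, and a 'toks' field is
    added only when term frequencies are supplied.
    """
    if term_freqs is None:
        # Only metadata is available, so yield it unchanged.
        for record in meta_records:
            yield dict(record)
        return
    # Both sources are available: assume they are aligned by document order.
    for record, toks in zip(meta_records, term_freqs):
        merged: Dict[str, object] = dict(record)
        merged["toks"] = dict(toks)
        yield merged


meta = [{"docno": "d1", "text": "hello world"}]
freqs = [{"hello": 1, "world": 1}]

full_docs = list(corpus_iter_all(meta, freqs))
meta_only_docs = list(corpus_iter_all(meta))
```

Options to restrict the output could then be layered on later without changing the default.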

cmacdonald commented 7 months ago

#430 merged