Open biniyoni opened 1 week ago
Hi. I don't think there is a getTerms() method in the direct index classes.
There's a good example here: https://pyterrier.readthedocs.io/en/latest/terrier-index-api.html#what-terms-occur-in-the-11th-document
What might be even easier for your use case is index.get_corpus_iter(), as documented at https://pyterrier.readthedocs.io/en/latest/terrier-index-api.html#can-i-get-the-index-as-a-corpus-iter
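A rough sketch of how index.get_corpus_iter() might be used, assuming a recent PyTerrier release with Java available and an index that has a direct index. The function names `iter_index_docs` and `top_terms` are made up for illustration, and the entry shape (`docno` plus a `toks` dict of term frequencies) is an assumption based on the linked documentation page:

```python
def top_terms(doc, k=3):
    """Return the k most frequent (term, frequency) pairs of one entry.

    `doc` is assumed to look like {'docno': 'd1', 'toks': {'the': 2, 'a': 1}}.
    Ties are broken alphabetically so the result is deterministic.
    """
    return sorted(doc["toks"].items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def iter_index_docs(index_path="./my_index.terrier"):
    """Yield one dict per indexed document, read back from the index itself."""
    import pyterrier as pt  # imported lazily: needs PyTerrier and Java

    index = pt.IndexFactory.of(index_path)
    yield from index.get_corpus_iter()


# Hypothetical entry, used only to show what top_terms does with it.
example = {"docno": "d1", "toks": {"the": 2, "a": 1, "index": 2}}
print(top_terms(example, k=2))
```

Note that this gives indexed terms and frequencies, not the original text.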
Craig
Thank you so much Craig
Are you looking for the raw text though, rather than just the indexed terms and frequencies?
For reranking, I need the document text along with the corresponding document ID. I want to extract the documents with their doc ID from the indexed file without needing to download the raw documents. If there's a way to do this, I would greatly appreciate it if you could let me know!
Yup! If your index includes the text metadata, you can build a text lookup transformer using:
index = pt.IndexRef.of('./my_index.terrier')
text_loader = pt.text.get_text(index)
Note that you'll need to index with the meta fields that you want to store. There's an example here.
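A minimal sketch of indexing with stored metadata, assuming a recent PyTerrier release (older releases need pt.init() to be called first). The `toy_docs` generator, the `build_index` helper and the field lengths of 20 and 4096 characters are illustrative choices, not values from this thread:

```python
def toy_docs():
    """Yield documents as dicts; 'docno' and 'text' are the fields to store."""
    yield {"docno": "A", "text": "text of document A"}
    yield {"docno": "B", "text": "text of document B"}
    yield {"docno": "C", "text": "text of document C"}


def build_index(index_path="./my_index.terrier"):
    """Index toy_docs(), keeping docno and text in the index metadata."""
    import pyterrier as pt  # imported lazily: needs PyTerrier and Java

    # meta maps each stored field to its maximum length in characters;
    # longer values are truncated in the metadata store.
    indexer = pt.IterDictIndexer(index_path, meta={"docno": 20, "text": 4096})
    return indexer.index(toy_docs())
```

Calling build_index() should return a reference to the new index, which can then be passed to pt.text.get_text.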
The text_loader transformer adds the text columns for the metadata based on the docno. So an input DataFrame of:
docno | score |
---|---|
A | 0.250 |
B | 0.80 |
C | 0.420 |
Will give an output DataFrame of:
docno | score | text |
---|---|---|
A | 0.250 | text from index of document A |
B | 0.80 | text from index of document B |
C | 0.420 | text from index of document C |
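Conceptually, the tables above amount to a lookup keyed on docno. Here is a pure-Python sketch of that idea; it is not PyTerrier's implementation. `stored_text` is a hypothetical stand-in for the index metadata, and using None for a docno that is not stored is a choice of this sketch only:

```python
def add_text(rows, stored_text):
    """Return copies of rows with a 'text' field looked up by docno."""
    return [{**row, "text": stored_text.get(row["docno"])} for row in rows]


# Hypothetical metadata store: docno -> text saved at indexing time.
stored_text = {
    "A": "text from index of document A",
    "B": "text from index of document B",
    "C": "text from index of document C",
}

results = [
    {"docno": "A", "score": 0.250},
    {"docno": "B", "score": 0.80},
    {"docno": "C", "score": 0.420},
]

for row in add_text(results, stored_text):
    print(row["docno"], row["score"], row["text"])
```

The scores and row order pass through unchanged; only the text column is added.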
(get_text also works with datasets loaded from ir_datasets, but that doesn't look like it's the case here.)
There's more information on this page: https://pyterrier.readthedocs.io/en/latest/text.html
Thank you, Sean! I greatly appreciate your reply.
Hello everyone,
I am currently working with the PyTerrier framework and have encountered an issue while trying to access document content along with the corresponding document IDs after indexing a dataset.
For example, I have successfully indexed the Vaswani dataset, but I am facing challenges when attempting to retrieve the documents along with their corresponding docno (document ID). Below is the code I have used: