terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

How to access the Indexed Trec file documents with their corresponding doc_id? #492

Open biniyoni opened 1 week ago

biniyoni commented 1 week ago

Hello everyone,

I am currently working with the PyTerrier framework and have run into a problem when trying to access document content together with the corresponding document IDs after indexing a dataset.

For example, I have successfully indexed the Vaswani dataset, but I am facing challenges when attempting to retrieve the documents along with their corresponding docno (document ID). Below is the code I have used:


import os
import pyterrier as pt

# Initialize PyTerrier
if not pt.started():
    pt.init()

# Get the dataset
dataset = pt.get_dataset("vaswani")

# Set index path (absolute) and ensure the directory exists
index_path = r"c:\var\index_vaswani"
os.makedirs(index_path, exist_ok=True)

# Create the indexer with term positions (blocks=True) and specify encoding
indexer = pt.TRECCollectionIndexer(index_path, blocks=True, properties={"trec.encoding": "UTF-8"})

# Index the dataset
indexref = indexer.index(dataset.get_corpus())

# Get the index reference and print statistics
index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().toString())

#---------------------------------------------------------------------------------------------------------------
# Accessing the documents
# Get the MetaIndex for accessing docno and DirectIndex for document content
meta = index.getMetaIndex()
direct = index.getDirectIndex()

# Get the number of documents in the index
num_docs = index.getCollectionStatistics().getNumberOfDocuments()

# Iterate over all documents in the index by their internal docid
for docid in range(num_docs):
    # Access the docno (document ID) from the MetaIndex
    docno = meta.getItem("docno", docid)

    # Access the document content from the DirectIndex (returns an iterator over terms)
    terms = direct.getTerms(docid)

    # Convert the terms iterator to a list of terms
    text = ' '.join(str(term) for term in terms)

    # Print the document ID and a snippet of the document content
    print(f"Document ID: {docno}")
    print(f"Document Text: {text[:100]}...")  # Limiting output to 100 characters for readability
#-------------------------------------------------------------------------------------------------------------------
Here is the error I am encountering: 

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[69], line 24
     21 docno = meta.getItem("docno", docid)
     23 # Access the document content from the DirectIndex (returns an iterator over terms)
---> 24 terms = direct.getTerms(docid)
     26 # Convert the terms iterator to a list of terms
     27 text = ' '.join(str(term) for term in terms)

AttributeError: 'org.terrier.structures.PostingIndex' object has no attribute 'getTerms'
#-----------------------------------------------------------------------------------------------------------------------------------

So is there any way to access the indexed documents with their corresponding doc_id?

I would greatly appreciate your guidance. Thank you for your assistance!

cmacdonald commented 1 week ago

Hi. I don't think there is a getTerms() method in the direct index classes.

There's a good example here: https://pyterrier.readthedocs.io/en/latest/terrier-index-api.html#what-terms-occur-in-the-11th-document
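
As a rough sketch of the approach in that linked example: the direct index exposes postings rather than terms, so each posting's term id has to be resolved through the lexicon. The helper name doc_term_frequencies is made up for illustration, and `index` is assumed to be the object returned by pt.IndexFactory.of(indexref) in the original post.

```python
def doc_term_frequencies(index, docid):
    """Return {term: frequency} for one internal (0-based) docid.

    Hypothetical helper: `index` is assumed to be a Terrier index opened
    with pt.IndexFactory.of(...) that was built with a direct index.
    """
    direct = index.getDirectIndex()
    doc_index = index.getDocumentIndex()
    lexicon = index.getLexicon()

    # The direct index is addressed by a document entry, not a bare docid.
    postings = direct.getPostings(doc_index.getDocumentEntry(docid))
    if postings is None:
        # Empty documents have no postings.
        return {}

    freqs = {}
    for posting in postings:
        # Map the numeric term id back to its string form via the lexicon.
        entry = lexicon.getLexiconEntry(posting.getId())
        freqs[entry.getKey()] = posting.getFrequency()
    return freqs
```

Note that this yields the indexed tokens (stemmed and stopword-filtered under the default configuration), with no word order, rather than the original text.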

What might be even easier for your use case is index.get_corpus_iter(), as documented at https://pyterrier.readthedocs.io/en/latest/terrier-index-api.html#can-i-get-the-index-as-a-corpus-iter
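
A minimal sketch of how that iterator might be consumed, assuming each item is a dict with a 'docno' key and a 'toks' mapping of term to frequency, as described on the linked page. The helper name docs_as_term_strings and its limit parameter are illustrative only.

```python
def docs_as_term_strings(index, limit=None):
    """Yield (docno, space-joined indexed terms) for each document.

    Hypothetical helper: `index` is assumed to be a PyTerrier index object
    whose get_corpus_iter() yields dicts like
    {'docno': 'd1', 'toks': {'a': 1, 'the': 2}}.
    """
    for i, doc in enumerate(index.get_corpus_iter()):
        if limit is not None and i >= limit:
            break
        tokens = []
        for term, freq in doc["toks"].items():
            # Repeat each term by its frequency; original word order is not stored.
            tokens.extend([term] * freq)
        yield doc["docno"], " ".join(tokens)
```

With the index from the original post, something like `for docno, terms in docs_as_term_strings(index, limit=5): print(docno, terms[:100])` would print a short snippet per document. The strings are bags of indexed terms, not the raw document text.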

Craig

biniyoni commented 1 week ago

Thank you so much Craig

cmacdonald commented 1 week ago

Are you looking for the raw text though, rather than just the indexed terms and frequencies?

biniyoni commented 1 week ago

For reranking, I need the document text along with the corresponding document ID. I want to extract the documents with their doc ID from the indexed file without needing to download the raw documents. If there's a way to do this, I would greatly appreciate it if you could let me know!

seanmacavaney commented 1 week ago

Yup! If your index includes the text metadata, you can build a text lookup transformer using:

index = pt.IndexRef.of('./my_index.terrier')
text_loader = pt.text.get_text(index)

Note that you'll need to index with the meta fields that you want to store. There's an example here.
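
A hedged sketch of what indexing with a stored text field could look like for a TREC-formatted corpus such as Vaswani. The meta and meta_tags arguments follow the pattern shown in the PyTerrier text documentation; the helper names and the length limits (20 and 4096) are assumptions chosen for illustration.

```python
def text_meta_options(docno_len=20, text_len=4096):
    """Keyword arguments asking the indexer to store docno and text as metadata.

    The lengths are illustrative assumptions; pick values that fit your corpus.
    """
    return {
        # Metadata keys to store, with the maximum stored length of each value.
        "meta": {"docno": docno_len, "text": text_len},
        # Which TREC tags feed the 'text' key; ELSE means any tag
        # not consumed by another key.
        "meta_tags": {"text": "ELSE"},
    }


def build_index_with_text(index_path, corpus_files):
    """Index TREC files so the document text is kept in the meta index."""
    # Imported lazily so text_meta_options() can be used without PyTerrier.
    import pyterrier as pt

    if not pt.started():
        pt.init()
    indexer = pt.TRECCollectionIndexer(index_path, **text_meta_options())
    return indexer.index(corpus_files)
```

For the dataset in the original post, this might be called as `build_index_with_text(index_path, dataset.get_corpus())`. An index built without these options does not contain the text, so it would need to be rebuilt.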

The text_loader transformer adds the text columns for the metadata based on the docno. So an input DataFrame of:

docno  score
A      0.250
B      0.80
C      0.420

Will give an output DataFrame of:

docno  score  text
A      0.250  text from index of document A
B      0.80   text from index of document B
C      0.420  text from index of document C

(get_text also works with datasets loaded from ir_datasets, but that doesn't look like it's the case here.)

There's more information on this page: https://pyterrier.readthedocs.io/en/latest/text.html
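
To make the docno-to-text lookup described above concrete, here is a rough sketch that does the same thing by hand against the Terrier meta index. It is not how pt.text.get_text is implemented. It assumes the index was built with a 'text' metadata field and with reverse lookup enabled on 'docno', and the helper name attach_text is hypothetical.

```python
def attach_text(index, rows, field="text"):
    """Return copies of `rows` (dicts with a 'docno' key) with `field` added.

    Hypothetical helper: `index` is assumed to be a Terrier index object
    whose meta index stores `field` and supports reverse lookup on 'docno'.
    """
    meta = index.getMetaIndex()
    out = []
    for row in rows:
        # Reverse lookup: external docno -> internal docid (-1 if unknown).
        docid = meta.getDocument("docno", row["docno"])
        text = meta.getItem(field, docid) if docid >= 0 else None
        out.append({**row, field: text})
    return out
```

In a reranking pipeline the transformer form is usually more convenient, for example composing a retriever with the loader as `pt.BatchRetrieve(index, wmodel="BM25") >> pt.text.get_text(index, "text")`, passing the metadata field name explicitly.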

biniyoni commented 1 week ago

Thank you, Sean! I greatly appreciate your reply.