terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

How to Retrieve Document Text Using Its ID from an Index? #444

Open eyasu11321238a opened 2 months ago

eyasu11321238a commented 2 months ago

Hey guys. I encountered an issue while attempting to retrieve text documents from my indexed TREC file to compute the log-likelihood probability. Specifically, when I run the get_document_text function, I receive the following error:

AttributeError: 'org.terrier.querying.IndexRef' object has no attribute 'getDocuments'

Here are the functions I'm using:

    # Function to fetch the document text using the document ID from the index
    def get_document_text(doc_id, index):
        metaindex = index.getMetaIndex()
        doc_id_int = metaindex.getDocumentEntry("docno", doc_id)
        document_text = metaindex.getItem("text", doc_id_int)
        return document_text

    # Define NTLM scorer
    def ntlm_scorer(row):
        query_terms = row['query'].split()
        doc_id = row['docno']
        doc_text = get_document_text(doc_id, index)  # Ensure doc_id is an integer
        score = compute_log_likelihood_score(query_terms, doc_text, word_embeddings)
        return score

    # Initial retrieval using DirichletLM
    Dirichlet = pt.BatchRetrieve(
        index_path,
        wmodel="DirichletLM",
        controls={'dirichletlm.mu': 1500},
        verbose=True,
    )

    # Chaining NTLM scoring
    pipeline = Dirichlet >> pt.apply.doc_score(ntlm_scorer, verbose=True)

Request: Could you please provide suggestions on how to resolve this error? Any guidance on what might be causing this and how to properly fetch document text from the index would be greatly appreciated.

Thank you!

cmacdonald commented 2 months ago

Can you paste the full stack trace? Your error is mystifying, as it mentions getDocuments but your code does not. Are you sure the code is in sync with what you are executing?

eyasu11321238a commented 2 months ago

Hey Craig, thanks for your reply. I checked the getDocument function: I was using getDocumentEntry instead of getDocument. It works now; that was the problem.
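For anyone landing here later, the fix described above might look roughly like the sketch below. It assumes that `index` is a loaded Terrier index object (not an `IndexRef` or a path string), that the document text was stored under a metadata key named "text" at indexing time, and that "docno" supports reverse lookup. The `text_key` parameter and the `None` return for unknown docnos are illustrative additions, not part of the original code.

```python
def get_document_text(doc_id, index, text_key="text"):
    """Return the stored text for a docno, or None if the docno is unknown.

    Assumes `index` exposes Terrier's getMetaIndex(), and that `text_key`
    was saved as a metadata field when the index was built.
    """
    metaindex = index.getMetaIndex()
    # getDocument maps a metadata value (the docno string) to the
    # internal integer docid; a negative value means "not found".
    docid = metaindex.getDocument("docno", doc_id)
    if docid < 0:
        return None
    # getItem reads one metadata field for that integer docid.
    return metaindex.getItem(text_key, docid)
```

If you only have an index path or an `IndexRef`, loading it first with `pt.IndexFactory.of(...)` and passing the result as `index` should give an object that has `getMetaIndex()`.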