terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

How to get document frequency #496

Closed: eyasu11321238a closed this issue 7 hours ago

eyasu11321238a commented 8 hours ago

Hello,

I’m currently using PyTerrier and have implemented two functions to retrieve term-related statistics from an index. Below are the functions I’m working with:

def get_term_collection_freq(term, index):
    lexicon = index.getLexicon()
    lexicon_entry = lexicon.getLexiconEntry(term)
    return lexicon_entry.getFrequency() if lexicon_entry else 0

def get_document_frequency(term, index):
    lexicon = index.getLexicon()
    lexicon_entry = lexicon.getLexiconEntry(term)
    return lexicon_entry.getDocumentFrequency() if lexicon_entry else 0

The first function get_term_collection_freq() works as expected, providing the collection frequency of a term. However, the second function get_document_frequency() does not seem to work because the lexicon_entry object does not have a method getDocumentFrequency().

Could you please confirm if there is an equivalent method for getting the document frequency of a term from the lexicon or suggest an alternative way to retrieve it?

Thank you for your help!

seanmacavaney commented 7 hours ago

Hi @eyasu11321238a

I tried the code above and it seems to work fine for me on a basic example (no error with getDocumentFrequency):

!pip install -U python-terrier

import pyterrier as pt

index_ref = pt.terrier.IterDictIndexer('./test.terrier').index([
    {'docno': '1', 'text': 'hello world hello world'},
    {'docno': '2', 'text': 'hello'},
])
index = pt.IndexFactory.of(index_ref)

def get_term_collection_freq(term, index):
    lexicon = index.getLexicon()
    lexicon_entry = lexicon.getLexiconEntry(term)
    return lexicon_entry.getFrequency() if lexicon_entry else 0

def get_document_frequency(term, index):
    lexicon = index.getLexicon()
    lexicon_entry = lexicon.getLexiconEntry(term)
    return lexicon_entry.getDocumentFrequency() if lexicon_entry else 0

get_term_collection_freq("hello", index)
3
get_term_collection_freq("world", index)
2
get_term_collection_freq("oov", index)
0

get_document_frequency("hello", index)
2
get_document_frequency("world", index)
1
get_document_frequency("oov", index)
0

Can you try this example and see if it works on your machine?
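
For readers comparing the two statistics: collection frequency counts every occurrence of a term across the index, while document frequency counts the documents that contain it at least once. The sketch below recomputes both for the two toy documents in plain Python, without Terrier. The helper name term_statistics and the lowercase-and-split tokenization are assumptions made for illustration; Terrier's own tokenizer and term pipeline may treat other text differently.

```python
from collections import Counter

# The same two toy documents used in the example above.
docs = [
    {'docno': '1', 'text': 'hello world hello world'},
    {'docno': '2', 'text': 'hello'},
]

def term_statistics(docs):
    """Return (collection_frequency, document_frequency) as Counters.

    Hypothetical helper: tokenization is a simple lowercase + whitespace
    split, which is an assumption and not Terrier's actual pipeline.
    """
    cf = Counter()  # total occurrences of each term
    df = Counter()  # number of documents containing each term
    for doc in docs:
        tokens = doc['text'].lower().split()
        cf.update(tokens)        # every occurrence counts
        df.update(set(tokens))   # each document counts at most once per term
    return cf, df

cf, df = term_statistics(docs)
print(cf['hello'], cf['world'], cf['oov'])  # 3 2 0
print(df['hello'], df['world'], df['oov'])  # 2 1 0
```

A Counter returns 0 for a missing key, which mirrors the "else 0" fallback for out-of-vocabulary terms in the functions above.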

eyasu11321238a commented 7 hours ago

@seanmacavaney, thank you for your response. Both functions work perfectly now after upgrading with !pip install -U python-terrier. I appreciate your help!

cmacdonald commented 7 hours ago

I'm not sure what the problem would have been in an older version of PyTerrier, as this functionality is quite old. In future, if you can post the error message, it helps us understand the problem :-)

seanmacavaney commented 7 hours ago

@eyasu11321238a Can you share the version that you had installed previously? It might help us track the issue down.
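
One way to gather both pieces of information requested here (the installed version and the full error message) is sketched below using only the standard library. The helper names installed_version and failure_report, and the FakeEntry stand-in class, are hypothetical and exist only for illustration; in practice the failing call would be made against a real lexicon entry.

```python
import traceback
from importlib.metadata import PackageNotFoundError, version

def installed_version(package="python-terrier"):
    """Return the installed version of a package, or a note if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"

def failure_report(fn, *args):
    """Call fn(*args) and return text suitable for pasting into an issue.

    Hypothetical helper: it records the python-terrier version and, if the
    call raises, the full traceback.
    """
    header = f"python-terrier version: {installed_version()}"
    try:
        fn(*args)
    except Exception:
        return f"{header}\n{traceback.format_exc()}"
    return f"{header}\nno error raised"

class FakeEntry:
    """Stand-in for an object that lacks getDocumentFrequency()."""

# Calling a missing method raises AttributeError, which the report captures.
report = failure_report(lambda entry: entry.getDocumentFrequency(), FakeEntry())
print(report)
```

Pasting the printed report into the issue gives maintainers the exception type, the message, and the version in one place.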

eyasu11321238a commented 7 hours ago

@seanmacavaney The previous version was 0.10.0, and I have now upgraded to 0.11.0.

seanmacavaney commented 6 hours ago

Thanks!