kbalog / uis-dat640-fall2020

Information Retrieval and Text Mining course at the University of Stavanger (DAT640), 2020 fall

A5 prepare_ltr_training_data #10

Open ChristofferHolmesland opened 4 years ago

ChristofferHolmesland commented 4 years ago

Has anyone been able to pass the X_train[0] test?

I'm unable to get the correct IDF values (the tests on the toy index are OK). The first query in TRAIN_QUERY_IDS is "Death, sudden", which gives me IDF values [3.2432702900360164, 5.0103371466737].

Another issue I found is that the analyze_query function doesn't always find the correct document ID. For example, the term "hallucinations" is supposedly in document 87125370, but when I look at the term vectors, the closest term is "hallucin". What is the proper way of handling this? Ignore the terms? Assume a document frequency of 1? Other terms with the same issue: ray, densitometry, osteophytosis, elastomers, silicone, ...

FebriantiW commented 4 years ago

For the IDF, you should use the total number of documents in the index, which is 54709, and not the document count from the search hits, because that only counts documents that have a body field.

I didn't look closely at the search-result-based analyze_query, so I can't answer the second question. Anyone?

ChristofferHolmesland commented 4 years ago

@FebriantiW wrote: "for the idf, it should use the count of whole documents in the index which is 54709"

That fixed my issue, thank you! I assumed it was correct to use the document count from the search because we are calculating the IDF based on the body field.
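For anyone comparing numbers, here is a minimal sketch of why the choice of N matters. It assumes the IDF is computed as log(N / df) with the natural logarithm; the thread does not state the exact formula, so check it against the notebook. The values `N_BODY_HITS` and `DF` are made up for illustration; only 54709 comes from this thread.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """IDF as log(N / df). The exact formula is an assumption, not taken from the notebook."""
    if doc_freq <= 0:
        raise ValueError("doc_freq must be positive")
    return math.log(num_docs / doc_freq)


N_INDEX = 54709      # total number of documents in the index (from this thread)
N_BODY_HITS = 50000  # hypothetical: number of documents that have a body field
DF = 100             # hypothetical: document frequency of a term in the body field

idf_index = idf(DF, N_INDEX)     # N = whole collection
idf_hits = idf(DF, N_BODY_HITS)  # N = search hits only, gives a smaller value

print(idf_index, idf_hits)
```

With the same document frequency, a smaller N always gives a smaller IDF, so using the hit count instead of the index size shifts every value.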

ChristofferHolmesland commented 4 years ago

If anyone else is wondering about the analyze_query issue, you can ignore the terms and still pass every test in the notebook.
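A rough sketch of what "ignore the terms" can look like, assuming you already have a mapping from index terms to document frequencies (for example, collected from the term vectors). The helper `known_terms` and the toy frequencies are hypothetical and not part of the notebook.

```python
from typing import Dict, List


def known_terms(query_terms: List[str], doc_freqs: Dict[str, int]) -> List[str]:
    """Keep only the query terms that actually occur in the index vocabulary."""
    return [term for term in query_terms if doc_freqs.get(term, 0) > 0]


# Hypothetical document frequencies; the index stores "hallucin", not "hallucinations".
toy_doc_freqs = {"death": 120, "sudden": 40, "hallucin": 3}

terms = known_terms(["death", "sudden", "hallucinations"], toy_doc_freqs)
print(terms)  # ['death', 'sudden']
```

Terms that have no entry in the term vectors are simply dropped before any features are computed.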

thek123 commented 4 years ago

@FebriantiW wrote: "for the idf, it should use the count of whole documents in the index which is 54709 and not document count based upon search hits because it only counts document with field body. i didn't zoom in on the search result based analyze_query so cannot answer on the 2nd one. anyone ?"

I might be wrong, but it said it will calculate the IDF based on the 'body' field only,

and that does not hold for 'trec9_index' if you want to pass the test.

FebriantiW commented 4 years ago

I got the same issue/numbers as @ChristofferHolmesland, then later realised that the IDF formula considers the entire document collection as N. It should pass the test even if you get the term's document count from the body field only.