Open ChristofferHolmesland opened 4 years ago
For the IDF, you should use the count of all documents in the index, which is 54709, and not the document count based on search hits, because the latter only counts documents that have a body field.
I didn't look closely at the search-result-based analyze_query, so I can't answer the second question. Anyone?
> for the idf, it should use the count of whole documents in the index which is 54709

@FebriantiW That fixed my issue, thank you! I assumed it was correct to use the document count from the search because we are calculating the IDF based on the body field.
If anyone else is wondering about the `analyze_query` issue, you can ignore those terms and still pass every test in the notebook.
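A minimal sketch of what "ignoring the terms" could look like, assuming you already have a mapping from indexed terms to their document frequency. The helper name `keep_indexed_terms` and the counts in `doc_freqs` are hypothetical and not taken from the notebook:

```python
def keep_indexed_terms(query_terms, doc_freqs):
    """Keep only the query terms that have a positive document frequency."""
    return [term for term in query_terms if doc_freqs.get(term, 0) > 0]


# Hypothetical document frequencies; real values would come from the index.
doc_freqs = {"hallucin": 12, "death": 300}

# "hallucinations" is not an indexed term (only "hallucin" is), so it is dropped.
kept = keep_indexed_terms(["hallucinations", "death"], doc_freqs)
```

Dropping a term this way means it contributes nothing to the query features, which is different from assuming a document frequency of 1.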
> for the idf, it should use the count of whole documents in the index which is 54709 and not document count based upon search hits because it only counts document with field body.
> i didn't zoom in on the search result based analyze_query so cannot answer on the 2nd one. anyone ?

I might be wrong, but it said the IDF would be calculated based on the 'body' field only, and that does not hold for 'trec9_index' if the test is to pass.
I got the same issue and numbers as @ChristofferHolmesland, then later realised that the IDF formula takes the entire document collection as N. The test should pass even if you get the term's document count from the body field only.
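To illustrate why the choice of N matters, here is a rough sketch. It assumes the IDF is defined as log(N / df) with the natural logarithm; the notebook may use a different base or smoothing. The document frequency and the search-hit count below are made-up numbers, and only 54709 comes from this thread:

```python
import math


def idf(doc_freq, total_docs):
    """Assumed IDF form: log(N / df). Returns 0.0 for terms with no documents."""
    if doc_freq <= 0:
        return 0.0
    return math.log(total_docs / doc_freq)


N_INDEX = 54709   # total number of documents in the index (from this thread)
N_HITS = 40000    # hypothetical count of documents that have a body field
DF_BODY = 100     # hypothetical document frequency of a term in the body field

idf_with_index_count = idf(DF_BODY, N_INDEX)
idf_with_hit_count = idf(DF_BODY, N_HITS)
```

With the same document frequency, the smaller N from the search hits gives a lower IDF, which would explain values that are consistently below what the test expects.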
Has anyone been able to pass the `X_train[0]` test? I'm unable to get the correct IDF values (the tests on the toy index are OK). The first query in `TRAIN_QUERY_IDS` is `Death, sudden`, giving me IDF values `[3.2432702900360164, 5.0103371466737]`.

Another issue I found is that the `analyze_query` function doesn't always find the correct document id. For example, the term `hallucinations` is supposedly in document `87125370`. When I look at the term vectors, the closest term is `hallucin`. What is the proper way of handling this? Ignore the terms? Assume a document frequency of 1? Other terms with the same issue: ray, densitometry, osteophytosis, elastomers, silicone, ...