Closed moritzschaefer closed 8 years ago
I seriously don't know what is happening here. I believe the problem is with the length term (math.sqrt(doc_length[key]) * math.sqrt(query_length[key])) that I divide by so every score has unit length. Or maybe we could normalize by the maximum of all scores instead. I should meditate on this.
Is it important to normalize? Are you normalizing the right way? ;)
It is very important to normalize. I believed I was doing it the right way, but now I'm not sure :(
This is the cosine formula. Note that both the AND and the OR paths use doc_length and query_length for normalization; compare both against the formula.
EDIT: Try using only doc_length and see what happens. I just checked some code online and it doesn't normalize the query weight. Let me know what happens.
and_scored_docs[key] = and_scored_docs[key] / math.sqrt(doc_length[key])
or_scored_docs[key] = or_scored_docs[key] / math.sqrt(doc_length[key])
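To make the two normalization variants being debated here easier to compare, this is a minimal sketch of both in one helper. The names (scored_docs, doc_length, query_length) mirror the snippets in this thread; the real signatures in the repo may differ, and the lengths are assumed to hold sums of squared weights, so their square roots are the vector norms:

```python
import math

def cosine_normalize(scored_docs, doc_length, query_length=None):
    """Cosine-normalize raw dot-product scores.

    scored_docs[key] holds the dot product of the query and document
    tf-idf vectors.  Pass query_length=None to try the doc-only
    normalization suggested in the EDIT above.
    """
    normalized = {}
    for key, score in scored_docs.items():
        norm = math.sqrt(doc_length[key])
        if query_length is not None:
            # Full cosine: divide by both vector norms.
            norm *= math.sqrt(query_length[key])
        normalized[key] = score / norm
    return normalized
```

With both norms supplied, a score equal to the product of the norms comes out as exactly 1, which is the symptom described later in this thread.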
Check test/test_ranker.py.
I asked this once before, I think: why are all the values so close to 1? I think the behaviour is reflected in our tests, so it can be fixed through the tests:
Please experiment with the tests to get values that differ from ~1.
The values are close to 1 because the test case is almost identical to the dictionary file, so the scores come out very high. Example: the query is "query to evaluate" and the dict is "query to evaluate". The only difference is that not all docs contain all terms. I will write some more tests on this to further analyze the behavior.
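The effect described here can be checked directly: when query and document contain exactly the same terms with the same weights, the two vectors are parallel and the cosine is exactly 1. A small worked example with hypothetical tf-idf weights (the numbers are made up for illustration):

```python
import math

# Hypothetical tf-idf weights for the query "query to evaluate".
query = {'query': 0.5, 'to': 0.3, 'evaluate': 0.8}
# A document containing exactly the same terms with the same weights:
doc = dict(query)

dot = sum(w * doc.get(term, 0.0) for term, w in query.items())
norm_q = math.sqrt(sum(w * w for w in query.values()))
norm_d = math.sqrt(sum(w * w for w in doc.values()))
cosine = dot / (norm_q * norm_d)
# Parallel vectors give cosine 1 (up to floating point), which is why
# tests built from near-identical texts cluster around 1.
```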
Thank you!
So you think there is no relation between the 1s we get on the server and the close-to-1s in the tests?
People do this in so many different ways online; every blog post approaches it differently. The cosine was very high because the test put 3 docs against a high total number of docs, so everything matched. Also, the tf-idf value produced in the index doesn't work for our purposes. There are a couple of ways to fix this behaviour, but you are not going to like them.
I believe the way idf was calculated is not very good. For testing purposes, if the number of docs is very close to term_doc_list_length, the results are not accurate. I'm testing the following.
1) Change the tf-idf in indexer.base.py from (1 + log tf) * log(N/df) to (tf_term / total_words) * (1 + log(N/df)), and do the same in the query calculation. Tested and calculated by hand. The PAIN is that it would require rebuilding the index to change the tf-idf. total_words refers to the number of words in a document and tf_term is the number of times a term appears in a doc; N is the number of docs and df is how many docs contain the term. Reference: https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/ Review commit https://github.com/moritzschaefer/the-search-engine/commit/57dd7717e58a09ad2e2105664c34c449b59d0967
2) We could try the implementation from Stanford NLP (http://nlp.stanford.edu/IR-book/html/htmledition/efficient-scoring-and-ranking-1.html). From what I got by changing things here and there, although in theory we don't need to change the index, we would need a way to get total_words[doc_id] for a doc and divide the score by it.
Currently working on this.
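A minimal sketch of the first variant, using the parameter names defined above (tf_term, total_words, N, df); natural log is assumed here, though the repo may use a different base:

```python
import math

def tf_idf_v1(tf_term, total_words, n_docs, df):
    """Variant 1: (tf_term / total_words) * (1 + log(N / df)).

    tf_term     -- occurrences of the term in the document
    total_words -- total number of words in that document
    n_docs      -- N, the number of documents in the collection
    df          -- number of documents containing the term
    """
    return (tf_term / total_words) * (1 + math.log(n_docs / df))
```

Note that even when a term occurs in every document (df == N) the weight does not vanish, because the idf factor bottoms out at 1.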
UPDATE: This 2nd method looks promising. Although I thought no change would be needed, I proved myself wrong. BUT here we save computation by avoiding the computation of the query weight. The modification would be to change the tf-idf in indexer.base.py from (1 + log tf) * log(N/df) to (1 + log tf) * (1 + log(N/df)) (just add a one to the idf). This should work without rebuilding the index, since the tf-idf values are already very high (the old formula didn't work for testing since idf = log(N/df) was 0). I like the resulting values and they make sense from theory, but I'm not 100% sure of this. Review this commit https://github.com/moritzschaefer/the-search-engine/commit/582e8a2a5eabe3dd31c0c00829441873b5e48f0b
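The second variant can be sketched the same way; again natural log is assumed, and the function name is hypothetical:

```python
import math

def tf_idf_v2(tf, n_docs, df):
    """Variant 2: (1 + log tf) * (1 + log(N / df)).

    The extra 1 on the idf keeps the weight non-zero even when a term
    occurs in every document (df == N) -- exactly the case that made
    idf = log(N/df) come out as 0 in the tests.
    """
    return (1 + math.log(tf)) * (1 + math.log(n_docs / df))
```

For tf = 1 and df == N the weight is exactly 1, instead of 0 under the old formula.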
Rebuilding the index is okay, if you say this is the proper way...
What about the slides? I thought the way the prof suggested it was pretty accurate...
The slides are based on the Information Retrieval book used at Stanford. The 2nd method is not in the slides, but it is supposed to be faster. The slides and most of the material I found online are not very clear about what they use for tf-idf, because there are many weighting schemes for query-doc cosine similarity. The ddd.qqq notation describes them, and a common scheme is lnc.ltc.
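For reference, lnc.ltc means: document side uses logarithmic tf, no idf, cosine normalization; query side uses logarithmic tf, idf log(N/df), cosine normalization. A hypothetical minimal sketch (function names and dict-based interfaces are made up for illustration, and degenerate cases like an all-zero query vector are not handled):

```python
import math

def lnc_weights(tf_counts):
    """Document side 'lnc': log tf, no idf, cosine-normalized."""
    raw = {t: 1 + math.log(tf) for t, tf in tf_counts.items() if tf > 0}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

def ltc_weights(tf_counts, n_docs, df):
    """Query side 'ltc': log tf, idf log(N/df), cosine-normalized."""
    raw = {t: (1 + math.log(tf)) * math.log(n_docs / df[t])
           for t, tf in tf_counts.items() if tf > 0}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

def lnc_ltc_score(doc_tf, query_tf, n_docs, df):
    d = lnc_weights(doc_tf)
    q = ltc_weights(query_tf, n_docs, df)
    return sum(w * d.get(t, 0.0) for t, w in q.items())
```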
Please review the following: the 2nd method described in my update above (changing the tf-idf in indexer.base.py to (1 + log tf) * (1 + log(N/df))). Review this commit https://github.com/moritzschaefer/the-search-engine/commit/582e8a2a5eabe3dd31c0c00829441873b5e48f0b
@moritzschaefer Please check https://github.com/moritzschaefer/the-search-engine/commit/4c7e64c4f09efdb9f5200dbb525e67e4a5d0378a
Can we close this then?
Yep, do it.
This code makes ALL and_scored_docs[key] equal to 1. Before this division they have a large variety of different values; afterwards they all become 1. This means that, for each key, the following holds: and_scored_docs[key] == (math.sqrt(doc_length[key]) * math.sqrt(query_length[key]))
Could you please double-check your formula? I really don't get how this can happen...