koheiw / proxyC

R package for large-scale similarity/distance computation
GNU General Public License v3.0

Add smooth2 #15

Closed koheiw closed 3 years ago

koheiw commented 3 years ago

For issue #8, I added the smooth argument. I also corrected how the Chi-squared and Kullback-Leibler divergence scores are computed; they now match the entropy package's output. Do you think I should add smooth to other measures such as Canberra?
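To make the discussion concrete, here is a minimal Python sketch (not proxyC's actual C++ code) of what the smooth argument conceptually does for Kullback-Leibler divergence: a constant is added to the count vectors before they are normalized into distributions, so zero counts no longer produce infinite divergences. The function name `kl_divergence` is illustrative only.

```python
import math

def kl_divergence(x, y, smooth=1.0):
    """KL divergence D(p || q) between two count vectors after add-constant smoothing."""
    xs = [v + smooth for v in x]
    ys = [v + smooth for v in y]
    p = [v / sum(xs) for v in xs]
    q = [v / sum(ys) for v in ys]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Without smoothing, the zero counts below would make the divergence infinite.
print(kl_divergence([3, 0, 1], [1, 2, 0]))
```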

codecov[bot] commented 3 years ago

Codecov Report

Merging #15 (66a5409) into master (61adc39) will increase coverage by 0.30%. The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #15      +/-   ##
==========================================
+ Coverage   98.79%   99.10%   +0.30%     
==========================================
  Files           4        4              
  Lines         332      334       +2     
==========================================
+ Hits          328      331       +3     
+ Misses          4        3       -1     
Impacted Files Coverage Δ
R/proxy.R 99.00% <100.00%> (+0.02%) ↑
src/linear.cpp 98.68% <100.00%> (ø)
src/pair.cpp 99.20% <100.00%> (+0.79%) ↑

Continue to review full report at Codecov.

Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update a91014f...66a5409.

koheiw commented 3 years ago

@kasperwelbers, do you have comments on smoothing?

kasperwelbers commented 3 years ago

@koheiw, no comments really, I think this approach works well. I'm not an expert on smoothing, but every now and then I read up on smoothing in the hopes of finding something better than laplace / add-delta smoothing, and I never found strong evidence that a more sophisticated approach works better (for unigram smoothing).
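The Laplace / add-delta family mentioned above can be sketched in a few lines. This is a generic illustration (the function name `smooth_probs` is made up for this example): each count gets a constant delta added before normalization, with delta = 1 giving Laplace (add-one) smoothing.

```python
def smooth_probs(counts, delta=1.0):
    """Turn raw counts into smoothed probabilities: (c_i + delta) / (N + delta * V)."""
    total = sum(counts) + delta * len(counts)
    return [(c + delta) / total for c in counts]

# Every category receives non-zero probability mass, even unseen ones.
print(smooth_probs([5, 0, 3], delta=1.0))
```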

koheiw commented 3 years ago

@kasperwelbers thanks. Do you want to add smoothing to measures other than Chi-squared and Kullback?

kasperwelbers commented 3 years ago

@koheiw I don't think I would suggest that. I suppose the argument would be that it makes sense to the user if smoothing were an option for every measure, but it could also be confusing, since you would have to explain why users can, but probably should not, smooth measures like cosine similarity.

I don't think there is a strong benefit to using smoothing for dot-product-based measures, but I do see a strong downside. I haven't thought this through carefully, but I suppose it would mean that the resulting adjacency matrix would always be completely dense (if no term value is zero, every document pair has non-zero similarity).
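The density concern can be demonstrated with a small numeric sketch (names like `cosine_matrix` are hypothetical, and this is plain NumPy rather than proxyC): documents with disjoint vocabularies have zero cosine similarity, but after adding a smoothing constant, every pairwise similarity becomes non-zero and the matrix is completely dense.

```python
import numpy as np

# Three documents over disjoint vocabularies.
docs = np.array([[2, 0, 0],
                 [0, 3, 0],
                 [0, 0, 1]], dtype=float)

def cosine_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T

print(np.count_nonzero(cosine_matrix(docs)))        # only the diagonal survives
print(np.count_nonzero(cosine_matrix(docs + 1.0)))  # smoothing: fully dense
```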

koheiw commented 3 years ago

It is easy to enable smoothing in all the pairwise methods, but it does not make sense for set-theoretic or spatial similarity/distance measures. Chi-squared and Kullback-Leibler are different: they compare distributions, as you know. Canberra seems to be a spatial measure.

The results would be dense, but that is not a problem because users can still sparsify them using min_simil or rank.
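The two pruning strategies mentioned, which proxyC exposes as the min_simil and rank arguments, can be sketched in NumPy as follows (the function names `prune_min_simil` and `prune_rank` are made up for illustration): either drop entries below a similarity threshold, or keep only the k largest entries per row.

```python
import numpy as np

sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.3],
                [0.1, 0.3, 1.0]])

def prune_min_simil(m, threshold):
    """Zero out all similarities below a fixed threshold."""
    out = m.copy()
    out[out < threshold] = 0.0
    return out

def prune_rank(m, k):
    """Keep only the k largest similarities in each row."""
    out = np.zeros_like(m)
    for i, row in enumerate(m):
        top = np.argsort(row)[-k:]  # indices of the k largest values
        out[i, top] = row[top]
    return out

print(np.count_nonzero(prune_min_simil(sim, 0.5)))  # 5 entries survive
print(np.count_nonzero(prune_rank(sim, 2)))         # 2 per row: 6 entries
```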

kasperwelbers commented 3 years ago

Right, I think that distinction makes sense: you only need smoothing if the vectors represent probability distributions.

The dense results indeed wouldn't be a problem with some pruning, but users might not want to use a threshold if they don't need to. So then smoothing would just create an extra reason to use a threshold.

koheiw commented 3 years ago

It seems we are already in good shape, assuming my classification is correct.

rcannood commented 3 years ago

Hey all! I'll review these changes on Wednesday, would that be ok?

koheiw commented 3 years ago

@rcannood it would be great if you could review this PR and merge it before I update the CRAN version. I fixed some major bugs recently.

koheiw commented 3 years ago

Thank you for taking the time, @rcannood. It would be fantastic to add Hamming! The algorithm looks very simple.
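Indeed, the Hamming distance is about as simple as distance measures get: count the positions at which two equal-length vectors differ. A minimal sketch (not the eventual proxyC implementation):

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    return sum(a != b for a, b in zip(x, y))

print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
```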

Let me merge this PR.