[Closed] koheiw closed this pull request 3 years ago
Merging #15 (66a5409) into master (61adc39) will increase coverage by 0.30%. The diff coverage is 100.00%.
```
@@            Coverage Diff             @@
##           master      #15      +/-   ##
==========================================
+ Coverage   98.79%   99.10%   +0.30%
==========================================
  Files           4        4
  Lines         332      334       +2
==========================================
+ Hits          328      331       +3
+ Misses          4        3       -1
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| R/proxy.R | 99.00% <100.00%> (+0.02%) | :arrow_up: |
| src/linear.cpp | 98.68% <100.00%> (ø) | |
| src/pair.cpp | 99.20% <100.00%> (+0.79%) | :arrow_up: |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a91014f...66a5409. Read the comment docs.
@kasperwelbers, do you have comments on smoothing?
@koheiw, no comments really, I think this approach works well. I'm not an expert on smoothing, but every now and then I read up on smoothing in the hopes of finding something better than laplace / add-delta smoothing, and I never found strong evidence that a more sophisticated approach works better (for unigram smoothing).
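To make the discussion concrete, here is a minimal sketch of the add-delta smoothing mentioned above (Laplace smoothing when delta = 1) for unigram counts; the function and argument names are illustrative, not part of this package's API:

```r
# Add-delta smoothing: add a constant to every count before normalizing,
# so zero counts become small non-zero probabilities.
smooth_unigram <- function(counts, delta = 1) {
  (counts + delta) / (sum(counts) + delta * length(counts))
}

counts <- c(the = 5, cat = 2, sat = 0)
smooth_unigram(counts, delta = 1)  # the zero count for "sat" becomes 1/11
```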
@kasperwelbers thanks. Do you want to add smoothing to measures other than Chi-squared and Kullback?
@koheiw I don't think I would suggest that. I suppose the argument would be that it makes sense to the user if smoothing was an option for every measure, but it would also make things confusing, as you'd have to explain why they can but probably shouldn't smooth for things like cosine similarity.
I don't think there is a strong benefit to using smoothing for dot-product-based measures, but I do see a strong downside. I haven't thought this through carefully, but I suppose it would mean that the resulting adjacency matrix would always be completely dense (if no term value is zero, every document pair has a non-zero similarity).
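The density concern can be illustrated with a toy cosine computation (a sketch; adding 1 here stands in for any smoothing constant):

```r
# Two documents that share no terms have cosine similarity exactly zero,
# so a sparse similarity matrix can drop that cell entirely.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

x <- c(2, 0, 0, 1)
y <- c(0, 3, 1, 0)
cosine(x, y)          # 0: the pair can stay out of a sparse result
cosine(x + 1, y + 1)  # positive: smoothing makes every pair non-zero, hence dense
```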
It is easy to make smoothing possible in all the pairwise methods, but it does not make sense for set-theory-based or spatial similarity/distance measures. Chi-squared and Kullback are different because they compare distributions, as you know. Canberra seems to be a spatial measure.
Results would be dense, but that is not a problem because users can still sparsify them using `min_simil` or `rank`.
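A minimal sketch of that pruning, assuming this package exposes a `simil()` function that accepts the `min_simil` and `rank` arguments named above (the exact signature may differ):

```r
library(Matrix)

# A small random sparse document-feature matrix for illustration.
mt <- Matrix::rsparsematrix(5, 20, density = 0.3)

s1 <- simil(mt, method = "cosine", min_simil = 0.5)  # drop pairs below 0.5
s2 <- simil(mt, method = "cosine", rank = 3)         # keep only the top 3 per row
```

Either option returns a sparse result even when the raw similarities are all non-zero.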
Right, I think that distinction makes sense. You just need smoothing if the vectors represent probability distributions.
The dense results indeed wouldn't be a problem with some pruning, but users might not want to use a threshold if they don't need to. So then smoothing would just create an extra reason to use a threshold.
It seems we are already in good shape, if my classification is correct.
Hey all! I'll review these changes on Wednesday, would that be ok?
@rcannood it would be great if you could review this PR and merge before I update the CRAN version. I addressed major bugs recently.
For issue #8, I added the `smooth` argument. I also corrected how Chi-squared and Kullback divergence scores are computed; they now match the entropy package's output. Do you think I should add `smooth` to other measures like Canberra?
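As a sanity check on the idea (not the package's actual implementation), Kullback-Leibler divergence with add-delta smoothing can be hand-rolled like this; the zero count would otherwise make the divergence infinite:

```r
# KL divergence between two count vectors after add-delta smoothing.
kl_div <- function(p_counts, q_counts, smooth = 1) {
  p <- (p_counts + smooth) / sum(p_counts + smooth)
  q <- (q_counts + smooth) / sum(q_counts + smooth)
  sum(p * log(p / q))
}

kl_div(c(4, 0, 1), c(2, 2, 1))  # finite thanks to smoothing of the zero count
```

The result should agree with the entropy package applied to the same smoothed frequency vectors, which is presumably how the outputs were matched.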