mhutter / string-similarity

Calculate String Similarities
MIT License

N-gram cosine similarity #8

Status: Closed (imustafin closed 4 years ago)

imustafin commented 4 years ago

This PR adds support for computing cosine similarity over the n-grams of the strings. The previous behavior is equivalent to using 1-grams (single characters). Example from the included specs:

it 'returns correct 2-gram similarity' do
  # here _ is substitution for the pad symbol
  # abc has bigrams: _a, ab, bc, c_
  # abcacbc has bigrams: _a, ab, bc, ca, ac, cb, bc, c_

  expect(klass.cosine('abc', 'abcacbc',
                      ngram: 2)).to be_within(0.001).of(0.79)
end
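For anyone reading along, here is a rough standalone sketch of how a padded n-gram cosine could be computed. It is not the code from this PR: `padded_ngrams` and `ngram_cosine` are hypothetical names, the `_` pad character is borrowed from the spec comment only for illustration, and padding with `n - 1` symbols on each side is an assumption inferred from the bigram lists above. It uses `Enumerable#tally`, so it needs Ruby 2.7 or newer.

```ruby
# Hypothetical helper (not part of the library): split a string into
# n-grams after padding both ends with n - 1 pad symbols.
def padded_ngrams(str, n, pad: '_')
  padding = pad * (n - 1)
  (padding + str + padding).each_char.each_cons(n).map(&:join)
end

# Hypothetical sketch of cosine similarity over n-gram count vectors.
def ngram_cosine(a, b, ngram: 1)
  unless ngram.is_a?(Integer) && ngram.positive?
    raise ArgumentError, 'ngram must be a positive integer'
  end

  # Count how often each n-gram occurs in each string.
  va = padded_ngrams(a, ngram).tally
  vb = padded_ngrams(b, ngram).tally

  # Dot product over the n-grams of `a`; n-grams missing from `b` count as 0.
  dot = va.sum { |gram, count| count * vb.fetch(gram, 0) }
  norm = Math.sqrt(va.values.sum { |c| c * c }) *
         Math.sqrt(vb.values.sum { |c| c * c })

  norm.zero? ? 0.0 : dot / norm
end

puts ngram_cosine('abc', 'abcacbc', ngram: 2).round(4)
```

Working the spec example by hand under these assumptions: the shared bigrams are `_a`, `ab`, `bc` (twice in the second string) and `c_`, so the dot product is 1 + 1 + 2 + 1 = 5. The vector lengths are sqrt(4) = 2 and sqrt(10), giving 5 / (2 * sqrt(10)), which is about 0.79 and matches the expectation in the spec.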

The feature could probably be documented in more detail.

Anyway, I would be happy to receive any feedback :)

imustafin commented 4 years ago

Just noticed that the correct exception wasn't raised.

Yes, I tried to add more specs to better show how it works. It should be correct now :)