tonytonyjan / jaro_winkler

Ruby & C implementation of the Jaro-Winkler distance algorithm, with support for UTF-8 strings.
MIT License

Switch UTF-8 conversion to go to UTF-16 instead of unsigned long long #8

Closed tepperly closed 6 years ago

tepperly commented 9 years ago

I tried converting UTF-8 into uint16_t (UTF-16 code units) instead of unsigned long long. The Wikipedia article on UTF-8 says that this should be valid. On my machine this makes the comparison faster.

Benchmarks after the change:

Rehearsal ----------------------------------------------------
jaro_winkler       0.390000   0.000000   0.390000 (  0.381969)
fuzzystringmatch   0.510000   0.000000   0.510000 (  0.506941)
hotwater           0.580000   0.010000   0.590000 (  0.586986)
amatch             1.180000   0.000000   1.180000 (  1.177425)
------------------------------------------- total: 2.670000sec

                       user     system      total        real
jaro_winkler       0.380000   0.000000   0.380000 (  0.381681)
fuzzystringmatch   0.620000   0.000000   0.620000 (  0.625234)
hotwater           0.570000   0.000000   0.570000 (  0.569079)
amatch             1.030000   0.000000   1.030000 (  1.037745)

The benchmarks prior to the change are:

Rehearsal ----------------------------------------------------
jaro_winkler       0.460000   0.000000   0.460000 (  0.460248)
fuzzystringmatch   0.500000   0.010000   0.510000 (  0.508658)
hotwater           0.580000   0.000000   0.580000 (  0.584665)
amatch             1.040000   0.000000   1.040000 (  1.055028)
------------------------------------------- total: 2.590000sec

                       user     system      total        real
jaro_winkler       0.470000   0.000000   0.470000 (  0.473629)
fuzzystringmatch   0.480000   0.000000   0.480000 (  0.488322)
hotwater           0.570000   0.000000   0.570000 (  0.573510)
amatch             1.150000   0.000000   1.150000 (  1.175987)

I expected it to be faster because it reduces the memory bandwidth requirements: each element is 2 bytes wide instead of the (typically) 8 bytes of an unsigned long long.
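To make the memory argument concrete, here is a minimal Ruby sketch, not code from the pull request. It packs the same code points as 16-bit and as 64-bit unsigned integers and compares the buffer sizes. The sample string and variable names are made up for illustration. It also shows a caveat: characters outside the Basic Multilingual Plane need two UTF-16 code units, so a one-uint16_t-per-character layout only holds for code points up to U+FFFF.

```ruby
# Hypothetical illustration; the sample text and names are assumptions.
text = "日本語テキスト"

codepoints = text.codepoints        # one Integer per character

# Pack the code points into raw buffers of different element widths.
as_u16 = codepoints.pack("S*")      # 16-bit unsigned, like uint16_t
as_u64 = codepoints.pack("Q*")      # 64-bit unsigned, like a typical unsigned long long

puts "characters:        #{codepoints.size}"
puts "bytes as 16-bit:   #{as_u16.bytesize}"
puts "bytes as 64-bit:   #{as_u64.bytesize}"

# A character above U+FFFF is encoded as a surrogate pair in UTF-16,
# so it occupies two 16-bit code units rather than one.
emoji_units = "😃".encode("UTF-16LE").unpack("S<*")
puts "UTF-16 code units for one emoji: #{emoji_units.size}"
```

The 16-bit buffer is a quarter of the size of the 64-bit one for the same text, which is the bandwidth saving described above. A comparison that treats each 16-bit unit as a character would count the emoji as two characters.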

tonytonyjan commented 6 years ago

Sorry for the delayed response, and thanks for your pull request. Unfortunately, I am closing this because code point conversion is no longer hardcoded and now uses the MRI API instead. That change also fixed #7.

Thank you for making jaro_winkler better 😃
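The maintainer's C code is not shown in this thread. As a rough Ruby-level analogue of letting the interpreter decode code points instead of hand-written UTF-8 parsing, a sketch might look like the following. The helper name `to_codepoints` is hypothetical and is not part of the jaro_winkler API.

```ruby
# Hypothetical helper: let Ruby decode the string rather than parsing
# UTF-8 bytes by hand. Transcoding to UTF-8 first means the returned
# integers are Unicode code points regardless of the source encoding.
def to_codepoints(str)
  str.encode("UTF-8").codepoints
end

p to_codepoints("jaro")                      # ASCII input
p to_codepoints("😃")                        # one code point above U+FFFF
p to_codepoints("あ".encode("Shift_JIS"))    # non-UTF-8 input
```

Because decoding is delegated to Ruby, characters above U+FFFF come back as a single integer, with no fixed-width assumption about the storage type.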