php / php-src

The PHP Interpreter
https://www.php.net
Other
37.97k stars 7.73k forks source link

levenshtein bug #13823

Open MyNameIsRoman opened 5 months ago

MyNameIsRoman commented 5 months ago

Description

The following code:

<?php
echo levenshtein('aa', 'ab', 1, PHP_INT_MAX, 1);

Resulted in this output:

-9223372036854775807

But I expected this output instead:

2

Also levenshtein(str_repeat('a', 100000), str_repeat('b', 1000000), 1, PHP_INT_MAX - 500000, 1)

returning unexpected -9223372036854625811

PHP Version

any

Operating System

any 64-bit

nielsdos commented 5 months ago

This could be solved by changing the computation of optimal cost to use deltas. I'd have to check what the performance impact of that is, and if there is a better way though. I'll have a better look this evening.

nielsdos commented 5 months ago

It's always going to be possible to choose costs that result in an overflow one way or another. Using deltas helps in these example cases because these are fairly legit: you don't want to use a substitution, this is a legit concern. However, using deltas has a noticeable performance penalty of about 15% in my tests for moderately long strings. One pragmatic solution may be to adjust the costs behind the scenes, e.g. change the maximum to be lowest+second_lowest+1 (for positive costs) which would fix non-pathalogical cases.

cmb69 commented 2 months ago

Hmm, I wonder if there are practical applications of high (absolute numbers of) cost values at all. I mean, usually(?) you'd pass very small numbers; so if we would restrict these values, that might not break any reasonable code using the levenshtein() function.

Another option might be to do the internal calculations with floats/doubles. That would be less precise (and probably slower), but at least you won't get completely wrong results.

If we want to avoid any breakage, we may instead choose to just document the issue.

Frankly, I see fee (if any) cases where levenshtein() can be used for proper texts since it is practically constrained to the 26 latin letters (even mixing lower and upper case letters wouldn't make much sense), and I don't know if there are any other applications. So it's probably not worth spending too much time on this issue (nonetheless I'd like to see it resolved in some way), since the reported bugs appear more like the results of failed attempts to find a security issue, than an issue stemming from a pratical application.

nielsdos commented 2 months ago

I made two possible suggestions on how to mitigate it here: https://github.com/php/php-src/pull/14809#issuecomment-2211210951

I also thought about using floats / doubles but didn't pursue that due to the precision loss.

If we want to avoid any breakage, we may instead choose to just document the issue.

That's a possibility too, not sure though.

One interesting use case which I have seen for levenshtein is sequence alignment cost for DNA or proteins.