Open MyNameIsRoman opened 5 months ago
This could be solved by changing the computation of optimal cost to use deltas. I'd have to check what the performance impact of that is, and if there is a better way though. I'll have a better look this evening.
It's always going to be possible to choose costs that result in an overflow one way or another.
Using deltas helps in these example cases because these are fairly legit: you don't want to use a substitution, this is a legit concern.
However, using deltas has a noticeable performance penalty of about 15% in my tests for moderately long strings.
One pragmatic solution may be to adjust the costs behind the scenes, e.g. change the maximum to be lowest+second_lowest+1
(for positive costs) which would fix non-pathalogical cases.
Hmm, I wonder if there are practical applications of high (absolute numbers of) cost values at all. I mean, usually(?) you'd pass very small numbers; so if we would restrict these values, that might not break any reasonable code using the levenshtein()
function.
Another option might be to do the internal calculations with floats/doubles. That would be less precise (and probably slower), but at least you won't get completely wrong results.
If we want to avoid any breakage, we may instead choose to just document the issue.
Frankly, I see fee (if any) cases where levenshtein()
can be used for proper texts since it is practically constrained to the 26 latin letters (even mixing lower and upper case letters wouldn't make much sense), and I don't know if there are any other applications. So it's probably not worth spending too much time on this issue (nonetheless I'd like to see it resolved in some way), since the reported bugs appear more like the results of failed attempts to find a security issue, than an issue stemming from a pratical application.
I made two possible suggestions on how to mitigate it here: https://github.com/php/php-src/pull/14809#issuecomment-2211210951
I also thought about using floats / doubles but didn't pursue that due to the precision loss.
If we want to avoid any breakage, we may instead choose to just document the issue.
That's a possibility too, not sure though.
One interesting use case which I have seen for levenshtein is sequence alignment cost for DNA or proteins.
Description
The following code:
Resulted in this output:
But I expected this output instead:
Also levenshtein(str_repeat('a', 100000), str_repeat('b', 1000000), 1, PHP_INT_MAX - 500000, 1)
returning unexpected -9223372036854625811
PHP Version
any
Operating System
any 64-bit