opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

Potentially undetected count overflow in do_count() #1325

Open ampli opened 2 years ago

ampli commented 2 years ago

While running tests on "amy" with ASAN/UBSAN, I got this:

... amy/link-grammar/parse/count.c:1390:7: runtime error: signed integer overflow: 2147483647 * 5569944920 cannot be represented in type 'long int'
... amy/link-grammar/parse/count.c:1390:7: runtime error: signed integer overflow: -7518702831433443131 + -6485378441171344729 cannot be represented in type 'long int'

This just happens never to occur with the other languages and the various corpus-* files, but in principle it could happen with the current code.

All the calculations are done in signed 64-bit (the storage is in signed 32-bit due to a recent change, but that has no implications here). The arithmetic is 64-bit for historical reasons, but it may be a good idea to keep it signed, because overflows can then be detected by the compiler's UBSAN checks.
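For illustration, a standalone snippet (not count.c code) reproduces the first reported error; building with -fsanitize=undefined makes UBSAN flag the 64-bit multiplication at runtime:

```c
/* Standalone reproduction of the first reported error; not count.c code.
 * Build with: cc -fsanitize=undefined demo.c && ./a.out */
#include <stdint.h>

int main(void)
{
	int64_t a = 2147483647;      /* INT_MAX, a clipped count */
	int64_t b = 5569944920;      /* second operand from the UBSAN report */
	volatile int64_t c = a * b;  /* UBSAN: signed integer overflow */
	(void)c;
	return 0;
}
```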

The problem arises from the clipping value. It is INT_MAX, which is 2^31-1. Observe the following code: https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1371-L1381 Each of the 4 do_count() calls may return INT_MAX, and hence leftcount (and similarly rightcount) can be up to 4*(2^31-1) = 2^33-4.

Then we have this code: https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1390 Here we may get up to (total + (2^33-4) * (2^31-1)), which may be > 2^63.

And we also have this: https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1428 So here we may get (total + (2^33-4) * (2^33-4)), which is much more than 2^63 (and total here may already be near or above 2^63).
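To see the magnitude concretely, the worst case can be checked in 128-bit arithmetic (a GCC/Clang extension), so the check itself cannot overflow; this is illustration only, not count.c code:

```c
/* Standalone magnitude check using the GCC/Clang __int128 extension.
 * Shows that the worst-case product needs more than 63 bits. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	unsigned __int128 lc = ((unsigned __int128)1 << 33) - 4;  /* max leftcount  */
	unsigned __int128 rc = lc;                                /* max rightcount */
	unsigned __int128 p  = lc * rc;                           /* about 2^66 */

	printf("product exceeds INT64_MAX: %s\n",
	       p > (unsigned __int128)INT64_MAX ? "yes" : "no");
	return 0;
}
```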

Possible solutions:

  1. Do nothing, as this has empirically proved not to be a problem with "real" dicts.
  2. Clamp leftcount and rightcount before using them in the multiplications (see the sketch after this list).
  3. In parse_count_clamp(), clamp to 2^29-1.
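
A minimal sketch of option (2) follows, under assumed names: clamp_count() stands in for parse_count_clamp(), and the types are modeled on the issue text rather than the actual count.c code.

```c
/* Minimal sketch of option (2); names and types are assumptions, not
 * the actual count.c code. */
#include <stdint.h>
#include <limits.h>
#include <inttypes.h>
#include <stdio.h>

typedef int64_t w_count_t;  /* wide arithmetic type (assumed) */

/* Stand-in for parse_count_clamp(): clip a count to INT_MAX. */
static w_count_t clamp_count(w_count_t c)
{
	return (c > INT_MAX) ? (w_count_t)INT_MAX : c;
}

int main(void)
{
	/* Worst case from the analysis: four do_count() results of INT_MAX. */
	w_count_t leftcount = 4 * (w_count_t)INT_MAX;  /* 2^33 - 4 */
	w_count_t count = INT_MAX;                     /* clipped sub-count */
	w_count_t total = INT_MAX;                     /* clipped running total */

	leftcount = clamp_count(leftcount);  /* now at most 2^31 - 1 */

	/* total + (2^31-1)*(2^31-1) < 2^63: no signed overflow possible. */
	total += leftcount * count;
	printf("total = %" PRId64 "\n", total);
	return 0;
}
```

Option (3) is the same idea with the clip value lowered to 2^29-1, so even the unclamped intermediate sums stay within range.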

Options (2) and (3) add a slight overhead, but while analyzing the overflow possibilities I found some places where the efficiency can be improved:

  1. In the macro CACHE_COUNT(), c is unnecessarily wide, and count_t can be used instead. However, maybe the compiler already performs such an optimization (I didn't check).
  2. The parse_count_clamp() call that has the debug printout "OVERFLOW1" is not needed, since the loop can be performed no more than (2^8 * 2) times and can accumulate no more than (2^31-1) per iteration (max. total ~2^40). The final parse_count_clamp() call ("OVERFLOW2"), after potentially accumulating up to an additional 2^31-1, is enough. (A back-of-envelope check of these numbers follows this list.)
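
A quick standalone check of the numbers in item 2 (pure arithmetic, not count.c code; note that the iteration bound is revisited later in this thread):

```c
/* Back-of-envelope check for item 2 above; standalone arithmetic, not
 * count.c code.  The conclusion holds for any loop bound up to 2^31. */
#include <stdio.h>
#include <stdint.h>
#include <limits.h>

int main(void)
{
	int64_t iterations = 1 << 9;                         /* 2^8 * 2 = 512 */
	int64_t max_total  = iterations * (int64_t)INT_MAX;  /* < 2^40 */

	/* One final clamp ("OVERFLOW2") suffices: even adding another
	 * INT_MAX keeps the sum far below 2^63. */
	printf("max accumulated total: %lld\n", (long long)(max_total + INT_MAX));
	return 0;
}
```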

For now, I would choose option (2).

@linas, please check whether my analysis is correct, and I will send a PR (if needed).

linas commented 2 years ago

I spent the last 15 minutes looking at this, but don't see a clear answer. I see this:

(a) Clamp leftcount and rightcount before using them in the multiplications.
(b) Comment out all uses of parse_count_clamp(), review each use of hist* and determine which ones need to be checked, and add parse_count_clamp() back in, as needed; remove all other uses of parse_count_clamp().
(c) Some muddled alternative to (b), involving reviewing each use of parse_count_clamp() to see if it's actually needed. This risks missing a location that should have been checked.
(d) Changing INT_MAX to 2^29-1 and crossing one's fingers and hoping that's enough.

I think your proposal 2. is my (a) and your proposal 3. is my (d)

The question is really "do I feel like a perfectionist today?" and "Do I have the mental energy to do this?" Maybe I should try, and you can review my work?

ampli commented 2 years ago

(b) Comment out all uses of parse_count_clamp(), review each use of hist* and determine which ones need to be checked, and add parse_count_clamp() back in, as needed; remove all other uses of parse_count_clamp()

(c) Some muddled alternative to (b), involving reviewing each use of parse_count_clamp() to see if it's actually needed. This risks missing a location that should have been checked.

I already mostly analyzed all of that in my post above, and my conclusion was that the changes I proposed in (2) will fix the problem:

I forgot to analyze the impact of line 1390 after applying my proposed fix. https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1390 In that case, leftcount is clipped. total is already clipped by the existing call to parse_count_clamp() at the end of the match loop, and count is read from a clipped table entry (or a do_count() result, which is clipped). So the max. total is (2^31-1) + ((2^31-1) * (2^31-1)) < 2^62. Then the multiplication at https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1428 can add to it at most ((2^31-1) * (2^31-1)) < 2^62, so the result is still less than 2^63. (It's very subtle; I hope I didn't miss something...)
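The bound above can be checked mechanically with the GCC/Clang overflow builtins; the snippet mirrors the two accumulation steps with standalone arithmetic, not the actual count.c code:

```c
/* Standalone check of the bound above using GCC/Clang overflow builtins;
 * mirrors the two accumulation steps, not the actual count.c code. */
#include <stdint.h>
#include <limits.h>
#include <stdio.h>

int main(void)
{
	int64_t clip  = INT_MAX;   /* every operand is clipped to 2^31-1 */
	int64_t total = clip;      /* already-clipped running total */
	int64_t prod, t1, t2;
	int ovf = 0;

	ovf |= __builtin_mul_overflow(clip, clip, &prod);  /* < 2^62 */
	ovf |= __builtin_add_overflow(total, prod, &t1);   /* the line-1390 step */
	ovf |= __builtin_add_overflow(t1, prod, &t2);      /* the line-1428 step */

	/* Prints "overflow detected: 0": the final total fits in 63 bits. */
	printf("overflow detected: %d, final total: %lld\n", ovf, (long long)t2);
	return 0;
}
```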

(d) Changing INT_MAX to 2^29-1 and crossing one's fingers and hoping that's enough.

It seems to me that it wouldn't hurt to do that too.

In any case, I can send a PR for my proposal if it seems fine to you. If you choose to do it yourself, I can of course review it.

linas commented 2 years ago

Yes, send a pull req. At each location, add comments such as /* can add at most ((2^31-1) * (2^31-1)) < 2^62, so the result is still less than 2^63 */ ... because it is subtle and hard to review.

ampli commented 2 years ago

I said:

  2. The parse_count_clamp() call that has the debug printout "OVERFLOW1" is not needed, since the loop can be performed no more than (2^8 * 2) times

The number (2^8 * 2) is incorrect, since the loop is over the disjuncts. But assuming fewer than 2^31 disjuncts per word, an overflow here is still impossible. It doesn't seem reasonable that the program could ever handle billions of disjuncts per word, when a large part of them leads to a very high count (overflow or near-overflow). So at most, a sanity check could be done (elsewhere, when the disjuncts are counted for creating the disjunct memory block) that the number of disjuncts is less than 2^31; a sketch of such a check follows.
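
Purely as illustration, such a check might look like the sketch below; the function name and call site are hypothetical, not the actual link-grammar API:

```c
/* Hypothetical sketch of the sanity check discussed above; the name
 * disjunct_count_ok() is illustrative, not actual link-grammar API. */
#include <stdint.h>
#include <limits.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdbool.h>

/* Would run where the disjuncts are counted for creating the disjunct
 * memory block. */
static bool disjunct_count_ok(uint64_t num_disjuncts)
{
	if (num_disjuncts >= (uint64_t)INT_MAX)
	{
		fprintf(stderr, "Too many disjuncts per word: %" PRIu64 "\n",
		        num_disjuncts);
		return false;
	}
	return true;
}

int main(void)
{
	/* ~3M disjuncts per word (the generation-mode figure below) is fine. */
	printf("3M disjuncts ok: %d\n", disjunct_count_ok(3000000));
	return 0;
}
```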

ampli commented 2 years ago

a sanity check can be done

I didn't implement that; it seems like just unneeded overhead...

linas commented 2 years ago

billions of disjuncts per word,

That won't happen. There won't even be a million.

However, in the earlier days, I had SQL dicts that had a hundred UNKNOWN-WORD entries, each of them with maybe 10K disjuncts. These caused combinatorial overflows and parsing timeouts, even when I set parse-timeout=600.

ampli commented 2 years ago

There won't even be a million

In generation mode, we have ~3M disjuncts per word (in the middle of sentences). However, as we know, this causes severe slowness... The speed can be improved (a WIP), but the real solution is to implement the disjunct-sampling method you have suggested. However, that will be for a future version (i.e. not for 5.11.0).