Open ampli opened 2 years ago
I spent the last 15 minutes looking at this, but don't see a clear answer. I see this:
hist_*
defines to check for overflow. That has a performance impact. parse_count_clamp()
, review each use of hist_*
and determine which ones need to be checked, and add parse_count_clamp()
back in, as needed; remove all other uses of parse_count_clamp()
parse_count_clamp()
to see if it's actually needed. This risks missing a location that should have been checked.INT_MAX
to 2^29-1 and crossing ones fingers and hoping that's enough.I think your proposal 2. is my (a) and your proposal 3. is my (d)
parse/histogram.h
I don't like the muddle.The question is really "do I feel like a perfectionist today?" and "Do I have the mental energy to do this?" Maybe I should try, and you can review my work?
(b) Comment out all uses of parse_countclamp(), review each use of hist* and determine which ones need to be checked, and add parse_count_clamp() back in, as needed; remove all other uses of parse_count_clamp() (c) Some muddled alternative to (b), involving reviewing each use of parse_count_clamp() to see if it's actually needed. This risks missing a location that should have been checked.
I already mostly analyzed all of that in my post above, and my conclusion was that the changes I proposed in (2) will fix the problem:
parse_count_clamp()
call marked as "OVERFLOW1".parse_count_clamp()
calls: Just before using leftcount
and rightcount
in a multiplication
(just before the lines if (0 < hist_total(&leftcount))
and if (0 < hist_total(&rightcount))
).parse_count_clamp()
intact.I forgot to analyze the impact of line 1390 after applying my proposed fix.
https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1390
In that case, leftcount
is clipped. total
is already clipped from the existing call to parse_count_clamp()
at the end of the match loop, and count
is read from a clipped table entry (or do_count()
result, which is clipped). So the max. total is (2^31-1)
+ ((2^31-1)
(2^31-1)
) < 2^62
.
Then the multiplication
https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1428
can add to it at most ((2^31-1)
(2^31-1)
) < 2^62
, so the result is still less than 2^63
.
(it's very subtle, I hope I didn't miss something...).
(d) changing INT_MAX to 2^29-1 and crossing ones fingers and hoping that's enough. It seems to me it may not hurt to do it too.
In any case, I can send a PR for my proposal if it seems fine to you. In case you choose to do it yourself, I can review it of course.
Yes, send a pull req. At each location, add comments such as /* can add at most ((2^31-1) * (2^31-1)) < 2^62, so the result is still less than 2^63 */
... because it is subtle, and hard to review.
I said:
- The
parse_count_clamp()
call that has a debug printout "OVERFLOW1" is not needed since the loop can be performed no more than(2^8 * 2)
times
The number (2^8 * 2)
is incorrect, since the loop is over the disjunct. But assuming less than 2^31 disjunct per word, an overflow here is still impossible.
It doesn't seem reasonable that the program can ever handle billions of disjuncts per word, when a large part of them leads to a very high count number (overflow or near overflow). So at most, a sanity check can be done (elsewhere, when the disjuncts are counted for creating the disjunct memory block) that the number of disjuncts is less than 2^31.
a sanity check can be done
I didn't implement that, seems just as an unneeded overhead...
billions of disjuncts per word,
That won't happen. There won't even be a million.
However, in the earlier days, I had SQL dicts that had a hundred UNKNOWN-WORD
, each of these with maybe 10K disjuncts. These caused combinatoric overflows and parsing timeouts, even if I set parse-timeout=600
There won't even be a million
In generations mode, we have ~3M disjuncts per word (in the middle of sentences). However, as we know this causes a vast slowness... The speed can be improved (a WIP), but the real solution is to implement the disjunct sampling method you have suggested. However, this will be for a future version (i.e. not for 5.11.0).
While running tests on "amy" with ASN/UBSAN, I got this:
This just happens to never occur with the other languages and the various corpus-* files, but theoretically, it could happen due to the current code.
All the calculations are done in signed 64-bit (the storage is in signed 32-bit due to a recent change, but this has no implications here). It is signed 64-bit due to historical reasons, but it can be a good idea to keep it signed because overflows could then be detected by the compiler's UBSAN checks.
The problem arises from the clipping value. It is
INT_MAX
, which is2^31-1
. Observe the following code: https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1371-L1381 Each of the 4 do_count() calls may return INT_MAX, and henceleftcount
(and similarlyrightcount
) can be up to4*(2^31-1)
=2^33-4
.Then we have this code: https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1390 We may get here up to (
total
+(2^33-4)
*(2^31-1)
) which may be >2^63
.And we also have this: https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1428 So we may get here (
total
+(2^33-4)
*(2^33-4)
) = which is much more than2^63
(andtotal
here may already be near or >2^63
.Possible solutions:
leftcount
andrightcount
before using them in the multiplications.parse_count_clamp()
, clamp to2^29-1
.(2) and (3) will add a slight overhead, but by analyzing the overflow possibilities I found some places in which the efficiency can be improved:
CACHE_COUNT()
, c is unnecessarily too wide, andcount_t
can be used instead. However (I didn't check) maybe the compiler already does such an optimization.parse_count_clamp()
call that has a debug printout "OVERFLOW1" is not needed since the loop can be performed no more than(2^8 * 2)
times, and accumulate no more than(2*31-1) per iteration (max. total ~
2^40)). To the final
parse_count_clamp()("OVERFLOW2"), after potentially accumulation up to additional
2^31-1` is enough.>>>>>>>> For now, I would choose option (2).
@linas, Please check if my analysis is correct, and I will send a PR (if needed).