openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com
Other
856 stars 210 forks source link

Undefined shift conditions #538

Open mmcco opened 8 years ago

mmcco commented 8 years ago

Using Coccinelle, I came across two conditions that will likely be optimized away due to undefined behavior:

This is because the only possible outcomes are:

On a related note, it looks like updating PCRE might be a good idea. The bundled version looks pretty old.

HughWilliams commented 8 years ago

What is Coccinelle ... Is it this http://coccinelle.lip6.fr ?

Do you have test cases to demonstrate these possibly outcomes ?

mmcco commented 8 years ago

I don't have test cases - I don't use Virtuoso. I just came across this in the process of en masse auditing.

My point is that, because of how C defines shift operations, that expression doesn't do what you think it does. It's likely that the compiler will simply optimize it out of existence. If it's present in the generated code, it will very likely just evaluate to something other than zero, even when you expect all the bits to be shifted out.

IvanMikhailov commented 8 years ago

I guess I know C to a degree but I can't find a reason for an undefined behavior in both cases. More correctly, they give defined but platform-specific results on processors with non-eight-bit bytes, but I don't care.

mmcco commented 8 years ago

I got the PCRE one fixed upstream.

What exacly is rtol() supposed to do? Maybe you're using a && where you mean to be using a &?

In regards to this:

I guess I know C to a degree but I can't find a reason for an undefined behavior in both cases. More correctly, they give defined but platform-specific results on processors with non-eight-bit bytes, but I don't care.

Nope. The ANSI C standard specifies the following as undefined:

An expression is shifted by a negative number or by an amount greater than or equal to the width of the promoted expression (6.5.7).

Architectures may have a defined behavior for that, but the point is that the compiler doesn't have any obligation to emit these instructions in any particular way. It's likely that that subexpression is optimized out completely.

IvanMikhailov commented 8 years ago

Wow! The second is clearly an error, but other than any "undefined".

Traditionally, rtol is bit rotator: it moves all bits to the left by 1, except the leftmost bit that is moved all bit-length around to the right. So if you apply rotl(x) as many times as many bits are in the type of x, the final result is x again. Its' convenient for making "weak but quick" hashes: it preserves the bit statistics, it fills in high bits even if the hashed sequence is full of low-bit values, it's cheap.

So let's look closer at

#define rotl(x) (x && (1 << (8*sizeof(x)-1)) ? (x << 1) & 1 : x << 1)

equivalent of

#define rotl(x) (  (x && (1 << (8*sizeof(x)-1)))   ?   ((x << 1) & 1)   :   (x << 1))

rotl(0) goes "else" branch hence gives 0 as 0 << 1 . OK. rotl(x) with highest bit not set gives (x << 1) --- the highest 0 appears at the right end, predictably. OK. rotl(x) with highest bit set gives (x << 1) & 1 --- error, it should give (x << 1) | 1, otherwise it drops all bits.

Thus the hash based on this "funny rotl" works but it works worse than it should: it's biased to zero bits in the result. The worst consequence is that up to 1/8 of values can be zero, flooding one item of the hashtable.

Fortunately, the rest of Virtuoso uses

#define ROL(h) ((h << 1) | ((h >> 31) & 1))

both valid and faster.

What tool did you use to catch that error?

mmcco commented 8 years ago

The second is clearly an error, but other than any "undefined".

I think you're misunderstanding me. When used as a boolean (which it is, because you use it as a subexpression for &&), (1 << (8*sizeof(x)-1)) technically means:

if (the shift overflows) {
    do undefined nonsense;
} else {
    true;
}

Clearly, no sane programmer would conditionally trigger undefined behavior, and no sane programmer would write (x && (1 << (8*sizeof(x)-1)) ? (x << 1) & 1 : x << 1) when they meant (x ? (x << 1) & 1 : x << 1). Therefore, a mistake was apparently made here (I was right about this), and I wanted to point it out.

What tool did you use to catch that error?

See above (Coccinelle). It lets you scan large amounts of code for a pattern you know to be problematic.