llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.84k stars 11.46k forks source link

Clang: size of struct with FAM with initialization #62929

Open uecker opened 1 year ago

uecker commented 1 year ago

GCC has an extension for statically initialization of structs with a flexible-array member which clang also supports. Consider this example:

struct foo { int a; char b; char c[]; } x = { .c = { 1, 2, 3, 4 }; };

https://godbolt.org/z/jKxbzEvdo

GCC would allocate 12 bytes for this on x86_64 which is the same as the size of a struct with replacement array. This also corresponds to what programmers typically allocate on the heap for such structs using the size given by sizeof(struct foo) + n * sizeof(char).

Starting with Clang 15 Clang only allocates 9 bytes for such struct. There are three potential concerns with this:

  1. it is not the same as what the GCC extension does
  2. it is not what the common rule for computing the size yields, so when using memcpy with this rule on such a struct this may cause an out-of-bounds access
  3. it is not compatible with the rule in the C standard that says that a struct with FAM behaves as if replaced with the largest array that would not make the struct larger than the object. But the largest such array in the example above would hold only three elements (size 8), not four (size 12).

GCC bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109956

shafik commented 1 year ago

CC @AaronBallman

llvmbot commented 1 year ago

@llvm/issue-subscribers-clang-codegen

efriedma-quic commented 1 year ago

GCC doesn't consistently emit padding; the following doesn't have any padding.

struct bar { int a; char b; char x[3]; char c[]; } x = { .c = { 1, 2, 3 } };

Older versions of Clang don't consistently emit padding either; whether there's padding, or how much padding, depends on the type/initializer/target architecture/etc. When I discovered the bug in Clang, I decided to go with the simplest solution, which was just "no padding". Other amounts of padding are possible, I guess, but I'm not sure what the point would be.

it is not what the common rule for computing the size yields, so when using memcpy with this rule on such a struct this may cause an out-of-bounds access

The correct amount to copy is just the offset of the array plus the size of the array. What "common rule" are you using to compute the memcpy size?

efriedma-quic commented 1 year ago

Oh, wait, is the GCC rule just "always emit sizeof(struct) + sizeof(array) bytes"? So for the following:

struct foo { int a; char b; char c[]; } x = { .c = { 1, 2, 3, 4, 5 } };

We're supposed to emit 3 bytes of padding, in case someone uses the wrong formula to compute the size? I guess we could do that... but in that case, I'm not sure why you're bringing up the C standard.

uecker commented 1 year ago

Yes, this it what GCC seems to do. What you refer to as "correct size" is the minimal size without padding. Whether this is correct or not is another question. I mentioned the C standard because in the C standard there is an example for use of the sizeof + n * sizeof (with malloc). The C standard also has some wording which says that the struct with FAM behaves as if there was the largest replacement arrays as a replacement array so that it still fits the object except that the offset may be different. This seems to imply that you can not store 4 elements in a struct of size 9, but I am also not so sure the size rule above guarantees it. In any case, what I am worried about is less about what the C standard says but that programmers use this rule also for memcpy and this may cause problems if the object is smaller than expected. (also see the GCC bug for discussion).

pascal-cuoq commented 1 year ago

For what it is worth, TrustInSoft Analyzer now implements the extension being discussed, using Clang 15's algorithm to decide the size of the static allocation. It makes sense assuming the smallest size in a static analyzer that provides guarantees about the behavior of the generated binary “for all reasonable C compilers”.

This also means that if there exist programs that use this extension and are written so that they are only safe with GCC 13 or only safe with Clang 14, these programs can be noticed and fixed. (TrustInSoft Analyzer is a commercial tool but gets applied to many open-source dependencies because of the way it works.)

$ cat initialized_fam.c
#include <tis_builtin.h>

struct s { int a; char c; char t[]; };

struct s s0 = {1, 2};
struct s s1 = {1, 2, 1};
struct s s2 = {1, 2, 1, 2};
struct s s4 = {1, 2, 1, 2, 3, 4};

struct z { int a; short s; char t[]; };

struct z z0 = {1, 2};
struct z z1 = {1, 2, 1};
struct z z2 = {1, 2, 1, 2};
struct z z4 = {1, 2, 1, 2, 3, 4};

int main(int c, char *v[]) {
  switch (c) {
      case 0:
    s0.t[0] = 0;
    break;
      case 1:
    s1.t[1] = 0;
    break;
      case 2:
    s2.t[2] = 0;
    break;
      case 3:
    s4.t[4] = 0;
    break;
      case 4:
    z0.t[0] = 0;
    break;
      case 5:
    z1.t[1] = 0;
    break;
      case 6:
    z2.t[2] = 0;
    break;
      case 7:
    z4.t[4] = 0;
    break;
  }
}
$ tis-analyzer -val initialized_fam.c | grep assert
initialized_fam.c:29:[kernel] warning: out of bounds write. assert \valid(&s4.t[4]);
initialized_fam.c:38:[kernel] warning: out of bounds write. assert \valid(&z2.t[2]);
initialized_fam.c:41:[kernel] warning: out of bounds write. assert \valid(&z4.t[4]);