nasa / cFE

The Core Flight System (cFS) Core Flight Executive (cFE)
Apache License 2.0
408 stars 200 forks source link

CFS violates strict aliasing in SB and ES #2601

Open dallen-osr opened 6 days ago

dallen-osr commented 6 days ago

GCC and Clang by default enable optimizations for strict aliasing. This is a problem for CFS users because the way that CFS is architected violates strict aliasing rules - for example, in SB's use of ES memory pools. GCC and clang can be tuned with the -fno-strict-aliasing flag to prevent this from causing undefined behavior.

Data Structures

For the sake of clarity, I will describe 2 data structures used in my examples below here:

typedef union CFE_SB_Msg {
    /* \brief Base message type without enforced alignment */
    CFE_MSG_Message_t Msg; 
    /* \brief Align to support Long Integer */
    long long int LongInt;
    /* \brief Align to support double. */
    long double Double;
} CFE_SB_Buffer_t;

CFE_SB_Buffer_t is a union of CFE_MSG_Message_t and the types with the largest alignment requirements for the platform. CFE_MSG_Message_t is a union of CCSDS_SpacePacket_t and a byte array for its serialized representation.

typedef struct CFE_SB_BufferD {
    /* Links for inclusion in the tracking lists */
    CFE_SB_BufferLink_t Link;
    /* Actual MsgId of the content, cached here to avoid repeat */
    CFE_SB_MsgId_t MsgId;
    /* Current owner of the buffer, if owned by a single app. */
    CFE_ES_AppId_t AppId;
    /* Total size of this descriptor (including descriptor itself) */
    size_t AllocatedSize;
    /* Actual size of message content currently stored in the buffer */
    size_t ContentSize;
    /* Type of message content currently stored in the buffer */
    CFE_MSG_Type_t ContentType;
    /* If message should get its sequence number assigned from the route */
    bool AutoSequence;
    /* Number of active references to this buffer in the system */
    uint16 UseCount;
    /* Variably sized content field, Keep last */
    CFE_SB_Buffer_t Content;
} CFE_SB_BufferD_t;

I think it is pertinent to note that CFE_SB_Buffer_t Content is the last field of CFE_SB_BufferD_t, since that may have bearing on the common initial sequence rule of strict aliasing.

An example using CFE_SB_ReceiveBuffer

To see why this data structure is a problem with strict aliasing, I'll use CFE_SB_ReceiveBuffer as an example:

CFE_SB_ReceiveBuffer takes a pointer to a pointer to CFE_SB_Buffer_t, BufPtr, whose value is *BufPtr, and a queue identifier, PipeId.

CFE_SB_ReceiveBuffer retrieves a buffer from the OSAL queue associated with PipeId, calling OS_QueueGet with CFE_SB_BufferD_t BufDscPtr as its data parameter.

OS_QueueGet internally uses OS_QueueGet_Impl - for examle, the posix version in file osal/src/os/posix/src/os-impl-queues.c. OS_QueueGet_Impl calls mq_receive, using data (CFE_SB_BufferD_t BufDscPtr). Now we must examine whence comes data, via CFE_SB_TransmitMsg.

How Messages are stored in SB

CFE_SB_TransmitMsg is called with a pointer to a structure whose first member is CFE_MSG_Message_t - i.e. a struct whose first member is either CFE_MSG_CommandHeader_t or CFE_MSG_TelemetryHeader_t. Encoded in this data is the total length of the message, which extends beyond CFE_MSG_Message_t. In order to send this message, SB must acquire a large enough buffer from the memory pool using CFE_SB_GetBufferFromPool.

CFE_SB_GetBufferFromPool calls CFE_ES_GetPoolBuf. The memory in this pool ultimately derives from a global variable CFE_SB_Global_t CFE_SB_Global, in its CFE_SB_MemParams_t Mem member, in that member's CFE_ES_STATIC_POOL_TYPE(CFE_PLATFORM_SB_BUF_MEMORY_BYTES) Partition member. CFE_ES_STATIC_POOL_TYPE is a macro defined in cfe_es_api_typedefs.h which is a union of array of uint8 and CFE_ES_PoolAlign_t, which is a union of void*, long long int, and long double.

All this to say that the backing buffer is a byte array. I should at this point stress that by the standard, uint8 and char are not equivalent - and unsigned char and signed char are BOTH distinct from char. This has important ramifications for the particulars of the strict aliasing rule.

Once CFE_SB_TransmitMsg acquires its backing buffer from the memory pool, it copies the message passed to it into a BufDscPtr, and calls CFE_SB_BroadcastBufferToRoute with that BufDscPtr. CFE_SB_BroadcastBufferToRoute in turn ultimately calls OS_QueuePut with that data. OS_QueuePut calls OS_QueuePut_Impl, which calls mq_timedsend - with the memory pool data that is a byte array.

Violation of Strict Aliasing

CFE_SB_ReceiveBuffer assigns to *BufPtr the address of BufDscPtr->Content., which has been obtained from OS_QueueGet, and resides in SB's memory pool as a portion of a byte array. When the caller gets this data, it's represented as a pointer to a CFE_SB_Buffer_t - but really is an arbitrary location in a buffer array in SB. There is a subtle problem with this even before the caller casts it to another type: it already violates strict aliasing.

Per C99 (WG14/N1256, ISO/IEC 9899:TC3, September 7, 2007), section 6.5.7:

An object shall have its stored value accessed only by an lvalue expression that has one of the following types: 76

-- a type compatible with the effective type of the object,

-- a qualified version of a type compatible with the effective type of the object,

-- a type that is the signed or unsigned type corresponding to the effective type of the object,

-- a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,

-- an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or

-- a character type.

76 The intent of this list is to specify those circumstances in which an object may or may not be aliased.

As mentioned previously, a uint8 array (unioned with void* and largest system types) is not a CFE_SB_Buffer_t or a CFE_MSG_Message_t, and is not char. The question becomes, is it acceptable when the primary header is defined as a byte array below its type as a structure? I don't believe this is allowable because it would subvert the type system too much.

It would be completely fine to only use these buffers to transmit and receive data into local copies - but there are also "zero-copy" versions of these functions whose use would be rendered useless by that paradigm, and incur a performance penalty for additional copies. It's frequently noted in discussions related to strict aliasing that when the compiler sees a memcpy where type punning is desired, it will eliminate that: I doubt that optimization is easy to perform with the level of indirections present in CFS and OSAL.

A problem indisputably arises where the CFE_SB_Buffer (with data) stored in the memory pool is accessed via a pointer cast to another type - e.g. some command structure with additional payload data.

Final observations and conclusions.

SB is not the sole user of ES Memory Pools, and the pattern of using byte arrays for various types is a common pattern in flight software in the CFS ecosystem. I think it is prudent to enable the -fno-strict-aliasing flag in all CFS modules, OSAL, PSP, and all apps, to mitigate undefined behavior arising from violating type based aliasing.

Additional Reading

The Strict Aliasing Situation is Pretty Bad What is the Strict Aliasing Rule and Why do we care? gcc may_alias attribute

jphickey commented 6 days ago

Thanks for the feedback. I've investigated the strict aliasing rules and I'm not convinced that anything actually violates them. I can't think of any specific case in the flow where type-punning (via pointer casting) is done. As noted the code makes use of union types but these are all designed to be compatible when overlaid (that is, the union contains a member of the actual type, and the internal data members are accessed through that type).

The only thing I can think of here is related to the way that the pools are managed, in that they use uint8 internally where it could be more correct to use char perhaps?

Do you get any specific warnings or information from the compiler when you compile with -fstrict-aliasing or have you noticed any incorrect behavior when compiled with that option?

jphickey commented 6 days ago

Keep in mind the intent of the aliasing rules here. If there were no rules, then any data "write" could change the value of any other data. This means that the optimizer cannot hold the value of some object in e.g. a register - if any assignment can change any value, then it constantly has to actually [re-]read the value from memory every time it needs it.

The aliasing rules aim to pose some limitations on that based on type - that is, knowing that only certain types of data writes can change a value, it can rely on cached copies of values (again, likely in a CPU register or similar) and not have to go to main memory as often.

Notably -- the compiler/optimizer doesn't necessarily care how you came up with the address of some object in memory. The point is, once you have identified the object as a given type and started using it as that type, that you only access it through pointers of the same (or compatible) type for the lifetime of the object. I don't believe the pools are violating this, because although the pool memory was allocated as a uint8 and the address(es) were calculated using byte offsets, once an address is determined, it casts this to the intended type, then initialized as that type, and from thereon is only accessed as that type or via a compatible type.

dallen-osr commented 6 days ago

have you noticed any incorrect behavior when compiled with that option?

I have not. I think currently analysis fails because of the amount of indirection and SB and ES being separate modules. I think the tooling in general is insufficient because this is a very difficult problem to handle.

specific case in the flow where type-punning (via pointer casting) is done

In CFE_EVS_ProcessCommandPacket, in cfe_evs_dispatch.c:

        case CFE_EVS_SEND_HK_MID:
            /* Housekeeping request */
            CFE_EVS_SendHkCmd((const CFE_EVS_SendHkCmd_t *)SBBufPtr);
            break;

However, that data is not accessed. In CFE_EVS_ProcessGroundCommand:

        case CFE_EVS_ENABLE_EVENT_TYPE_CC:

            if (CFE_EVS_VerifyCmdLength(&SBBufPtr->Msg, sizeof(CFE_EVS_EnableEventTypeCmd_t)))
            {
                Status = CFE_EVS_EnableEventTypeCmd((const CFE_EVS_EnableEventTypeCmd_t *)SBBufPtr);
            }
            break;

In CFE_EVS_EnableEvntTypeCmd in cfe_evs_task.c:

    const CFE_EVS_BitMaskCmd_Payload_t *CmdPtr = &data->Payload;

This is accessing the data in the uint8 array that's stored in an offset in the ES memory pool in SB. This doesn't obey common initial sequence bcasuse it comes from the BufDscPtr, which has SB_Buffer_t as its last member (and must because of the variable data which follows it, which is not described by a type other than the raw byte array).

The only thing I can think of here is related to the way that the pools are managed, in that they use uint8 internally where it could be more correct to use char perhaps?

I think using char is one necessary step, to keep using strict aliasing. I believe the other part is that for SB, buffers and buffer descriptions must be implemented to have separate queues and pools, so the common initial sequence isn't violated... but I will admit, the rules are fuzzy to me here, so that I'm not sure that approach is correct either.

Keep in mind the intent of the aliasing rules here. If there were no rules, then any data "write" could change the value of any other data. This means that the optimizer cannot hold the value of some object in e.g. a register - if any assignment can change any value, then it constantly has to actually [re-]read the value from memory every time it needs it.

I believe that, but I also believe that the current way type based aliasing analysis is handled by gcc is too aggressive. Linus Torvalds has this position, and it's why the linux, OpenBSD, and FreeBSD kernels are compiled with -fno-strict-aliasing.

I don't believe the pools are violating this, because although the pool memory was allocated as a uint8 and the address(es) were calculated using byte offsets, once an address is determined, it casts this to the intended type, then initialized as that type, and from thereon is only accessed as that type or via a compatible type.

I don't think that is correct. I believe the effective type is in fact a uint8 array, and that only pointers acquired via malloc (which itself cannot be implemented in strictly conforming C) may acquire an effective type using that kind of mechanism.

dallen-osr commented 6 days ago

Minimal example on godbolt where icc is available: https://godbolt.org/z/183oKP483 icc considers this practice to violate aliasing rules. gcc and clang do not report it as violating the rules.

jphickey commented 5 days ago

Minimal example on godbolt where icc is available: https://godbolt.org/z/183oKP483 icc considers this practice to violate aliasing rules. gcc and clang do not report it as violating the rules.

I tend to agree with gcc/clang here. I don't see this as violating the rules, at least in the code that is directly in the example. However, if one were to access the buffer directly (as uint8s) and also access it through the Object* pointer, then it would violate the rules. But the simple promotion of a Header* to an Object* (which has a Header as the first element) should not violate the rules, in and of itself. Otherwise OO concepts would be impossible.

dallen-osr commented 5 days ago

gcc actually does complain about it, when called with the flag -Wstrict-aliasing=1 or -Wstrict-aliasing=2. By gcc's man page, 1 is the most aggressive, and least accurate, 2 is aggressive and less accurate, and 3 has the fewest false positives and false negatives. As noted in the gist (What is the Strict Aliasing Rule), it is possible to have aliasing violations that are not caught by the compiler (https://wandbox.org/permlink/dwd9jhy53AF7a2D0 line 15 is not caught unlike 9 and 10). I don't think it's good enough to simply rely on the compiler to tell us when we're right or wrong, because compiler vendors are under no obligation to warn about any and all undefined behavior.

Clang does not properly implement the strict aliasing warnings despite allowing the flags (https://github.com/llvm/llvm-project/blob/main/clang/test/Misc/warning-flags-tree.c#L16). It should also be noted that clang's user manual claims that -fno-strict-aliasing is the default (https://clang.llvm.org/docs/UsersManual.html), but I think that's probably a case of stale documentation.

https://gustedt.wordpress.com/2016/08/17/effective-types-and-aliasing/

A member of WG14 states in this blog post that character arrays must not be reinterpreted as objects of other types. This confirms what I understand as the effective type of the memory pool in ES as being a uint8 array.

https://godbolt.org/z/nhbqjjjP8

gcc with -Wstrict-aliasing=2 consistently warns that the pattern of casting the record's header to an arbitrary object type might break strict aliasing rules. It is also possible to blatantly violate aliasing without a warning being issued.

https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-malloc-function-attribute https://clang.llvm.org/docs/AttributeReference.html#malloc

Maybe using attribute malloc would be appropriate, since the way buffer descriptors work is analagous to the metadata prefixing pointers returned by malloc?

Another approach is that CFE_SB_Buffer_t, or a wrapping type, could be a union of every message type that uses it, and that is the type stored in the memory pool. This would be difficult to maintain but technically correct and, as I understand it, not violating strict aliasing. There is a very significant downside to this approach in that memory use could explode.

jphickey commented 5 days ago

Note that the message generated by gcc is this:

dereferencing type-punned pointer might break strict-aliasing rules

With the operative word here being "might". They are just flagging something that could be wrong, but the analysis to determine if it is actually wrong or not must be done by the programmer, not the compiler.

The reality is that all access to the header fields is still done through an object of type struct Header and thus not breaking the rules. It would only break the rules if you wrote to one and read from the other, while expecting to see the changes from the previous write.

An important thing to note is that strict (pedantic?) adherence to some of these rules would result in highly inefficient and/or impossible to maintain code. There is constant friction between standards writers, compiler developers, and software developers to find a happy middle ground between stuff that's possible and stuff that's safe and stuff that's efficient.

While one could argue that simply allocating a block of memory as uint8 (or even char) and then using it as a different type is violating the strict sense of the aliasing rules, I've never come across a compiler or system for which it didn't work correctly/as expected, provided that it was never actually accessed as a uint8 but only accessed via the intended data type. In contrast, I have definitely come across issues where code did not behave as expected, and it was tracked down to a type-punned pointer, but in all those cases the software was mixing reads and writes to the different pointer types and expecting to see the change reflected in the other one (the alias). But as long as the code only accesses the object through one type at a time, it always works.

dallen-osr commented 5 days ago

I'll preface this by saying that I don't think any compiler today is doing anything to break the expectations of CFS as it is written, and that I doubt it will be a problem for some time. I will also say that I would prefer the standard to be written in such a way that it is unambiguous that static byte arrays may be treated as storage for objects.

I am being pedantic because my perception is that compiler developers favor being as pedantic as possible, and sometimes even disregarding the standard in favor of optimizations: https://gcc.gnu.org/legacy-ml/gcc/2010-01/msg00263.html

Having said all this, the original poster is correct that GCC doesn't implement C99 aliasing as written in the standard regarding unions. We don't do so because we determined that this can't possibly have been the intent of the standard as it makes type-based aliasing relatively useless.

The reality is that all access to the header fields is still done through an object of type struct Header and thus not breaking the rules

I think this is true in SB and MSG inspecting a message returned by CFE_SB_ReceiveMessage, but when that message is used after casting it, it breaks the rules since it doesn't obey any common initial sequence and ES/SB has no information about the type written into the array at that location, in the bytes beyond the CFE_SB_Buffer_t Content member of the buffer file descriptor. I gave an example previously of EVS casting to const CFE_EVS_EnableEventTypeCmd_t *. I actually should have noted SB itself in cfe_sb_task.c dereferences a buffer pointer cast to CFE_SB_EnableRouteCmd_t *. I don't think there is any reading of the aliasing rules in which this is legal behavior, regardless of the lack of visible compiler complaints.

The problem with relying on code doing what has been observed, which violates rules, is that the behavior can be legally changed by future compiler versions.

jphickey commented 4 days ago

The problem with relying on code doing what has been observed, which violates rules, is that the behavior can be legally changed by future compiler versions.

I don't think so. Such a change would break every call to e.g. read() / write() or many other system libraries that pass objects via void pointers or an opaque memcpy(). There is a massive amount of historical precedent for code like this in C, and the compiler authors can't diverge that much from established paradigms over 40 years.

This is actually part of the reason for using C in the first place, is that it gives you low(-ish) level control over how and where your objects appear in memory. If they broke that for whatever reason, they'd have to give the resulting language a new name, like Rust or something. It wouldn't be C anymore.

dallen-osr commented 4 days ago

Functions like read and stdlib calls like memcpy get special privilege by virtue of being implemented through system calls and assembly. Also for memcpy, it's completely fine to alias any object to (not from) a char *. The standard made this part of the rules because it would break everything. The chief problem here is going from a byte array to an object.

memcpy is implemented partially in assembly. Using the i386 path in glibc as an example, it invokes macros for BYTE_COPY_FWD and PAGE_COPY_FWD, which wraps WORD_COPY_FWD. These macros are implemented as assembly, which the compiler can't treat with the same semantic analysis as if they were written normal C expressions. https://github.com/lattera/glibc/blob/master/string/memcpy.c https://github.com/lattera/glibc/blob/master/sysdeps/i386/memcopy.h#L26

read, at least on linux, is implemented as a system call, and linux itself is built with the -fno-strict-aliasing flag: https://lore.kernel.org/all/b3itcd$2bi$1@penguin.transmeta.com/ I can't speak too much to exactly how the posix syscalls are implemented in VxWorks and RTEMS since my experience with them is much more limited, but I would suppose that they're pretty careful about avoiding UB.

If they broke that for whatever reason, they'd have to give the resulting language a new name, like Rust or something. It wouldn't be C anymore.

I appreciate the joke - and I'll take the opportunity to mention that rust is currently a no-go for CFS and flight software in general because of a lack of first class support for platforms that aren't Linux, OSX, or Windows on arches that aren't x86 or ARM. There's also problems about how panics work and some other concerns, but suffice it to say that it's a major engineering effort to make rust amenable to flight software environments.