nasa / cFE

The Core Flight System (cFS) Core Flight Executive (cFE)
Apache License 2.0
413 stars 204 forks source link

CFE intermittently showing invalid message ID errors #1073

Closed jphickey closed 3 years ago

jphickey commented 3 years ago

Describe the bug When leaving CFE running for long periods of time, I am seeing occasional errors related to bad message IDs/Commands appear.

For instance: EVS Port1 66/1/CFE_EVS 5: Invalid command packet, Message ID = 0x00001810

I was actually able to catch one of these in a core dump, and I can confirm that SBBufPtr->Msg looks like a perfectly well-formed message with MsgID of 0x1810. This particular example was caught at https://github.com/nasa/cFE/blob/56397a3a2c47c674892c14610e78263030441d73/fsw/cfe-core/src/evs/cfe_evs_task.c#L378.

(gdb) print /x *((CFE_MSG_CommandHeader_t*)SBBufPtr)
$2 = {Msg = {CCSDS = {Pri = {StreamId = {0x18, 0x10}, Sequence = {0xc0, 0x0}, Length = {0x0, 0x1}}}, Byte = {0x18, 0x10, 0xc0, 0x0, 0x0, 0x1}}, Sec = {FunctionCode = 0x0, Checksum = 0x0}}

What is particularly mysterious is that 0x1810 is CFE_TIME_TONE_CMD_MID ... These MIDs should be delivered only to CFE_TIME - EVS does not subscribe to these - yet for some reason it appears to be delivered to EVS....

To Reproduce Build CFE in default debug config (SIMULATION=native, etc). Start up CFE core and just let it run. It may take several hours for the first error to appear. CFE appears to continue running normally afterwards though.

Expected behavior EVS (or other apps!) should not randomly see messages it didn't subscribe to.

System observed on: Ubuntu 20.04 (native)

Additional context Not sure when this started because everything seems fine when debugging CFE for shorter time periods. But over the last couple weeks I've noticed these messages randomly appearing in terminal windows where I've started CFE and forgotten about it (overnight or longer).

Although I cite EVS here (because its the one I actually caught in a core file) this phenomenon occurs in other apps too (SB, TIME, etc) all randomly getting a delivered message ID they didn't subscribe to.

Reporter Info Joseph Hickey, Vantage Systems, Inc.

jphickey commented 3 years ago

I am starting to dig deeper into this - don't have any more info just yet - but was wondering if perhaps other users have noticed this too? Given the intermittent nature and that it takes hours to reproduce this, it might be a while before I can track it down sufficiently.

Seems like it could be a race condition where the buffer is perhaps reused before it was freed, or the routing in SB delivered it to the wrong place somehow.

astrogeco commented 3 years ago

Might be worth sending a message to the mailing list. I'll post on the internal MS Teams channel.

jphickey commented 3 years ago

I am pretty sure it is something fairly recently added. As I said I cannot pinpoint exactly when (yet) but I do frequently leave CFE running in one of the many terminal windows buried on my desktop - and this didn't happen before. So only users of bleeding edge dev builds might see this - Not official release users.

And for users that do have a build of the latest code and have the ability to let it run for an extended period - please try it and comment if you see this too. I am wondering if it might be affected by multi-core systems or something like that.

skliper commented 3 years ago

Ran overnight (SIMULATION=native on Ubuntu 18.04 VM) w/ no occurrences yet. Just the flywheel toggle.

skliper commented 3 years ago

I'll do a debug build with the CFE_TIME_TONE_CMD_MID sent at a high rate, maybe some extra reporting to see if I can increase the frequency of the bug.

jphickey commented 3 years ago

I also ran two instances overnight (in VM) - one was a vanilla/unmodified "main" build and the other had debug instrumentation added. Since last night the unmodified build has accumulated 3 errors:

1980-012-14:11:19.65671 ES Startup: CFE_ES_Main entering OPERATIONAL state
EVS Port1 66/1/CFE_TIME 21: Stop FLYWHEEL
EVS Port1 66/1/CI_LAB_APP 4: CI: invalid command packet,MID = 0x1810
EVS Port1 66/1/CFE_SB 43: Invalid Cmd, Unexpected Msg Id: 0x1810
EVS Port1 66/1/SAMPLE_APP 5: SAMPLE: invalid command packet,MID = 0x1810

Of course the build with debug instrumentation added has accumulated none yet :-/....

skliper commented 3 years ago

Typical. I updated HK messages to "1 Hz" in sch_lab and then I made "1 Hz" 100 Hz from the time service timer config, no errors yet but hopefully I can trigger one... all we send is HK messages and the time at tone message (that I can think of) and it happens in a burst. I haven't reviewed priorities in a while...

jphickey commented 3 years ago

OK - I was able to catch several of these incidents and get core files of the system state when it happened, along with one where I had also started the perf log. With another day of runtime, both of my builds eventually accumulated some errors.

This core files confirm that SCH_LAB was actively running and generating the SEND_HK requests, while CFE_TIME "tone task" was also actively running to generate the CFE_TIME_TONE_CMD_MID (0x1810) message. Ultimately it looks like the same buffer was issued to both messages at the same time.

After doing some inspection on CFE SB, it appears to be missing locks here. In particular getting the buffer from the pool used to be done while holding the SB global lock prior to the recent SB refactor, but in the current CFE_SB_TransmitMsg it is not locked. Incidentally the inverse operation (CFE_SB_DecrBufUseCnt) is still done under lock.

I also see other areas of SB that are accessing globals and therefore should be locked, too... will need to a scrub of this code and make sure it has all the locks it needs.