nasa / CF

The Core Flight System (cFS) CFDP application.
Apache License 2.0
79 stars 45 forks source link

Possible race condition with creation of throttle sem (needs to exist before CF initializes) #184

Closed jphickey closed 1 year ago

jphickey commented 2 years ago

Describe the bug CFE ES starts all applications in their own thread. Therefore, conceptually at least, all apps are starting at the same time.

If configured to use a throttle sem, the CF app expects that semaphore to be created before it starts. During startup, it will attempt to bind to that semaphore during CF_CFDP_InitEngine(), and if that fails, CF aborts (see #178).

Problem is, if the semaphore is created by another app, whether it be CI/TO or some other dedicated I/O app, there is no guarantee that the semaphore has been created before CF attempts to use it.

Secondary problem exists if the I/O app that owns the sem gets restarted or reloaded, the semaphore ID will likely change too. This may be recoverable by disabling the engine and re-enabling it (but haven't tested that).

To Reproduce Its a race condition, so not readily reproducible. Start CF before the app that creates the sem (still not guaranteed, but increases the chance the race will be lost) Add an artificial delay during startup for the app that creates the sem (just further increases the chance the race will be lost)

Expected behavior Should be guaranteed via sync mechanisms, or shouldn't be a hard error (e.g. maybe retry to bind later?)

Suggestion that CF might still start up but with the engine in a disabled state, so at least someone can correct the condition and enable the engine, rather than having CF abort/exit.

Adding a call to CFE_ES_WaitForStartupSync() before starting the engine might help too...

Code snips https://github.com/nasa/CF/blob/2a024d8efd9f44f54b7a59f7face24f255536236/fsw/src/cf_cfdp.c#L1014-L1015

System observed on: Ubuntu 21.10

Reporter Info Joseph Hickey, Vantage Systems, Inc.

jphickey commented 2 years ago

calling this a bug because its a race condition, although simply starting CF after the I/O app seems to avoid hitting it most of the time, it's lurking there nonetheless.

jphickey commented 2 years ago

I'm running into this issue with the integration of BP + CF ... BP creates the throttle sem, and although CF is started after BP, it still hasn't been created by the time CF checks for it so it fails out. Can hack it to get around the race for now, but we probably need an actual solution to this.

jphickey commented 1 year ago

This old issue is becoming a real pain point for BP startup, it hits this race quite often and CF bails out.

I'm going to submit a PR with a possible workaround.

skliper commented 1 year ago

Just wait until the sem is created? Loop w/ a 1 hz sleep?

jphickey commented 1 year ago

Just wait until the sem is created? Loop w/ a 1 hz sleep?

Yes, pretty much.... I had been locally using a patched/hacked version of CF with a sleep but now I'm making some github workflows and it would be much cleaner if there was something in the main line to deal with the sem creation.