nasa / cFE

The Core Flight System (cFS) Core Flight Executive (cFE)
Apache License 2.0

SysLog Support for Incremental Reads #956

Open jsjoberg01 opened 3 years ago

jsjoberg01 commented 3 years ago

The concept is to use the logs for forensic analysis: debug-level information, but always present rather than exceptional. The problem is that by the time you know you want a log, it will be gone unless you have already gotten it to a persistent, accessible storage place.

The algorithm I'm trying to support is, on a timer:

The current API supports reading all the logs in the buffer at once and then clearing the log. This has a race condition: any logs written while the buffer is being read out are lost, or else all loggers must be locked out during the read, which is not acceptable.

The proposed update to the API is the addition of a single function that, instead of setting the buffer size based on EndIdx, sets it based on the difference between the Buffer->LastOffset passed in and the WriteIdx. This is a traditional circular-buffer interface and is well supported by the code as it stands; it just needs an API to initialize the Buffer correctly.
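
As a rough sketch of the proposed initialization, assuming a byte-oriented circular log with a wrapping WriteIdx (the names SysLog_ReadContinue, SysLog_t, SysLogReadBuffer_t, and the field layout here are illustrative, not the actual cFE API), SizeLeft would be the wrap-aware distance from the caller's LastOffset to the current WriteIdx:

```c
#include <stddef.h>

#define SYSLOG_SIZE 64 /* illustrative buffer size */

typedef struct
{
    char   Data[SYSLOG_SIZE];
    size_t WriteIdx; /* next position the logger will write (wraps) */
} SysLog_t;

typedef struct
{
    size_t LastOffset; /* caller's read position from the previous pass */
    size_t SizeLeft;   /* bytes available to read this pass */
} SysLogReadBuffer_t;

/* Initialize Buffer for an incremental read: only the bytes written
 * since Buffer->LastOffset are considered unread. */
void SysLog_ReadContinue(const SysLog_t *Log, SysLogReadBuffer_t *Buffer)
{
    /* wrap-aware distance from LastOffset to WriteIdx */
    Buffer->SizeLeft =
        (Log->WriteIdx + SYSLOG_SIZE - Buffer->LastOffset) % SYSLOG_SIZE;
}
```

After consuming the data, the caller would record the WriteIdx it observed as its new LastOffset, so the next timer pass picks up only what was appended since.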

Advantages are:

jphickey commented 3 years ago

The current API supports reading all the logs in the buffer at once and then clearing the log (with the race condition that any logs written during the read-out are lost, OR all loggers must be locked out during the read, which is not acceptable).

Have you identified a race condition in the existing ES syslog implementation here? If so I think that warrants a separate bug ticket for investigation.

jphickey commented 3 years ago

My general recommendation for a persistent syslog is to leverage a PSP-based solution if possible, to write all OS_printf() information into whatever form of persistent storage your platform offers that is best suited for the task. For instance, on Linux these can be piped to the system log daemon, which presents them in conjunction with all other system activity (useful), automatically compresses/archives old logs per its configuration, and can even send them off-board to an external archive. This may require some mods to the PSP/OSAL to implement, but it really isn't complicated at all to do.

Note that CFE ES APIs are generally governed by the CFE requirements document, so new functionality and features at this level will require a lot more process to implement (stakeholder buy-in, requirements, rationale/use cases, unit tests, etc). Not that it can't be done .... but solutions that can be localized on a platform via PSP are much easier to implement in this regard.

jsjoberg01 commented 3 years ago

I believe this additional function is covered under the same requirement that drives the CFE_ES_SysLogReadStart_Unsync function, as it is a sister function.

The PSP solution would also lose the cFS logs/events and only capture my application's output, which is not desirable for forensics. Besides, I would rather leverage all the work and coding that has been put into cFS as opposed to doing a one-off solution.

jsjoberg01 commented 3 years ago

The current API supports reading all the logs in the buffer at once and then clearing the log (with the race condition that any logs written during the read-out are lost, OR all loggers must be locked out during the read, which is not acceptable).

Have you identified a race condition in the existing ES syslog implementation here? If so I think that warrants a separate bug ticket for investigation.

I'm not sure it is a bug in the code so much as a bug in the model. The code takes a lock on the shared data and copies the important information into a local buffer, which is good. But, per the model, the only way to know what has and has not been read is to call SysLogClear after all the reads. In the interim between the initial call to ReadStart and the subsequent SysLogClear, calls to Append or Write can occur (in a multi-threaded system), and those logs are lost on the call to SysLogClear. The only way to prevent log loss in a multi-threaded system under the current model is to hold the lock for the entire read-out and block all loggers.

The model is fine for a single threaded system, or one that can tolerate blocking of all threads.
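
A toy model makes the lost-log window concrete (entries are only counted, not stored, and none of these names are the real cFE ES functions):

```c
#include <stddef.h>

/* Toy model of the read-then-clear window. Names are illustrative. */
typedef struct
{
    size_t Count; /* number of entries currently in the log */
} ToyLog;

size_t ToyReadAll(const ToyLog *Log) { return Log->Count; } /* reader snapshots */
void   ToyAppend(ToyLog *Log)        { Log->Count++; }      /* another task logs */
void   ToyClear(ToyLog *Log)         { Log->Count = 0; }    /* wipes everything */

/* Simulate the window: entries appended between the read and the clear
 * are discarded without ever being seen. Returns how many were lost. */
size_t LostInWindow(ToyLog *Log, size_t AppendsDuringWindow)
{
    size_t Seen = ToyReadAll(Log); /* reader copies out what exists now */
    for (size_t i = 0; i < AppendsDuringWindow; i++)
    {
        ToyAppend(Log); /* loggers keep running while reader works */
    }
    size_t Total = Log->Count;
    ToyClear(Log);      /* clear discards the unseen entries too */
    return Total - Seen;
}
```

Every entry appended inside the window is unrecoverable, which is exactly why the alternatives are either an atomic dump+clear or holding the lock for the whole read-out.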

jsjoberg01 commented 3 years ago

Moved into comments section of Issue 956 so we aren’t duplicating effort 😊


jphickey commented 3 years ago

Yes - I do concur that, with the "Clear" actually being a separate command from the "Write", there is an opportunity to miss any messages that get written between the two operations.

I would consider this a bug - whether it's a bug in the model or the code, it's still a bug - as we should at least have an option for an atomic (at least with respect to the log) dump+clear command to avoid this hole. We should write a separate issue about this, so we can decide whether that window is worth closing or not.

jphickey commented 3 years ago

The PSP solution would also lose the cFS logs/events and only capture my application's output, which is not desirable for forensics. Besides, I would rather leverage all the work and coding that has been put into cFS as opposed to doing a one-off solution.

No, I'm not talking about making a new PSP function, I'm talking about redirecting all output done via OS_printf() - which includes syslogs - to the destination of your choice. This is fairly easy, and it will get you all log messages, and doesn't require modification of ES to do so. You would still be using all of ES log functions as they are - just adding a platform-specific "tee" to also write all your logs to persistent storage, not just the console.
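
As a simplified sketch of that "tee" idea, here is a stand-in for the OS_printf() body that copies every message to both the console and a persistent file. The real OSAL implementation is more involved (buffering, enable/disable, a utility task), and the log path below is a placeholder, not a cFS convention:

```c
#include <stdarg.h>
#include <stdio.h>

/* Placeholder destination; a flight platform would point this at
 * whatever persistent storage it actually has. */
static const char *SYSLOG_TEE_PATH = "/tmp/syslog_tee.txt";

void OS_printf(const char *Format, ...)
{
    va_list Args;

    /* console copy, as before */
    va_start(Args, Format);
    vprintf(Format, Args);
    va_end(Args);

    /* persistent copy (append so messages survive across calls) */
    FILE *Fp = fopen(SYSLOG_TEE_PATH, "a");
    if (Fp != NULL)
    {
        va_start(Args, Format);
        vfprintf(Fp, Format, Args);
        va_end(Args);
        fclose(Fp);
    }
}
```

Because every ES syslog entry already funnels through OS_printf(), this one platform-level change captures cFE's own messages as well as the application's, with no modification to ES.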

skliper commented 3 years ago

Is this intended for use within the context of an app? If so, I second Joe's recommendation of taking a look at using events (specifically debug events). They provide both reporting via telemetry and recording to file, along with filtering, support for event types, and the capability to trigger other actions since they are messages. Syslog is more focused on the cases before event services is up and running, before the app has registered with event services, or outside the context of an app.

jsjoberg01 commented 3 years ago

" just adding a platform-specific "tee" to also write all your logs to persistent storage, not just the console."

Sorry, by "persistent and accessible" I meant terrestrial storage.

skliper commented 3 years ago

If you do consider using events: the Data Store (DS) app can write the events to files, File Storage (FS) helps with file manipulation, and the CCSDS File Transfer Protocol app (CF) helps in moving files off the target. If you don't want to store on the target, cmd/tlm utilities typically record events as they are sent, with telemetry typically being store-and-forward when the link is not available (so no messages are lost).

jsjoberg01 commented 3 years ago

In truth this is a bug against CFE_ES_SysLogReadStart_Unsync. CFE_ES_SysLogReadStart_Unsync works fine when using LogMode_DISCARD; however, in LogMode_OVERWRITE it is wrong to base SizeLeft on EndIdx, since the actual size is based on the wrapping WriteIdx. The downside of this view is that it causes changes to currently working and fielded code, but it is The Right Answer.

skliper commented 3 years ago

In truth this is a bug against CFE_ES_SysLogReadStart_Unsync. CFE_ES_SysLogReadStart_Unsync works fine when using LogMode_DISCARD; however, in LogMode_OVERWRITE it is wrong to base SizeLeft on EndIdx, since the actual size is based on the wrapping WriteIdx. The downside of this view is that it causes changes to currently working and fielded code, but it is The Right Answer.

Could you provide the steps to reproduce the bug you are seeing, and actual output showing the issue? SizeLeft based on EndIdx (minus the mangled-message part that gets subtracted) seems right to me, since it's the "high water mark" even after the wrap: there are SizeLeft bytes in the log that need to be output, which is true before or after the wrap as long as the mangled part is subtracted. Note we do functionally test this and no errors have been reported, so either we are missing a scenario (which happens) or it's working as intended.
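
For what it's worth, the high-water-mark view can be captured in a toy byte-count model (field names illustrative, not the actual CFE_ES internals): in OVERWRITE mode EndIdx grows until it saturates at the buffer size, so after a wrap SizeLeft = EndIdx reports the entire (full) log, while WriteIdx keeps wrapping.

```c
#include <stddef.h>

/* Toy model of the high-water-mark bookkeeping. Names are illustrative. */
typedef struct
{
    size_t Size;     /* total capacity of the circular buffer */
    size_t WriteIdx; /* next write position; wraps around */
    size_t EndIdx;   /* high water mark; saturates at Size */
} ToySysLog;

/* Append n bytes in OVERWRITE mode. */
void ToyAppendBytes(ToySysLog *Log, size_t n)
{
    Log->WriteIdx = (Log->WriteIdx + n) % Log->Size;
    Log->EndIdx   = (Log->EndIdx + n > Log->Size) ? Log->Size
                                                  : Log->EndIdx + n;
}

/* Dump size under the current model: everything up to the high water mark. */
size_t ToySizeLeft(const ToySysLog *Log)
{
    return Log->EndIdx;
}
```

Under this bookkeeping a full dump after a wrap is the whole buffer, which is correct for the dump-everything use case; the incremental-read use case instead wants the WriteIdx-relative distance since the last read.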

jsjoberg01 commented 3 years ago

Sorry, I'm not currently in a position to show output.

The steps to reproduce are:

Since the buffer is an overwrite/wrapping buffer, the size left should be the number of unread logs, not the total buffer size.

skliper commented 3 years ago

Oh, that's the current expected behavior if you don't clear the log (it's a high water mark, not new messages since the last read), and if you request another dump it will include the 30 new ones plus the old 2. (Although it's actually an index, not a number of entries, but that's a different topic.) So this request sounds more like an enhancement to change how the log works than a "bug". Thanks for the background. Your use case is different from the classic use case of a dump being the entire log every time.

skliper commented 3 years ago

Another side note - it defaults to overwrite for a power-on reset and discard for a warm reset. Even when it's set to discard, it continues writing over the old messages until hitting the end of the buffer. That's probably less than desirable, since the point of changing to discard is to maintain the old messages so they can be dumped.