RCU + Sync/async reads.

a-sevin commented 6 years ago

I'm trying to find a standard, safe and efficient mechanism to handle circular buffers.

Julien found this one which seems to be a good candidate : https://lwn.net/Articles/262464 https://wiki.linuxfoundation.org/realtime/documentation/technical_details/rcu

drbitboy commented 1 year ago

Cf. this example. It takes four semaphores, two used as integers to number the contents of the circular buffer so it is known when the buffer is empty (consumers must wait) and when the buffer is full (producers must wait), and two used as binary mutexes to control the indices (pointers) to the head and the tail of the circular buffer, to do it.

The current ImageStreamIO has only an array of functionally-equivalent semaphores, the purpose of which is not yet clear to me.

We don't need to control the elements in the circular buffer, we need to instead control pointers to the head and tail of the circular buffer.

Sidebar: I have yet to find an INDI driver (application) in MagAOX that actually uses the circular buffer: all shmimMonitors (shmim consumers) instead skip over the earliest elements (images) and uses only the latest element, "flushing" the shmim semaphore* to 0.

* roughly equivalent to the inverse of semaphore b->empty in the example link above

drbitboy commented 1 year ago

Going forward, are we agreed that ImageStreamIO, as it exists today, does not implement any thread-safe circular buffer control, other than perhaps enabling producers and consumers to detect when the circular buffer is full and empty, respectively?

oguyon commented 1 year ago

Circular buffers are seldom used. We plan to use them for data logging to relax timing requirements on the data loggers. For our real-time applications, we generally do not have use for circular buffers. If a consumer is late, we do not want to process an old frames - it is better to skip missed frames and process the newest frame. We do need a way for the consumer to know that it has missed frame(s) and this is detected by the frame counter and the semaphore value. If a frame is being written while the consumer gets to it, it is generally better to process immediately (this is the default behavior), even if part of the frame is old and part is new. Alternatively, the consumer may choose to wait for the new frame to be fully written. This is a choice on the consumer side, reading the "write" flag.

DasVinch commented 1 year ago

Quick addition: it's completely possible to foresee using ISIO as an IPC mechanism where synchronization is necessary (but, that's currently mostly out of scope). In that case, in the fashion of the write flag, we could add a read flag set by the consumers. And in such a configuration, it would be up to the producer to respect the read flag and withhold writing (or, in a more elaborate scheme, circular buffers...). It's simplistic, but it maintains the authority of the producer on the SHM object in all cases.

jaredmales commented 1 year ago

I detect some mismatch in what we're talking about as circular buffers.

@drbitboy the multiple semaphores are to support multiple consumers -- each consumer watches its own semaphore uniquely. This has nothing to do with circular buffers.

We do in fact support ImageStreamIO circular buffers in MagAO-X as I think @oguyon is referring to, including in the shmimMonitor CRTP base class. This is what cnt1 is used for in ImageStruct (not semaphores): https://github.com/magao-x/MagAOX/blob/f37eb89488b41931f178b11835d9d8e1c7c1a9ae/libMagAOX/app/dev/shmimMonitor.hpp#L611. We also support circular buffers (wrapping of cnt1) in all of our producers.

@oguyon we do use the cnt1 based circular buffers. We have found in the past that in some cases the downstream CACAO process don't respect it so some WFS cameras don't. But we use ImageStreamIO for our science cameras where we have the simultaneous requirements of no data loss and low latency for WFS&C. As we move towards full telemetry post-processing, all cameras become science cameras so this will be more important elsewhere.

What do we mean here by "thread safe circular buffers"? There is absolutely no concept of multiple writers to the same stream in ImageStreamIO. So there is only one process ever needing to decide where to write next. The issue I worry about is that there is a time when the writer has updated cnt1 but not incremented its semaphores when things could get confused. Similarly the updating of the write flag and incrementing the semaphores is not atomic.

oguyon commented 1 year ago

@jaredmales - thanks for clarifying the distinction between circular buffers and semaphores.

Handling race conditions with "non-atomic" stream updates: Update operations should be performed in a specific order to address your concern about non-atomic stream updating. The convention order is :

embed procinfo metadata in stream (this is a circular buffer)
update data circular buffer
increment cnt0
set write flag to 0
post semaphores

This ensures that all data has been updated when the semaphores are posted (last operation). The assumption is that if a consumer decided to use counter0 as the synchro mechanism, it will not care about the semaphores.

This is implemented by function ImageStreamIO_UpdateIm. At a higher level, all stream updates should be done by processinfo_update_output_stream, which calls processinfo_update_output_stream and also keeps a trace of processinfo metadata into the stream.

On the consumer end, processinfo_waitoninputstream will properly handle synchronization. At a higher level, this is embedded within the macro INSERT_STD_PROCINFO_COMPUTEFUNC_LOOPSTART.

I -think- that if we stick to processinfo_update_output_stream (producer) and processinfo_waitoninputstream (consumer), there is no undefined behavior or race condition.... but very good to double/triple check this ! This mechanism is at the core of milk and has to be bulletproof, so more pairs of eyes looking at it would be very good.

oguyon commented 1 year ago

For anyone confused about how streams, procinfo and FPS interact together, I recommend running "milk-tutorial" and following/reading the instructions. Also run "milk-streamCTRL", "milk-fpsCTRL" and "milk-procCTRL" to follow along.

jaredmales commented 1 year ago

the processinfo_* stuff is not in ImageStreamIO, rather it's in the CLI right?

oguyon commented 1 year ago

Correct. Totally fine to use ImageStreamIO without processinfo. (that's what pymilk does)

jaredmales commented 1 year ago

same for MagAO-X (I think our code was mostly written before processinfo reached its current form)

drbitboy commented 1 year ago

[from @jaredmales:] the multiple semaphores are to support multiple consumers -- each consumer watches its own semaphore uniquely. This has nothing to do with circular buffers.

I am very curious to hear how that works. I've been looking at the code for some weeks now and I still don't see it.

The point of semaphores is that multiple processes use one - not multiple - semaphore(s) as a proxy for access to some resource, or to pointers (such as ->cnt1, and maybe ->cnt0) to an array of resources, and the semaphore ensures, in the canonical case, that multiple processes cannot access any resource that is in the process of being modified.: either a consumer drives the (binary) semaphore to zero which stops the producer from modifying it, or the producer drives the semaphore to 0 which prevents the reader from reading it.

Normally all processes interesting in a resource should be both posting to and waiting for that resource's semaphore, but the logic I have seen has one side only posting and the other side only waiting.

And then there is the poor man's mutex in that non-mutex, non-thread-safe ->write flag.

oguyon commented 1 year ago

@drbitboy this may be useful: https://subarutelescope.org/staff/guyon/publications/2018/2018-06-11_SPIE/cacao/cacao.pdf

flip through slides 5+

drbitboy commented 1 year ago

Okay. So on Slides 8 and 9, the frameGrabber posts to all semaphores which declares "release the hounds" i.e.

the second-stage processes (data logging, send, multiply) have their sem_(timed)wait calls succeed,
the used semaphore values drop to 0 on Slide 10,
those processes start accessing and processing the data (the resources for which those semaphores are proxies) in the WFS image shmim,
that processing continues on Slides 11-14, and
finally the next semaphores in the chain, the proxies for the incremental DM displacement shmim, are posted and increment on Slide 15.

I understand all that.

My question is this: where is the logic that stops the frameGrabber from writing new data to the WFS image shmim during that processing on Slides (10 and) 11-14, while those second-stage processes are reading from that same WFS image shmim?

Without logic to block the frameGrabber from writing to that shmim, that is not "using semaphores," and that code is not thread-safe.

drbitboy commented 1 year ago

With a one-image (-element) buffer (i.e. no circular buffer), and assuming* the primary issue is that a frameGrabber should not overwrite data in a shmim while a downstream process is reading it, here is how semaphores are meant to work: cacao_minimal_drbitboy.pdf; the key item to note is that both producers and consumers use both sem_post and sem_(timed/try)wait on the same semaphores.

One issue not covered there is initialization. Also, each downstream process (reader) will need to keep track of a timestamp or other incrementing value for each image, to ensure it does not try to grab (sem_wait) the semaphore and read the same image more than once.

* I understand that that assumption may not be accurate, but that is the problem semaphores are designed to solve. If the primary issue is something else, then semaphores may not be the solution. Because AFAICT the current code allows an upstream frameGrabber to run free and potentially overwrite data while downstream shmimMonitors are reading those same data. If that is not happening, then it is because the downstream shmimMonitors are fast enough to keep up with the upstream frameGrabber; it is not because of the current semaphore-based implementation.

DasVinch commented 1 year ago

Hey @drbitboy,

We're indeed running under the specification that what matters is that downstream consumers (C) cannot by any means stall an upstream producer (P). This is by design, and I 100% agree that this is an oddball in the software world.

My question is this: where is the logic that stops the frameGrabber from writing new data to the WFS image

Short answer: None, by design. Design and defaults can be reconsidered :)

P posts all semaphores when it's done writing data. Regardless of how many of these semaphores are being used by individual frame consumers. Let's call this one frame N.

Now, a few things can happen:

C is ready to fire and consumes the data within a very short time, and the semaphore goes back to zero. This is the ideal operating mode.
C is too slow in consuming the data... because it was kept busy somewhere else, scheduling mismanagement etc... 1/ C starts reading and is still reading while P starts writing frame N+1 in the buffer. Now this is a textbook race condition, and this is the one that can be addressed by an optional , atomic read flag, or a proper mutex. This is the case where an RCU mechanism could be somewhat worth the double copy cost. This is what I meant to mean in the above comment. 2/ C attempts to read the data while the producer is currently writing frame N+1. We hope to handle this case with the atomic write flag, so C observes that write==1, and waits til P is done writing. This then falls into the next case. 3/ C is ready to read fully after P has published frame N+1 and posted the semaphore again. In this case (except for internal circular buffers....), C observes that the semaphore value is now >1 and acknowledges that it has dropped a frame.

Of the 3 cases, 2 and 3 result in data loss, and only 1 opens the way for data corruption (now, this is a problem). Dropping the occasional frame in the AO loop is not too big of a problem. Data corruption is more of a problem.... even though one image and the next are really similar. It becomes a very serious problem with non-atomic writes in the data buffer (corrupted floats on some architectures, or corrupted strings in the keyword fields).

oguyon commented 1 year ago

In terms of AO performance, the best option if the consumer is late it so read while the data is being written. It is better to read a "corrupted" frame which has only be partially updated than to wait for a whole new one. Data corruption is better than data loss, because in this context, data corruption is partial data loss (part of a frame), while waiting for the write to be completed means we are loosing a full frame. This is the current default (as it should be).

On x86 systems, updating values in an array is atomic (unless the memory is unaligned) at the 8-byte level. So we should be safe with data types spanning 64 bit or less.

IMPORTANT: Should we support hardware platforms where updating values is non-atomic ? Do such systems exist ? @DasVinch points out that string updates are non-atomic, so if a real-time process uses such strings we need to cover this case. This is not easy... how would we detect a new write starting while we're reading a string ? (counter before read compared to counter after string... ? and then what do we do ?) We currently avoid making strings part of the real-time computations... or else... undefined behavior.

oguyon commented 1 year ago

correction: If write duty cycle is <<1, it is better to wait for the new frame as it is likely "just around the corner"

drbitboy commented 1 year ago

I would think that, if any downstream process cannot keep up, the best we can do is drop (i.e. skip) the (hopefully) occasional complete frame, because the only other option is corrupt data.

And if corrupt data are not a deal breaker, then why waste any effort with an implementation of semaphores that is not doing anything? Because the way it's coded now it is not.

oguyon commented 1 year ago

Missed frames should be low probability events, and we are making it a requirement that the consumer is nearly always on time. Typical missed frame rate should be ~< 1e-6. No strict requirement, but missed frames have a very negative impact on performance.

ImageStreamIO is first and foremost designed assuming that the consumers are significantly faster than the producer, but we still need to handle missed/late frames.

drbitboy commented 1 year ago

what is the significance of a semaphore value exceeding 1 (i.e. consumer misses a frame)? Or to put it another way, what if a consumer checks the value of a semaphore after completing a wait on that semaphore, and that semaphore's value is not 0?

Do we care? Because AFAICT the current code does not care.

sanfords commented 1 year ago

Anything > 0 is a signal to "eat all the data".

drbitboy commented 1 year ago

I guess what I am getting at is that the semaphore is probably only a little more efficient for the consumer than waiting for cnt0 or cnt1 to increment, but it is basically the same thing.

drbitboy commented 1 year ago

Anything > 0 is a signal to "eat all the data".

If the data are not in a circular buffer i.e. of length > 1, then there are no data to eat.

DasVinch commented 1 year ago

The original thinking was leveraging semaphore counts so that the reader can be informed it woke up late by accessing the semaphore and the semaphore only.

semaphore==1 immediately means the reader is kinda late but hasn't missed a frame just yet.
semaphore >=2 tells the reader it's already fully late to the party and has missed a frame or more. Of course all of that can be detoured by immediately checking cnt0 (and assuming you're not race conditioning between toggling the sem and checking the cnt0, but that hypothesis needs to stand anyway).

An interesting direction would be to check that the semaphore is still 0 as soon as a reader is done with all its accesses to the SHM. If >0, data corruption is possible and the reader may decide to toss the data. If ==0, then the faster-than-refresh-rate reader hypothesis is confirmed for this iteration.

Most of that could be bundled in the procinfo macros for reading the input stream of a process, or in new ImageStreamIO functions.

drbitboy commented 1 year ago

I just realized: when the location is the CPU and memory is shared, the image->array.raw memory is separate from the circular buffer memory. Also, the is no allocation for circular buffer if the CBsize parameter passes a 0 (or non-positive) value to ImageStreamIO_createIm_gpu.

Is that correct?

If yes, then the ->array.raw is the "working" memory that has the latest data from the producer, and the circular buffer is meant as an historical record or log, so in case the consumer falls behind behind the working memory, there might still be some past data in the circular buffer, perhaps representing several semaphore posting (incrementing) events?

This would also mean the semaphore is only for coordinating access to the ->array.raw memory, and there are no thread-safe traffic controls for accessing circular buffer memory.

Does that all sound right?

oguyon commented 1 year ago

This is all correct.

Rationale (to be discussed):

ISIO was optimized for speed and latency, so the working memory only holds the current frame to minimize memory footprint of RT processes and reduce cost of maintaining cache coherency. The working memory may be shared between a large number of processes (with one writer and multiple readers) We are assuming/requiring that readers are faster than writers, so there is no RCU implementation. RCU would (probably?) incur a cost in performance which we wanted to avoid. We currently have implemented a mechanism to prevent a process to read while the array is being written.

The CB buffer was created as a way to store recent data for non-RT process to grab data without missing frames. This is intended for logging data. The logger then doesn't need to have tight RT requirements, and can be late by a few frames and catch up.

We did not do any benchmarking. Happy to revisit these choices & assumptions. In particular, I wonder if we should enable "safer" modes (such as RCU) with optional features toggled by flags within the ISIO structure.

drbitboy commented 1 year ago

Thank you!

drbitboy commented 1 year ago

I am putting an operational test together i.e. more than "I can create a new, or open an existing, shmim." It probably has too many moving parts to be called a "unit" test, but it will be as minimal as possible.

milk-org / ImageStreamIO

RCU + Sync/async reads. #8