usnistgov / dastard

NIST transition-edge sensor (TES) data acquisition framework

ensure synchronized io for file writing #354

Open ggggggggg opened 2 months ago

ggggggggg commented 2 months ago

We've long had issues when writing to certain drives, such as the network drives at NSLS2 or a spinning disk at BESSY. In these cases dastard sees very high latency when writing to disk, and once its buffers fill, dastard panics. Jamie has found that making regular calls to File.Sync() OR using the O_SYNC flag when opening files resolves this issue at NSLS2.

The reasoning is that writes to a file pass through many layers of buffering: we have buffers in dastard, there are memory pages in the OS, then the data can get flushed to the file system (which may have yet more buffering), then it can be delivered to a disk which may also have a write buffer, and only then is it committed to durable storage. This buffering seems to be bad for high-throughput writing, so we want to essentially opt out of it. Jamie found that his network traffic while writing to a network drive sat at 0 for long stretches, presumably due to buffering. This behavior should perhaps be optional, since forcing synchronous writes could be a net loss in other situations.
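For concreteness, the two layers we can control from Go look roughly like this. This is a standalone sketch, not dastard code; the file name and data are placeholders:

package main

import (
	"bufio"
	"log"
	"os"
)

func main() {
	f, err := os.Create("example.ljh")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Layer 1: an application-side buffer (analogous to the buffering inside dastard).
	w := bufio.NewWriter(f)
	w.Write([]byte("record data\n"))

	// Flush moves data from the application buffer into the OS page cache...
	if err := w.Flush(); err != nil {
		log.Fatal(err)
	}
	// ...and Sync asks the OS to commit the page cache to the storage device.
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
}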

To use O_SYNC, see https://cs.opensource.google/go/go/+/refs/tags/go1.23.0:src/os/file.go;l=373

f, err := os.OpenFile(name, os.O_RDWR|os.O_CREATE|os.O_TRUNC|os.O_SYNC, 0666)

To use calls to File.Sync, Jamie recommends:

var wg sync.WaitGroup // collects the goroutines that flush each file
tStart := time.Now()
for idx, dsp := range ds.processors {
	segment := block.segments[idx]
	segment.processed = true
	dsp.TrimStream()

	if (idx+ds.readCounter)%20 == 0 {
		wg.Add(1)
		go func(dsp *DataStreamProcessor) {
			defer wg.Done()
			dsp.Flush()
		}(dsp)
	}
}
wg.Wait()

My gut says O_SYNC is simpler and would be the preferred choice.

cjtitus commented 2 months ago

Note that in the File.Sync() example, the actual sync is done in dsp.Flush(), which also has to be modified.

I think that doing the Flush operations in parallel is probably a good optimization even if you are not doing a Sync (I was actually more-or-less following some notes in the code that suggested that they should be done concurrently).

Also, just want to mention here that O_SYNC is a POSIX-only flag, so this would be a less-compatible option.

ggggggggg commented 2 months ago

Good point. I was hoping that since it's in the os standard library it would be cross-platform, but ChatGPT says the O_SYNC flag is silently ignored on Windows. So maybe explicit Sync calls are the way to go. All O_SYNC does is call Sync on every Write, I believe, so the two approaches should behave about the same.

https://chatgpt.com/share/f4d39e8a-0901-40ef-a310-19176958a9e7
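If a cross-platform equivalent is ever needed, something along these lines would emulate per-write syncing in plain Go. This is only a sketch; syncWriter is a hypothetical name, not anything in dastard:

package main

import (
	"log"
	"os"
)

// syncWriter is a hypothetical wrapper (not part of dastard) that emulates
// O_SYNC portably: every Write is followed by a File.Sync, so the data is
// pushed toward stable storage before the call returns.
type syncWriter struct {
	f *os.File
}

func (s *syncWriter) Write(p []byte) (int, error) {
	n, err := s.f.Write(p)
	if err != nil {
		return n, err
	}
	return n, s.f.Sync()
}

func main() {
	f, err := os.Create("example.ljh")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	w := &syncWriter{f: f}
	if _, err := w.Write([]byte("record data\n")); err != nil {
		log.Fatal(err)
	}
}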

joefowler commented 2 months ago

I'm not worried about Windows ignoring the O_SYNC flag. Isn't that the behavior we'd want anyway, given that Windows can't make use of the OS feature? I think it quite unlikely that we'll ever run Dastard on Windows, and I'm even more certain that we won't use Windows for our high-performance use cases like BESSY and NSLS-II.

ggggggggg commented 2 months ago

LANL has said they want to run it on Windows for HXI, so if there is an easy cross-platform solution here, I think we should pick it over a single-platform one. We've had problems on single PCs (BESSY) as well, so I'm not sure this is just for "high performance" cases.

joefowler commented 2 months ago

To use calls to File.Sync Jamie recommends

I don't understand this code snippet. It appears pretty similar to what's already there. What is added/modified? Is the wait group the only thing that's new?

I agree with @ggggggggg that if opening a file with a new flag (O_SYNC) magically makes writes (or flushes?) synchronize in the way that we need, then that seems simpler. More importantly, it seems more likely to survive some future code-editor removing it in a misguided attempt to make the code simpler, or faster, or "better" in some vague way.

But I admit this entire discussion has gone on far too long for me to catch up now. Perhaps I am making the solution out to be simpler than it really is?

ggggggggg commented 2 months ago

Ohh yeah, the snippet I posted lacked context... it's actually super unclear. We need to call File.Sync(): the snippet calls dsp.Flush() asynchronously, and dsp.Flush() in turn calls File.Sync(). Jamie was arguing for using goroutines to call File.Sync() since that may block the processing less.

cjtitus commented 2 months ago

I think we can catch you up in one step. In my testing at BNL, I made changes in two places: ljh.go

// Flush flushes buffered data to disk
func (w Writer) Flush() {
	if w.writer != nil {
		w.writer.Flush()
		w.file.Sync() // This is the new line that writes the OS file cache to disk before continuing
	}
}

and data_source.go

var wg sync.WaitGroup // collects the goroutines that flush each file
tStart := time.Now()
for idx, dsp := range ds.processors {
	segment := block.segments[idx]
	segment.processed = true
	dsp.TrimStream()

	if (idx+ds.readCounter)%20 == 0 {
		wg.Add(1)
		go func(dsp *DataStreamProcessor) {
			defer wg.Done()
			dsp.Flush()
		}(dsp) // This is a new go func that calls the modified Flushes concurrently
	}
}
wg.Wait()

For me, both steps helped, though the first step was sufficient. The Sync() in ljh.go ensures that data actually gets flushed to disk, which prevents it from piling up in the OS cache (which is usually far larger than we want for this purpose). For reference, our cache is only flushed every 30 seconds, which can mean 6 GB of data piles up and then gets written all in one go...

The second step calls these Sync() commands in parallel. Each sync can take about 50 ms (or more), and if you do them in series, you'd get 600 ms of waiting on every pass through the loop. Seemingly, you can Sync all the caches in parallel, and it doesn't take much more than 50 ms total.
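The serial-versus-parallel difference is easy to sanity-check with a rough standalone timing harness like the one below. The file count, data sizes, and names are arbitrary, and the numbers will depend heavily on the file system:

package main

import (
	"fmt"
	"os"
	"sync"
	"time"
)

// dirty writes about 1 MiB to each file so the following Sync has real work to do.
func dirty(files []*os.File) {
	buf := make([]byte, 1<<20)
	for _, f := range files {
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
	}
}

func main() {
	const n = 12 // arbitrary number of open output files

	files := make([]*os.File, n)
	for i := range files {
		f, err := os.CreateTemp("", "syncbench-*.ljh")
		if err != nil {
			panic(err)
		}
		defer f.Close()
		files[i] = f
	}

	// Serial: each Sync waits for the previous one to finish.
	dirty(files)
	start := time.Now()
	for _, f := range files {
		f.Sync()
	}
	fmt.Println("serial:  ", time.Since(start))

	// Parallel: one goroutine per Sync, then wait for all of them.
	dirty(files)
	start = time.Now()
	var wg sync.WaitGroup
	for _, f := range files {
		wg.Add(1)
		go func(f *os.File) {
			defer wg.Done()
			f.Sync()
		}(f)
	}
	wg.Wait()
	fmt.Println("parallel:", time.Since(start))
}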

These are the two modifications I made to get things running stably at NSLS-II. I would be very curious to see you test them on your local systems (including writing to a spinning disk, not the network).