Open ggggggggg opened 2 months ago
Note that in the File.Sync()
example, the actual sync is done in dsp.Flush()
, which also has to be modified.
I think that doing the Flush operations in parallel is probably a good optimization even if you are not doing a Sync (I was actually more-or-less following some notes in the code that suggested that they should be done concurrently).
Also just want to mention here that O_SYNC is a posix-only flag, so this would be a less-compatible option.
Good point, I was hoping since it is in the os
std library it is cross platform, but chat GPT says the O_SYNC
flag is silently ignored on windows. So maybe sync calls are the way to do. All O_SYNC
does it call Sync
on every Write
I believe, so they should be about the same.
https://chatgpt.com/share/f4d39e8a-0901-40ef-a310-19176958a9e7
I'm not worried about Windows ignoring the O_SYNC
flag. Isn't that the behavior we'd want anyway, given that Windows can't make use of the OS feature? I feel it quite unlikely that we'll be running Dastard on Windows ever, but I'm even more certain that we won't use Windows for our high-performance use cases like BESSY and NSLS-II.
LANL has said they want to run it in windows for HXI, so if there is an easy solution here that is cross platform I think we should pick it over a one platform solution. We've had problems on single PCs (BESSY) as well, so I'm not sure this is just for "high performance" cases.
To use calls to
File.Sync
Jamie recommends
I don't understand this code snippet. It appears pretty similar to what's already there. What is added/modified? Is the wait group the only thing that's new?
I agree with @ggggggggg that if opening a file with a new flag (O_SYNC
) magically makes writes (or flushes?) synchronize in the way that we need, then that seems simpler. More importantly, it seems more likely to survive some future code-editor removing it in a misguided attempt to make the code simpler, or faster, or "better" in some vague way.
But I admit this entire discussion has gone on far too long for me to catch up now. Perhaps I am making the solution out to be simpler than it really is?
Ohh yeah the snippet I posed lacked context... it's actually super unclear. We need to call File.Sync()
in the snippet, it's calling dsp.Flush()
asynchronously, and dsp.Flush()
is calling File.Sync()
. Jamie was arguing to use goroutines to call File.Sync()
since that may block the processing less.
I think we can catch you up in one step. In my testing at BNL, I made changes in two places:
ljh.go
// Flush flushes buffered data to disk
func (w Writer) Flush() {
if w.writer != nil {
w.writer.Flush()
w.file.Sync() // This is the new line that writes the OS file cache to disk before continuing
}
}
and
data_source.go
tStart := time.Now()
for idx, dsp := range ds.processors {
segment := block.segments[idx]
segment.processed = true
dsp.TrimStream()
if (idx+ds.readCounter)%20 == 0 {
wg.Add(1)
go func ( dsp *DataStreamProcessor) {
defer wg.Done()
dsp.Flush()
}(dsp) // This is a new go func that calls the modified Flushes concurrently
}
}
wg.Wait()
For me, both steps helped, though the first step was sufficient. The Sync() in ljh.go is to ensure that data gets flushed to disk, which prevents it from piling up in the OS cache, which is usually too large for our purposes. For reference, our cache is only flushed every 30 seconds, which can mean 6GB of data pile up and then get written all in one go...
The second step calls these Sync() commands are called in parallel. Each sync can take about 50 ms (or more), and if you do them in series, you'd get 600 ms of waiting on every pass through the loop. Seemingly, you can Sync all the caches in parallel, and it doesn't take much more than 50 ms total.
These are the two modifications I made to get things running stably at NSLS-II. I would be very curious to see you test them on your local systems (including writing to a spinning disk, not the network).
We've long had issues when writing to certain drives, such as network drives as NSLS2 or a spinning disk a BESSY. In these cases dastard sees very high latency when writing to disk, and once it's buffers fill, dastard panics. Jamie has found that using regular calls to
File.Sync()
OR using theO_SYNC
flag when opening files resolves this issue at NSLS2.The reasoning is that writes to file have many buffers, we have buffers in dastard, there are memory pages in the OS, then it can get flushed to the "file system" which may have yet more buffering, then it could be delivered to a disk which may also have a write buffer, then it could be committed to durable memory. This buffering seems to be bad for high throughput writing, so we want to essentially opt out of it. Jamie found that his network traffic while writing to a network drive sat at 0 for long periods of time, presumably due to buffering. It may be that this should be optional, because maybe it is bad sometimes.
To use
O_SYNC
, see https://cs.opensource.google/go/go/+/refs/tags/go1.23.0:src/os/file.go;l=373To use calls to
File.Sync
Jamie recommendsMy gut says
O_SYNC
is simpler and would be the preferred choice.