jacereda / fsatrace

Filesystem access tracer
ISC License
78 stars 12 forks source link

Better report buffer overflow failures #47

Closed jayzhuang closed 2 years ago

jayzhuang commented 2 years ago

@woody77

jacereda commented 2 years ago

Have you experienced buffer overflows after the recent buffer size increase? If so, do you have an estimation on what the output size was?

jayzhuang commented 2 years ago

Yes, @woody77 saw buffer overflows while tracing a command accessing a ~780K files. We use fsatrace to trace Fuchsia build, and we suspect this is causing flakes in the build, so want to add better diagnostics to confirm.

Also I'm wondering whether it'd be good to dynamically grow the buffer? I'm experimenting with that in https://github.com/jayzhuang/fsatrace/commit/e52fb4d51bf6da566c5d1e2e5469a56f409d68f9

woody77 commented 2 years ago

I put this patch together (or a version of it) for @jayzhuang based on realizing that I was seeing a buffer overflow. I pushed LOGSZ up to 64MB from 16MB, and saw that we were capturing a 29MB trace.

There's likely a separate bug as to why the traces are so large, but the way it overflows and crashes the task that it's tracing makes it really hard to understand what's going on.

And so some protections around that (either by stopping tracing and then logging that the buffer was filled), or by growing the buffer, would be good things to have.

What I like about this patch is that the check is very lightweight, and it doesn't create hard-to-diagnose failures (crashes) in the task that it's tracing.

I think I'd also like to make LOGSZ something that can be set by argument (since that should be straightforward to do).

I think I'd prefer a combination of those (argument to set the size of shm.buf to use, and a clean termination of logging when the buffer is full), vs. dynamically growing. At least in our case, because the need to increase the buffer is an indication that we have some other issues going on that need addressing.

jacereda commented 2 years ago

Yes, @woody77 saw buffer overflows while tracing a command accessing a ~780K files. We use fsatrace to trace Fuchsia build, and we suspect this is causing flakes in the build, so want to add better diagnostics to confirm.

I wouldn't mind increasing the buffer size to, say, 64MB. My gut feeling is it won't make a difference in build times.

But I'm having a hard time trying to figure out what kind of process can do that, are you sure there isn't a bug in such process causing unnecesary accesses?

Also I'm wondering whether it'd be good to dynamically grow the buffer? I'm experimenting with that in jayzhuang@e52fb4d

This tool is critical for my builds and I try to keep it as simple as I can.

The time I can devote to this is very limited and my needs are covered by the static sizes.

I understand Google's needs might not be aligned with mine, if you feel this is required on your side feel free to keep a fork.

jayzhuang commented 2 years ago

Thanks jacereda@, we really appreciate your time and the tool you built, it's really useful in our build and your timely responses are super helpful.

In Fuchsia's build, majority of the processes were happy with even a 1MB buffer, there are a few commands we are having trouble with, for example the @woody77 found the buffer overflow issue when running a Dart analyzer command. It's quite possible there are bugs in those tools and we are looking into them at the same time.

I agree with you, simplicity is better than complexity. Although in our case, keep bumping up a static buffer size is definitely not ideal. Totally understand our needs may not align, I'll discuss with folks on forking this repo.

woody77 commented 2 years ago

Yes, @woody77 saw buffer overflows while tracing a command accessing a ~780K files. We use fsatrace to trace Fuchsia build, and we suspect this is causing flakes in the build, so want to add better diagnostics to confirm.

I wouldn't mind increasing the buffer size to, say, 64MB. My gut feeling is it won't make a difference in build times.

Being able to set the size via the CLI might be a nice middle ground, and use a smaller default, allowing us to use larger buffers if we need to.

But I'm having a hard time trying to figure out what kind of process can do that, are you sure there isn't a bug in such process causing unnecessary accesses?

Oh, I'm pretty sure that there's a bug, but because it's crashing, we don't get to see the output. By switching to a graceful exit, users can better see what is going on when it runs out of memory.