sourcefrog / conserve

🌲 Robust file backup tool in Rust

Antivirus blocking file IO causes a backup to hang #172

Closed WolverinDEV closed 2 years ago

WolverinDEV commented 2 years ago

Hey, I ran conserve overnight on my D:\ drive and it seems to have hung after around 363 GB.
I've attached a process dump as well as the corresponding .pdb file.

On first analysis, fs::OpenOptions::open seems to stall forever.
The reason is unknown.

Attachments:
220803_conserve_pdb_v0_6_15.zip 220803_conserve_dmp_v0_6_15.zip

Additional Info:
rustc --version: rustc 1.62.0-nightly (cb1219871 2022-05-08). I've built conserve in release mode.

WolverinDEV commented 2 years ago

Update:
Tried again and it hung up on the exact same file:
D:\\Users/WolverinDEV/Downloads/2021_08_07_Downloads/Mad Robots_1.23D.exe

After some googling, it seems that CreateFileW might stall forever, or for a long time, under certain circumstances.
These circumstances need to be evaluated.

I'll try to pin this down this evening.

sourcefrog commented 2 years ago

Hey, thanks for the report. I infer this is while making a backup, not a restore?

My guess would be that this is connected to Windows antivirus protection, which can sometimes get very backlogged in processes that are doing heavy IO.

Does it eventually make progress if you leave it alone or does it stay stuck forever?

Is Windows Defender on or is there a third-party AV? If you exclude the backup directory from scanning does that help?

WolverinDEV commented 2 years ago

Uhh, the AV might be a good catch. I've got an outdated G-Data installation. I also know that I disabled IO monitoring (especially if you're building a lot, IO monitoring really hits performance); only the file scan is activated.

It's stuck "forever", as long as > 8 h counts as forever. Nevertheless, this is something to avoid :)

WolverinDEV commented 2 years ago

Update: it was in fact G-Data blocking the file access.
Since it ran at night I didn't see the popup dialog, and G-Data seemed not to ask again.
What's even funnier is that, because G-Data was blocking the file, my whole PC crashed.

But the takeaway still stands, imo.
Somehow (and I don't know how yet) try to avoid deadlocking when antivirus software stalls file access.
Maybe put this in a new issue, though.

sourcefrog commented 2 years ago

I agree this is a bad user experience and it would be good to somehow avoid it, but if something at the system level is blocking file IO that seems difficult to deal with...

Maybe we could print a message if a particular file IO is taking a long time. But that would introduce some overhead on all backups to deal with this edge case...
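As a minimal sketch of that idea (not conserve's code), one could run the blocking call on a helper thread and print a note if it hasn't returned within some threshold. The `open_with_warning` name and the 30-second threshold are made up for illustration:

```rust
use std::fs::File;
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical helper, not part of conserve: open a file on a worker thread
// and warn if the call is still blocked after a threshold.
fn open_with_warning(path: PathBuf) -> std::io::Result<File> {
    let (tx, rx) = mpsc::channel();
    let p = path.clone();
    thread::spawn(move || {
        // This open may block indefinitely (e.g. behind an AV scan).
        let _ = tx.send(File::open(&p));
    });
    loop {
        match rx.recv_timeout(Duration::from_secs(30)) {
            Ok(result) => return result,
            Err(mpsc::RecvTimeoutError::Timeout) => {
                eprintln!("still waiting to open {:?} (antivirus stall?)", path);
            }
            Err(mpsc::RecvTimeoutError::Disconnected) => {
                return Err(std::io::Error::new(
                    std::io::ErrorKind::Other,
                    "open worker thread exited unexpectedly",
                ));
            }
        }
    }
}
```

The overhead is one thread spawn and one channel per monitored call, which is the tax on normal backups mentioned above.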

sourcefrog commented 2 years ago

To make this actionable we could

I think @rbtcollins previously hit something like this in connection with rustup?

rbtcollins commented 2 years ago

I think in general one should remember that no sync IO call is guaranteed to complete within any particular time in everyday OS usage.

NFS, FUSE, various anti-malware layers, and failing media can all produce this on Linux, on BSD flavours including Darwin, and, with the exception of FUSE, on Windows.

Windows has a sampling bias that makes it seem much more fragile here, but I think it's really equivalent once the long tail of behaviours on other OSes is considered.

I think just blocking is fine; it is, after all, what most programs do. Showing that the program isn't itself broken is useful too, for instance a byte-rate counter: e.g. showing the overall read rate, min read per file, max read per file, using decaying counters... that kind of thing.
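A sketch of the decaying-counter idea (not an existing conserve or rustup counter): an exponentially weighted moving average of bytes read per second. The `DecayingRate` name and the smoothing factor are invented for this illustration.

```rust
use std::time::Instant;

// Illustrative decaying (exponentially weighted) byte-rate counter.
struct DecayingRate {
    bytes_per_sec: f64,
    last_update: Instant,
    alpha: f64, // weight given to the newest sample, in 0.0..=1.0
}

impl DecayingRate {
    fn new(alpha: f64) -> Self {
        DecayingRate {
            bytes_per_sec: 0.0,
            last_update: Instant::now(),
            alpha,
        }
    }

    // Record `bytes` read since the previous call and update the decayed rate.
    fn record(&mut self, bytes: u64) {
        let now = Instant::now();
        let dt = now.duration_since(self.last_update).as_secs_f64().max(1e-6);
        let sample = bytes as f64 / dt;
        self.bytes_per_sec = self.alpha * sample + (1.0 - self.alpha) * self.bytes_per_sec;
        self.last_update = now;
    }
}
```

A progress display could print `bytes_per_sec` periodically; if it decays toward zero while the process stays alive, that points at a stall in the OS rather than in the program.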

sourcefrog commented 2 years ago

Yep, agreed. There is a progress bar, although it too might stall if syscalls block for a long time.

I added this in https://github.com/sourcefrog/conserve/wiki/Troubleshooting. I don't think it's actionable aside from that. Complain to the AV vendor 😄

rbtcollins commented 2 years ago

Just to add the rustup experience as a data point: we found that close/CloseHandle calls were blocking, which is one of the calls no one intuitively expects to be slow. They turned out to be slow because write calls had to complete their AV processing before the dirty pages could be released, and on Windows (unlike Linux) the process object owns the dirty pages, not the OS; they are tracked via the handle, so closing it has to block.

As we dug through it, we also found slow calls to stat / listdir, chmod, and the like for NFS on Linux installs.

We were able to do something about the impact on performance here: we deferred all these calls - and the logic behind them - to worker threads, which the OS could block to its heart's content, and set a memory limit on the amount of deferred work. Fast-IO situations do pay a tax for this, but as we're dealing with tens of thousands of files, the return is substantial.

Backpressure from the system shows up as memory pressure, but we can actually pressure the system properly, which looks different depending on the specific situation - IO limits, CPU from AV, NFS network latency.

Some of these techniques are relevant to conserve's backup logic. Possibly deferring directory listing, metadata retrieval, and file opening would be useful, and certainly running many read() syscalls in parallel will likely get substantially more throughput from filesystems where readahead on a single file cannot keep the CPU fed. But I haven't done any profiling work on conserve to see whether that is needed.
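A rough std-only sketch of the deferral idea (not rustup's or conserve's actual implementation): a bounded channel carries the queued work, so when the OS stalls the workers the producer blocks instead of accumulating unbounded deferred work in memory. The function name and the 256-entry bound are illustrative.

```rust
use std::fs;
use std::path::PathBuf;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Illustrative only: read many files on worker threads, with a bounded queue
// providing backpressure so a stalled OS never causes unbounded queued work.
fn read_files_in_parallel(
    paths: Vec<PathBuf>,
    workers: usize,
) -> Vec<(PathBuf, std::io::Result<Vec<u8>>)> {
    let (work_tx, work_rx) = mpsc::sync_channel::<PathBuf>(256); // bounded queue
    let work_rx = Arc::new(Mutex::new(work_rx));
    let (result_tx, result_rx) = mpsc::channel();

    let mut handles = Vec::new();
    for _ in 0..workers {
        let work_rx = Arc::clone(&work_rx);
        let result_tx = result_tx.clone();
        handles.push(thread::spawn(move || loop {
            // Hold the lock only long enough to pull one path off the queue.
            let next = {
                let guard = work_rx.lock().unwrap();
                guard.recv()
            };
            let path = match next {
                Ok(p) => p,
                Err(_) => break, // queue closed: producer is done
            };
            // The OS (AV, NFS, failing media) may block this read for as long
            // as it likes without stalling the thread that enumerates files.
            let data = fs::read(&path);
            if result_tx.send((path, data)).is_err() {
                break;
            }
        }));
    }
    drop(result_tx);

    for path in paths {
        // Blocks here once 256 paths are queued: that's the backpressure.
        work_tx.send(path).expect("all workers exited early");
    }
    drop(work_tx);

    let results: Vec<_> = result_rx.into_iter().collect();
    for handle in handles {
        let _ = handle.join();
    }
    results
}
```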

rbtcollins commented 2 years ago

Oh, and one more thing: Rust doesn't do *_at calls today, so every open will do directory traversal for all the path elements you've given it - the full path if absolute paths are in use, or just the relative segments if not - and unless chdir is being done and relative paths then used, that cost is incurred. From bzr we know this is non-trivial. It's possible that a no-op backup might benefit from this.
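For illustration only, a Unix-only sketch of what an *_at call buys, using the raw openat(2) syscall via the `libc` crate (conserve does not currently do this): the directory is resolved once, and each subsequent open only traverses the final component.

```rust
#[cfg(unix)]
fn open_relative(dir: &std::fs::File, name: &str) -> std::io::Result<std::fs::File> {
    use std::os::unix::io::{AsRawFd, FromRawFd};

    let c_name = std::ffi::CString::new(name)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidInput, e))?;
    // SAFETY: c_name is a valid NUL-terminated string and dir holds an open fd.
    let fd = unsafe { libc::openat(dir.as_raw_fd(), c_name.as_ptr(), libc::O_RDONLY) };
    if fd < 0 {
        return Err(std::io::Error::last_os_error());
    }
    // SAFETY: fd is a freshly opened file descriptor that we now own.
    Ok(unsafe { std::fs::File::from_raw_fd(fd) })
}

// Usage sketch: resolve the directory once, then open entries relative to it.
// let dir = std::fs::File::open("/backup/source/some/deep/tree")?;
// let f = open_relative(&dir, "file-in-that-tree.txt")?;
```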

sourcefrog commented 2 years ago

Thanks! Yep, loved your talk on this.

It does do some work to spam out lots of IO from separate threads, which I think (and have even measured) works well in many situations: local SSDs with deep queues, spinning disks with high latency that can sort out head movement across many files, and remote filesystems. But there is lots of room to do more. Yay Rust.

sourcefrog commented 2 years ago

Yep, _at calls would likely help. chdir is complicated when we're reading and writing things across many directories in different threads, though. But probably there is lower-hanging fruit in just filling all the cores with threads and filling all the IO queues.

If/when I get to this I might look for, or even write, a Rust crate that binds to _at calls on Linux. I think the slightly tricky bit will be doing it in a way that is clean and fast on both platforms that do and don't have this. Another exciting possibility in that area would be something on io_uring when safe abstractions over it are ready.

rbtcollins commented 2 years ago

So for _at calls, there are currently three crates: openat, which just does Unix; cap-std, which is a whole new model, nice and fast on Linux but quite syscall-heavy on everything else (and doesn't implement _at calls correctly on Windows); and https://crates.io/crates/fs_at, something I've prepared earlier for you :P