maharmstone / btrfs

WinBtrfs - an open-source btrfs driver for Windows
GNU Lesser General Public License v3.0
5.77k stars 222 forks source link

Major performance degradation after period of use #346

Open lesderid opened 3 years ago

lesderid commented 3 years ago

I've been using WinBtrfs for the past couple weeks on my secondary disk, mainly for Steam games and file sharing (both sequential and random-access, both small files and large files).

Since I've started using it, I've been experiencing regular deadlocks, or at the very least major performance degradation, every so often, causing e.g. Windows Explorer to stop responding. The only 'fix' has been to hard reboot when it happens. It seems to happen whether there's a pagefile on the volume or not.

WinBtrfs version is 1.7.6, Windows 8.1 (latest build). Compression is enabled, but not forced, algorithm is zstd (default level), no other configuration changes. File system was formatted with mkbtrfs.exe. Hasn't bugchecked.

Possible duplicate of #58, #338, #345

lesderid commented 3 years ago

I tried collecting some data for debugging by copying (with COPY /B in a batch file) a 4GiB file with random data from NTFS to Btrfs on a Hyper-V VM. The guest has two disks, backed by files on the same NTFS HDD, one formatted as NTFS and one formatted with mkbtrfs.exe right before testing.

Guest (and host) running clean Windows 8.1 x64, with most inessential Windows services disabled, and Driver Verifier enabled with standard settings. I'm using a debug build of 04da8409ca7b56d809437d61db2f5a1a566f451a (with some logging changes). A release build with logging disabled didn't seem to make a difference.

WinBtrfs logs (split for easier viewing/filtering): https://files.lesderid.net/btrfs-windbg-logs.0.tar.gz

Locks: https://pst.moe/paste/oiwnoy

It looks like it starts writing in 1MiB byte chunks and then slows down to 4KiB chunks and occasionally stalls? Running the second log file through uniq -D | uniq -c:

  62177 [btrfs] FFFFE00046751040:fast_io_acquire_for_mod_write:C:\Users\lesderid\msys\repos\winbtrfs\src\fastio.c:275:(FFFFE00048BF8470, b74fd000, FFFFE00048A22DB8, FFFFE00048F64030)
 124048 [btrfs] FFFFE00046751040:fast_io_acquire_for_mod_write:C:\Users\lesderid\msys\repos\winbtrfs\src\fastio.c:275:(FFFFE00048BF8470, b7dc9000, FFFFE00048EF0608, FFFFE00048F64030)
 134375 [btrfs] FFFFE00046751040:fast_io_acquire_for_mod_write:C:\Users\lesderid\msys\repos\winbtrfs\src\fastio.c:275:(FFFFE00048BF8470, b9760000, FFFFE00048EF0608, FFFFE00048F64030)
 144459 [btrfs] FFFFE00046751040:fast_io_acquire_for_mod_write:C:\Users\lesderid\msys\repos\winbtrfs\src\fastio.c:275:(FFFFE00048BF8470, b9e24000, FFFFE00048EF0608, FFFFE00048F64030)
 143159 [btrfs] FFFFE00046751040:fast_io_acquire_for_mod_write:C:\Users\lesderid\msys\repos\winbtrfs\src\fastio.c:275:(FFFFE00048BF8470, aa9c9000, FFFFE00048EF0608, FFFFE00048F64030)
  18872 [btrfs] FFFFE00046751040:fast_io_acquire_for_mod_write:C:\Users\lesderid\msys\repos\winbtrfs\src\fastio.c:275:(FFFFE00048BF8470, aae77000, FFFFE00048EF0608, FFFFE00048F64030)

The behaviour seems to be a little different (maybe due to caching on the host or something) from a real machine, but there's clearly still something that's going wrong.

Edit: this was with compression (zstd) on, but disabling it doesn't seem to make a difference. Edit 2: I tried running sync64, but it just gets stuck waiting on one of the locks.

maharmstone commented 3 years ago

Thanks @lesderid, that's very useful. Could you try against 2818f989deeaa1ae13014fa857cc489e6dc81f64, and see if it's any better?

lesderid commented 3 years ago

Thanks @lesderid, that's very useful. Could you try against 2818f98, and see if it's any better?

No problem. That seems to have fixed it!

szaimen commented 3 years ago

I've just experienced the same issue last night but installed v1.7.7 and can gladly report that it fixes the issue :)

lesderid commented 3 years ago

I'm afraid it isn't completely fixed for me, I came back to a deadlocked system after running some IO intensive workloads overnight. I'll try to investigate.

lesderid commented 3 years ago

I think the problem might be that fast_io_acquire_for_ccflush/fast_io_release_for_ccflush don't do what they ought to do. Unlinking them from the FastIO dispatch (so it falls back to a default implementation) seems to work, but I'm not sure that's correct, and I don't know how I can consistently reproduce this.

Edit: nevermind, that doesn't fix it, looks like I just can't consistently reproduce it.

RigoLigoRLC commented 2 years ago

I'm afraid that I'm experiencing the same issue.

On my PC I have my NTFS drive C (230GB) and Btrfs drive D (700GB with 4 subvolumes). All partitions are on a high performance NVMe SSD. If I perform some IO intensive task on Btrfs volume (like compiling a large C++ project) the system will become gradually unresponsive. Mouse can move but the UI stopped processing clicking and hovering events, and the applications are not being blocked and are still running fine (for example, Discord in Firefox can still receive messages). When the compilation ends, the clicking events that were created during the hang will all be processed. But Windows UI performance will not restore and UI is going to be less responsive, and keep degrading over time. If I close the laptop lid the system sleeps normally, but usually won't wake up to LogonUI anymore, leaving a blue background with a mouse cursor.

System info: Windows 10 Pro 19043.1110 (21H1) Winbtrfs: 1.7.9

kwencel commented 2 years ago

@RigoLigoRLC what you are describing seems more like this issue. However, they all might be related.

lesderid commented 1 year ago

Has this been fixed (since April 2021)? I haven't been using WinBtrfs since then because of the instability.

I've been trying to reproduce it now in a clean WinPE environment and so far I haven't run into it, but I've never been able to reproduce this very reliably.

Small-Ku commented 1 year ago

I am facing the deadlock when I am playing Honkai: Star Rails.

Possible deadlock. Use !locks FFFF8A083BF07378 to determine the resource owner
DBGK: DbgkWerCaptureLiveKernelDump: WerLiveKernelCreateReport failed, status 0xc0000099.

DebugView seems show some log like this but I don't know what can be done further with the provided command in this log.

sumariva commented 6 months ago

OK. I will try help you reproduce the problem of slow performance. Go to the https://www.getmonero.org/downloads/#gui! Yes, it is me again with the crypto things. Get the monero GUI client for windows and install it. During install process, the Microsoft Windows Defender may warn of virus, tell it to ignore the files. Add the monero folder to the user/system path, sou that you can access the monero binaries easily from the command prompt.

Have an btrfs partition with some space available, and compression DISABLED. Start the monero client to begin synchronization like this, it will create the folder, if it does not exists on target: monerod --data-dir=e:\monero --prune-blockchain --out-peers=128

To stop monerod, hit Ctrl+C, to issue an EOF(end of file) char to the input of monerod. You will also notice a long time the monerod needs to wait for I/O on his finalization process. The monerod takes several minutes to finish waiting I/O.

Monero will be a great tool to help you with IO issues. It will log messages to the console, it is I/O intensive and very stable. It uses two files at a folder called lmdb. The data.mdb will grow up to 65GiB.

I started the sync process on a NTFS partition. I let it sync up to 99%. It took about 24h with a 20megabit Internet link. I got out of space when sync reached 99%. So, I moved the large file to the brtfs.

But now, to my surprise, it took one day to sync 400 blocks, due to slow performance on btrfs. Now, I upgraded to the last version 1.9.0 but the issue is still there. I hope I provided you with a good test case. I will attach some screen captures from version 1.8.2. But after the upgrade, the same slow thing is happening. When it was on NTFS, the disk activity, was about 30MiB/s, and disk usage about 100%.

Now on the btrfs, the disk activity is lagging, almost no internet use, since monerod blocked waiting I/O to finish. btrfswin-issue-slow-update See the last modification time at windows explorer. And the system process eating one CPU thread.

btrfswin-issue-slow-update-monerod-iounlocked Between 5 to 15minutes later, the system process finish the flush of data and unlocks, so the OS push monerod to execution again. And a few seconds later, the OS locks it again and the system process begins to eat the CPU again.

Reproducible: always. To restart the sync from beggining, just delete the *.mdb files.

Maybe some issue related to the FAM(File Alteration Monitor). Hope it helps you with input data, for test case! Monerod also have options to choose the number of threads, in case you want CPU stress with I/O!

maharmstone commented 6 months ago

Thanks, I'll give it a go.

These MDB files look like databases to me - possibly the solution is to set the nocow flag on the directory. Databases don't play well with copy-on-write, on any OS.

sumariva commented 6 months ago

I forget to tell you, Windows Explorer also freeze while waiting I/O. The target system is a Windows 10 x64, on an Intel I3 1.8ghz, notebook, on A/C adapter, adicted/faulty battery. The machine has been rebooted to load 1.9.0.

btrfswin-slow-nocow Activated nocow, stopped and restarted monerod, since I do not know if these nocow has something related with open file descriptors on already running process. But the 25%, I guess a process/thread loaded keeps going on.

I will be running this machine all days, until April 15th. So If you need an experimental rat, I can manage to run versions on this machine before this date. I will be working on it all those days. I have a Linux pen drive, In case need to check btrfs.

I can figure that this is a problem related to random file access. Because, moving the 65GiB from NTFS to BRTFS, was fast. It experimented variable transfer rate, but I guess it is related to memory buffers from OS been filled, so the average transfer rate falls until SSD complete writes and flush those buffers out, when then OS keep pushing more data and up ticking the rate again.