kimono-koans / httm

Interactive, file-level Time Machine-like tool for ZFS/btrfs/nilfs2 (and even Time Machine and Restic backups!)
https://crates.io/crates/httm
Mozilla Public License 2.0
1.33k stars · 28 forks

httm -d -R ~ High CPU Usage #58

Closed vudutech closed 1 year ago

vudutech commented 1 year ago

Ubuntu 22.04 - AMD Ryzen 9 5950X 32GB RAM

When running "httm -d -R ~" CPU spikes and PC becomes unresponsive.

installed from httm_0.17.0_amd64.deb

kimono-koans commented 1 year ago

Appreciate your bug report.

PC becomes unresponsive.

Just for comparison, I have an Atom 4-core NAS with an SSD ZFS home directory, and my %CPU in top can spike to ~390% if it's an uncached first run, but mostly holds steady around 250%. I've never hung that system, and never hung a terminal during a deleted recursive run, but because you have so many more, faster cores, maybe you're just getting into trouble faster?

Happy to assist you in diagnosing this. But if you want some help, would you also upgrade to 0.17.5? Perhaps try the new PPA. Thanks. Maybe it's been hammered out already? I was doing some work on the recursive model at the time. Could be some weird interaction there?

vudutech commented 1 year ago

You're welcome. Cool project.

Bit surprised myself at the number of symlinks!

Have attached strace results, head -n 100 and tail -n 100. The process again stalled after 1 or 2 minutes on a PDF motherboard manual. Notably, only one thread was at 100% this time, as shown in htop, and the system remained largely usable.

Also note that I had an issue with Bluetooth not functioning today, coupled with, or caused by, a power outage overnight, which saw me zfs rollback (excluding the /home dir) 24 hours to regain a functioning system.

Will look to upgrade in the coming day or two and test again. I've really just been using your provided examples to get a feel for how httm works.

Hope this helps and thanks for your efforts. Let me know if I can assist further.

output-redacted.txt

vudutech commented 1 year ago

Some more strace detail. Original output.txt was >13GB

output-redacted-plus.txt

kimono-koans commented 1 year ago

From what you're telling me, such high CPU usage is pretty normal. httm will gladly use lots of threads, because deleted searches are IO- and CPU-intensive.

It still amazes me that it could stop such a powerful system in its tracks, though, but I'm guessing all those cores have something to do with it. I'd take three shots in the dark:

  1. The threading library sees 32 threads available, uses all of them, and starts 32 concurrent statx calls (or whatever), and your drive gets a little backed up; although your CPU is being "utilized", it's actually stalled in iowait? httm does all sync IO. Maybe async is required for such a fast machine?
  2. Crashing a Chrome thread sounds/feels like an OOM situation caused by something pathological. For instance, my Time Machine directory is pathological. A search will grab 2GB of memory and seem to hang because there is one directory with 14,713 files. You have directories with many, many more files. Imagine searching 10 or 20 such directories concurrently. Could still have been a coincidence though?
  3. This may also be an algorithmic-complexity thing: using a BTreeMap makes sense most of the time, and is better memory-wise most of the time, but kills us when we hit directories with large numbers of files, and we should use a HashMap (at least for directories with large numbers of files). A dir with >5,000 files may just be pathological and perhaps should be ignored by default? Filesystems have problems enumerating such dirs.
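If that third hypothesis is in play, one quick way to hunt for candidate pathological directories is to count entries per directory. A minimal shell sketch (it builds a throwaway demo tree for illustration; point the find at "$HOME" for a real scan, and the 5,000-entry threshold is just the number floated above):

```shell
# Count entries per directory to spot oversized (>5,000-entry) dirs.
# Demo tree stands in for a real home directory scan.
demo=$(mktemp -d)
mkdir "$demo/big" "$demo/small"
for i in $(seq 1 500); do : > "$demo/big/f$i"; done
: > "$demo/small/f1"

# Print "<entry count> <dir>" for each directory, largest first.
find "$demo" -type d | while read -r d; do
  printf '%s %s\n' "$(ls -A "$d" | wc -l)" "$d"
done | sort -rn | head -n 3

rm -rf "$demo"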

I'm betting it's the confluence of all 3.

Some things to play with to get more info:

Check your %iowait in another terminal window while running httm, with iostat -m 1 100.

And play around with the number of threads you use to see what works best? You can set the number of threads used with an environment variable: export RAYON_NUM_THREADS=8
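Putting those two suggestions together, a hypothetical test session might look like this (8 threads is just a starting point, not a recommendation):

```shell
# Cap rayon's worker pool before launching httm, then re-run the test.
export RAYON_NUM_THREADS=8
echo "rayon capped at $RAYON_NUM_THREADS threads"

# Meanwhile, in a second terminal, sample disk/CPU once a second:
#   iostat -m 1 100
# High %iowait alongside high %CPU suggests threads stalled on disk
# rather than doing useful work.
```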

vudutech commented 1 year ago

Sounds like a reasonable assessment.

Upgraded to the PPA version. Similar result. iostat and htop info attached. iostat.txt httm-htop

vudutech commented 1 year ago

Somewhat better with "export RAYON_NUM_THREADS=8", but still sluggish. htop shows CPU slightly lower, with the load jumping around between threads.

kimono-koans commented 1 year ago

This PR, https://github.com/kimono-koans/httm/pull/59, should be a ~7x speedup for what I believe to be your issue. Here, we use snapshot birth times instead of file modify times as the policy for which deleted version to use for behind-deleted-dir enumeration. Should land in a release soon.

Probably not much more low-hanging performance fruit here.

I'm going to close unless you have something more particular. Again, thanks for filing, and feel free to refile if the next release doesn't seem to fix it and you have more data re: possible causes!

EDIT: Spoke too soon. https://github.com/kimono-koans/httm/commit/fc8419464f633826c357b5eda2f6cff4aa2abecb is ~60x faster on my system.

vudutech commented 1 year ago

Glad I could help and thanks for your diligence.

Installed the latest update. Big improvement!

httm -d -R ~ keeps chugging along, mostly. Some slowdowns when processing random files/dirs. Browser cache files and ".cache/plasma_theme_kubuntu_v20.04.3.kcache" often seem to be a problem.

All 32 threads seem to spike briefly, then sit around 30-50%. The machine remains mostly responsive. YouTube paused at one point for around 10 seconds and then kept running.

Ctrl-C'd and restarted. Progressed further, but occasionally stalls on random files, some only a few KiB.

Little dots progress indicator stalls sometimes or is replaced with a solid block cursor. Walked away for a coffee (20 mins) and the process completed.

kimono-koans commented 1 year ago

Glad it works better.

Little dots progress indicator stalls sometimes or is replaced with a solid block cursor. Walked away for a coffee (20 mins) and the process completed.

This is normal. If no new directories are being processed, the progress indicator will stop: the indicator ticks as each directory is completed, and when a directory isn't completed it doesn't tick. So no tick could indicate a deadlock/stall, but in my experience it means, oops, this directory is super strange.
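To make those ticker semantics concrete, here is a toy stand-in (not httm's actual code): one dot prints per completed directory, so a long gap between dots means the current directory is slow, not that the walk is stuck:

```shell
# Toy per-directory ticker over a throwaway demo tree.
demo=$(mktemp -d)
mkdir "$demo/a" "$demo/b" "$demo/c"

# One dot per directory finished; a pause = one slow directory.
find "$demo" -type d | while read -r d; do
  ls -A "$d" > /dev/null   # stand-in for the real per-directory work
  printf '.'
done
printf '\n'

rm -rf "$demo"
```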

But I can't reproduce what is likely a pathological directory unless I know more about what makes that directory/file pathological. I've tested directories with 500,000 files: created a few snapshots, then ran httm, and it completed within a few seconds. I regularly run on directories with 20,000 deleted files.

If it is pausing on a file/directory, what is special about that file or that directory? Like:

Browser cache files and ".cache/plasma_theme_kubuntu_v20.04.3.kcache" often seem to be a problem.

Two things:

As it stands, this feels more like an unintuitive ticker than broken behavior or a real performance problem to me. Maybe the solution is "find another trigger to make the indicator tick".

httm is a system punisher, because there is lots of work to do. That httm uses all 32 threads just as easily as 4 threads is awesome. Is there additional performance to be gained? As we can see, sure, possibly. The question is: at what cost? Does it matter in the 95% case?

But, in order to answer those questions, I'm going to need more info about any problematic directories. So, if you wish to take this further, you'll need to ride shotgun. I'd suggest you take a week and keep playing with httm. Find me something I can reproduce consistently and I'll take another look.