sbopkg / sbopkg

Slackbuilds.org Package Browser
https://sbopkg.org/
Other
122 stars 45 forks source link

Speed up the 'View READMEs' feature by eliminating unnecessary `find` ops #73

Closed spillner closed 2 years ago

spillner commented 2 years ago

I noticed that performing "View READMEs" on a queue took a surprising amount of time, checked the source and saw why. I think this method works just as well, and completes instantaneously.

willysr commented 2 years ago

can you remove the leftovers .un~ files?

spillner commented 2 years ago

Done! Sorry, didn't even notice those crept into the commit.

willysr commented 2 years ago

which package did you find it slow? i tested both with and without this patch and README is shown instantly

spillner commented 2 years ago

I built a queue (in my case via the Update tool, but it would be the same manually), went into View Queue, and selected "View READMEs". The original code ran two finds down to -maxdepth 3 across the full repository, which took probably at least 15 seconds. I guess that's not outrageous, but it's long enough to be noticeable, and totally unnecessary since we're only interested in a select few of the files, and can find them with a glob on $REPO_DIR/*/$NAME/README. The old code first built a list of all the README and .info files in the repo, and then grepd for "/$NAME/". The for loops guard against the same package name appearing in more than one category (shouldn't happen, but who knows), in which case it will include the contents of all matching files, just as the old grep system did. The WARNINGs if a README or .info is missing (again, shouldn't happen, but it's not hard to handle it gracefully) are new but seemed appropriate. Other than the warnings (which should never be triggered in a well-formed repository), the displayed output should be identical between the old and new versions of the code.

willysr commented 2 years ago

ok, which package did you try to build? i think if you run it for the first time, it would take some time to process, but once it's cached, it will be fast

spillner commented 2 years ago

I started out on the sbopkg-0.38.1-noarch package from slackware64-current, and that one was slow for me. You're right that after the first view_readmes() operation the local repository will be in the filesystem cache, and subsequent calls will be faster for a while. However, eventually the repo directory will be evicted from the cache (maybe even by a long build process) and view_readmes() will slow down again because it regenerates the temporary index files every time it is called. (Note also that my wildcard expansion approach benefits from the directories being cached just as much as the find does.)

One alternative would be to not regenerate the READMES/INFOS files if they already exist, and just delete them when the repo is re-synced. That might have been the original intent, since they're not explicitly deleted at the end of view_readmes(). That could be the highest-performance option for users who invoke view_readmes() a lot, but creates some hassle for those who manually tweak their repositories (e.g. adding their own local SlackBuilds) between repo_sync() operations. I don't think it's worth the tradeoff, but making the index files persistent seems like the only way to justify building a full index of README/.info files for every package when only a few at a time are actually being sought.

willysr commented 2 years ago

0.38.1 is very old compared to master can you test with this one? https://sbopkg.org/test/sbopkg-0.38.2-noarch-1_wsr.tgz you need to change the sbopkg.conf to master instead of 15.0 since the repo doesn't exist yet

spillner commented 2 years ago

Just tested; you're right that it has a lot to do with the filesystem cache. I ran sync && echo 3 > /proc/sys/vm/drop_caches before each test to try to ensure the directory cache was cold, but I still didn't get perfectly repeatable results.

My very first test of the sequence was with your 0.38.2, where I measured 39 seconds to run the first "View READMEs" on the list of all installed SBos (399 on my system), and about 1 second for each subsequent run. With the cache still warm, I also manually built a queue of ten packages, and "View READMEs" on the queue also took about one second.

I then flushed the cache and tested my version and I got 9 seconds to run the first "View READMEs" on the list of all installed SBos, and 1-2 seconds for subsequent runs. For the manual queue of ten packages, it completes instantaneously.

I flushed the cache again and re-tested your 0.38.2 several times, measuring 4.5-6.0 seconds for the initial "View READMEs" on the 399 installed SBOs, and 1.0-2.0 seconds for subsequent attempts. I also tested running "View READMEs" on the manual ten-package queue first and got 3.5 seconds for the first operation, and about 1.0 for subsequent.

I flushed again and re-tested my version several times, measuring 3.8-5.5 seconds for initial "View READMEs" on 399 packages, 0.8-1.5 seconds for subsequent attempts, and 2.8 seconds when running the manual ten-package queue first, with all subsequent attempts essentially instantaneous (too fast to measure, even from video of the screen).

Anyway, overall performance difference on a warm cache is minimal, and can favor the find technique when checking for hundreds of packages at once, although there is a clear advantage to the wildcard expansion for smaller queues. Start-up times were also usually minimal, except for the one 39-second outlier on my very first test, that probably mirrors whatever I saw a few days ago. It could be that my echo > drop_caches isn't flushing everything I ought to, or that some other background process interfered with that test. When I get a good time for some reboots I'll try testing on a freshly-booted system to see whether the long initial times are consistently reproducible.

willysr commented 2 years ago

Thanks for the thorough investigations i will hold this PR until further results coming in since we know that the slow results is due to not being cached for initial run

spillner commented 2 years ago

It looks like the find operations aren't the problem. I rebooted several times and observed the first "View README" on the list of 399 installed packages taking anywhere between 5 and 45 seconds with either method (*/README glob or find). I probably should have mentioned earlier that I'm using ZFS, which I think is probably responsible for the long initial search times and also the fact that writing to /proc/sys/vm/drop_caches doesn't necessarily reset the timings. Neither zpool status nor dmesg show any indications of hardware faults, so I think ZFS is just not very efficient for directory searches in some cases, perhaps when subdirectory changes were widely spaced in time and live in different zpool blocks, and sbopkg is just where I happen to be seeing the symptoms.

Anyway, I no longer believe that there's a compelling performance case for this PR--- regret the noise.