UnixJunkie opened 6 years ago
Nice! This would be for the https backend, I guess?
Well, the point is to compare any two repositories quickly. For example, if $HOME/old/opam-repository.merkle and $HOME/new/opam-repository.merkle are the same, then there is no need to update the repository.
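The idea can be approximated in plain shell (a hypothetical illustration, not how opmer is implemented: the paths and the `fingerprint` helper are made up, and a real Merkle tree additionally lets you locate *which* subtrees differ):

```shell
#!/bin/sh
# Hypothetical sketch: fingerprint a repository tree by hashing the
# sorted list of per-file content hashes. Equal fingerprints mean the
# trees are identical, so no update is needed.
fingerprint() {
  (cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum) \
    | sha256sum | cut -d' ' -f1
}

# Demo with two tiny fake checkouts (stand-ins for real opam repositories).
old=$(mktemp -d); new=$(mktemp -d)
mkdir -p "$old/packages/foo" "$new/packages/foo"
echo 'version: "1.0"' > "$old/packages/foo/opam"
echo 'version: "1.0"' > "$new/packages/foo/opam"

if [ "$(fingerprint "$old")" = "$(fingerprint "$new")" ]; then
  echo "repositories identical: skip update"
else
  echo "repositories differ: update needed"
fi
```

Here the two trees have identical contents, so it prints "repositories identical: skip update".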
Ok, some context to explain my question: currently, update proceeds this way:
① fetch new version → ② generate diff → ③ validate the diff (for signed repos) → ④ apply the diff
When using git, it should already make ① efficient, and we use it to obtain ② for free too. HTTP, which is still the default, is on the other hand quite inefficient, so your work would be very helpful there. Opam 1 first downloaded an index file to check for changes, but I found this quite complex with no actual gain, so opam now just downloads a full tar.gz of the repo.
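The git case can be seen with a self-contained toy (the paths and repository contents below are made up for illustration): an incremental `git fetch` covers ①, and the diff for ② falls out of git itself.

```shell
#!/bin/sh
# Toy demo: an upstream repo gains a file; the clone fetches only the
# delta (step 1) and gets the diff essentially for free (step 2).
upstream=$(mktemp -d); work=$(mktemp -d)

git init -q "$upstream"
git -C "$upstream" -c user.email=a@b -c user.name=a \
    commit -q --allow-empty -m init
git clone -q "$upstream" "$work/repo"

echo 'version: "1.0"' > "$upstream/foo.opam"     # upstream change
git -C "$upstream" add foo.opam
git -C "$upstream" -c user.email=a@b -c user.name=a commit -qm "add foo"

git -C "$work/repo" fetch -q origin              # step 1: efficient fetch
git -C "$work/repo" diff --name-status HEAD..FETCH_HEAD   # step 2: the diff
```

The final command lists `foo.opam` as added, without ever re-downloading the whole repository.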
Another operation that is very time-consuming (at least on my machine, which is not SSD-based) is building the index of packages, which means reading the whole repository tree. This is done only after `opam update`, and then cached to a marshalled file (`.opam/repos/state.cache`); it may be possible to avoid a full rescan there.
I'd like to ping this issue. `opam update` is an extremely slow operation for me, and has always been so.
@aantron - using 2.0.8 or 2.1.0~beta4?
I'm no longer working on this prototype, though I think the idea is pretty nice for accelerating diffing against a remote repository.
@dra27 Trying just now, opam 2.0.8 took 3 minutes 48 seconds to do `update`, and opam 2.1.0~beta4 took 3 minutes 1 second. Not sure how much of the difference is due to changes and how much due to noise, as I did not rerun.
How many opam remotes / pins do you have?
I ran the commands with `--root`, after a fresh `init` of that root, so, I think, one remote and zero pins.
EDIT: each command with a separate root.
In case it can still interfere, I have one remote in the default `.opam`. The active switch in `.opam` at the time had no pins, and there was no local switch.
Could you post the log of `opam update --debug -vvv` somewhere and maybe tell which command takes the longest?
Just now, I ran

```
rm -rf ./opam21
fish -c "date; and ./opam-2.1.0-beta4-x86_64-linux init -na --root ./opam21 --disable-sandboxing --bare --debug -vvv; and date" > log-init 2>&1
```

which took 1 minute 58 seconds and produced this file: `log-init`.
Then I ran

```
fish -c "date; and ./opam-2.1.0-beta4-x86_64-linux update --root ./opam21 --debug -vvv; and date" > log-update 2>&1
```

which took 2 minutes 46 seconds and produced this file: `log-update`. It looks like `/usr/bin/diff` took the most time.
My system is

```
Linux MSI 4.4.0-18362-Microsoft #1049-Microsoft Thu Aug 14 12:01:00 PST 2020 x86_64 x86_64 x86_64 GNU/Linux
```
It is WSL 1. The `opam21` directory and its parent, my home directory, are inside the WSL Linux filesystem (not outside, on NTFS). The Linux filesystem under WSL, as I understand it, has native performance.
The hardware is a 2020 model ultrabook (MSI GS66) with a fast SSD, etc.
On the whole, this is consistent with my experience of `opam update` being very slow (taking minutes) across multiple machines and OSs. I can't rule out that, for example, recent performance improvements are being masked by my switching from macOS to Windows+WSL at around the same time, but from my uneducated point of view as a user not familiar with the details, this is not a new issue. It is also slow in Linux VMs I run, though that might, again, be a problem of the VMs.
EDIT: and I should add, across all versions of opam I have used.
Hm, a while ago we implemented an optimisation where we store the repositories as tar archives and decompress them to /tmp before using them; it is normally much faster because we only have to read one sequential 4 MB file instead of scanning 10000 small files across 3 layers of directories.
That is, of course, assuming that the OS is doing caching, and that the short lifetime of the directory below /tmp means it can be read entirely from RAM. It seems that's the part that doesn't hold in your case, so the optimisation doesn't work (and even makes things slightly worse). From a very quick search, this is what I found about WSL1 (https://news.ycombinator.com/item?id=25154300):
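The round-trip described above can be sketched like this (hypothetical paths and layout; the real structure of opam's repository archives may differ):

```shell
#!/bin/sh
# Sketch of the optimisation: keep the repository as one tar.gz (a single
# sequential read), and unpack it under /tmp only when it must be scanned.
repo=$(mktemp -d)
mkdir -p "$repo/packages/foo.1.0"
echo 'version: "1.0"' > "$repo/packages/foo.1.0/opam"

archive="$repo.tar.gz"
tar -czf "$archive" -C "$repo" packages     # store: one sequential file

scratch=$(mktemp -d -p /tmp)
tar -xzf "$archive" -C "$scratch"           # use: unpack to /tmp...
ls -R "$scratch" > /dev/null                # ...then scan as usual
```

The win depends entirely on the unpacked `/tmp` copy living in the page cache, which is the assumption that apparently fails on WSL 1.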
> Example: Linux local filesystem performance mostly derives from the dentry cache. That cache keeps a parsed representation of whatever the filesystem knows appears on disk. The dentry cache is crucial to pretty much any filesystem system call that does not already involve an open file, and IIRC many that also do. Problem is, that same cache in WSL must be subverted because Linux is not the only thing that can mutate the NTFS filesystem - any Windows program could as well. This one fundamentally unfixable problem alone is probably 80% the reason WSL1 IO perf sucked - because the design absolutely required it.
> Problem is, that same cache in WSL must be subverted because Linux is not the only thing that can mutate the NTFS filesystem
This makes me unsure what the comment is referring to. There are two filesystems visible from WSL 1 — some kind of Linux filesystem that is not (at least, originally was not) visible from Windows, and NTFS, which you can access under some slightly awkward paths. It's not clear from the comment whether the person is referring to the latter (NTFS), or still referring to the former.
At least as of two years ago, there was a huge difference in performance between WSL 1's Linux filesystem and WSL accessing NTFS. If the commenter is indeed referring to the latter, then the comment is irrelevant for this issue: neither my home directory nor `/tmp` is in NTFS in this sense.
The underlying storage for the Linux filesystem is undoubtedly somewhere in NTFS ultimately, but there at least was a clear difference between how WSL treated those files and the files treated as directly in NTFS. I assume that WSL made assumptions about the files it considered as being part of the Linux filesystem that allowed this massive relative increase in performance — again, it's not clear if the comment is referring to these specific assumptions. Is the comment claiming that even that increased performance was poor, and making technical statements about it, or commenting on poor performance when accessing NTFS as NTFS?
So, I'm really not sure what is being said in this comment.
Since then, WSL files have become visible from Windows under some obscure paths, along with a network mount. The performance of both filesystems has increased, especially massively so for accessing the "real" NTFS.
The impression I get from many of the commenters in that thread is that they tried WSL 1 back when it was slow for their use case, switched away from it, and retain their impression of WSL from that time. Likewise, the technical claims may be out of date, since there has clearly been a lot of optimization over the years. Some commenters say they only tried again with WSL 2; since that came out around last summer, their last experience with WSL 1 would have been well before then. It seems a lot of the thread consists of people responding to these stale impressions.
In summary, I don't think much can be learned from the thread without interacting with it and asking for clarification.
Nowadays, I routinely work on WSL 1, in NTFS, using both opam and npm, and I have no complaints about the performance on those workloads.
I agree, that's the first link I found, but a random comment on HN is not a reliable source of information ^^
Whatever the OS, etc., whenever I have tried to optimise this, the major cost came from scanning a large number of files across a hierarchy, an area where filesystems themselves show big performance differences (ext4 is not very good at it). Here are some quick benchmarks on what I have at hand, for scanning a full repo (`ls -R`) after dropping the caches:
- very old HDD (SATA 2.5), NTFS: 53s
- old HDD (SATA 3.1), ext4: 47s
- SSD (SATA 3.1), ext4: 2.7s
- SSD (NVMe PCIe4), NTFS/fuseblk: 1.8s
- SSD (NVMe PCIe4), f2fs: 0.76s
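For anyone wanting to reproduce numbers like these, a cold-cache scan can be approximated as follows (a sketch: the `REPO` variable is a placeholder, and dropping the kernel caches is Linux-specific and needs root, so it is skipped otherwise):

```shell
#!/bin/sh
# Cold-cache recursive scan of a repository checkout.
repo=${REPO:-.}                       # placeholder: an opam-repository path
if [ "$(id -u)" -eq 0 ]; then
  sync
  echo 3 > /proc/sys/vm/drop_caches   # drop page + dentry caches (root only)
fi
time ls -R "$repo" > /dev/null
```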
Basically, with an SSD, the optimisation that uses `tar` still helps but doesn't make that big a difference anymore (reading a 4 MB file in any of the cases above is negligible, and un-tarring in RAM before scanning the files should be negligible as well).
In any case, is it possible on WSL1 to try setting /tmp as an in-RAM fs? That would be worth trying...
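For reference, the usual Linux incantation would be the following (whether WSL 1 actually backs tmpfs with RAM is exactly the open question here; the size is an arbitrary example):

```
# one-off, as root:
mount -t tmpfs -o size=512m tmpfs /tmp

# or persistently, via /etc/fstab:
tmpfs  /tmp  tmpfs  size=512m  0  0
```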
VolFs (`/`) and DrvFs (`/mnt/`) are indeed on a par these days in WSL 1, and still both rubbish. opam 2.0.5 just took 1:38 to init and 1:30 to update using `~/.opam`, and 1:59 and 1:54 respectively to do the same in `/mnt/c/Users/DRA/.opam`.
WSL 2, on the other hand, takes 11 seconds on the same machine to do init and update in `~/.opam`, but of course suffers hugely with the 9p mount of `/mnt`, taking 7:12 to init and 15:14 to update.
However, this has nothing to do with the diffing method; the simple fact is that there are too many files. There are clear reasons for using WSL 1 vs WSL 2, but for opam you want to be using WSL 2, or (another) VM. The long-term solution will be (optional) integration of ocaml-tar so that the tarballs are never extracted.
There is a small risk that using ocaml-tar might make things slower on Linux machines.
`opam update` is also very slow on my OpenBSD laptop. It typically takes 3-4 minutes.
> Could you post the log of `opam update --debug -vvv` somewhere and maybe tell which command takes the longest?
Here is the debug output from `opam update`. This run took about 3:45 total. The slowest commands were:

- `gtar`: 60s
- `rmdir`: 60s
- `diff`: 45s

Has it already been considered to use .zip instead of .tar.gz? Zip files permit random access, which could avoid much of the `gtar` and `rmdir` overhead.
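For comparison, tar can already stream a single member to stdout, but it has to scan the archive sequentially to find it; a zip's central directory would let a reader seek straight to the member, which is the point being made. A small sketch with made-up paths:

```shell
#!/bin/sh
# Build a tiny archive and read one member without unpacking everything.
work=$(mktemp -d)
mkdir -p "$work/packages/foo.1.0"
echo 'version: "1.0"' > "$work/packages/foo.1.0/opam"
tar -czf "$work/repo.tar.gz" -C "$work" packages

# -O streams the member to stdout; tar still walks the archive to find it.
tar -xzOf "$work/repo.tar.gz" packages/foo.1.0/opam
```

The last command prints `version: "1.0"` without extracting anything to disk.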
Are you running with softupdates on your OpenBSD laptop @mndrix? I find it makes a huge difference to opam performance due to all the small files.
> Are you running with softupdates on your OpenBSD laptop @mndrix?
I wasn't. Thanks for suggesting it. For anyone who finds this later: mounting /tmp with `softdep` improved `opam update` performance by about 30%. Now it takes around 2-3 minutes instead of 3-4.
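For anyone else on OpenBSD, enabling it is a mount option on the FFS partition (the device name below is a made-up example; check your own `mount` output):

```
# /etc/fstab entry for /tmp with soft dependencies enabled:
/dev/sd0e  /tmp  ffs  rw,nodev,nosuid,softdep  1  2
```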
The slowest commands on my final test run after `softdep` were:

- `diff`: 60s
- `gtar`: 20s
- `rmdir`: 10s

From time to time I get the idea that maybe we should slice our main opam repository so it only contains packages not older than two years. It should improve the performance of opam update (I didn't measure it, but I strongly believe so)....
@Kakadu this would be pretty lame if we could only install packages that are less than two years old. An opam update option to consider only packages that appeared after a given date, maybe. But crippling the whole opam-repository would be terrible.
Date is an absurd pruning criterion.
Some software is mostly "done" and stable. Could we dispel the myth that only software with PR churn and an active issue tracker is worth using? Some people do work well :-)
To give an idea, a package like `xmlm` had a release in 2013, another one in 2017 (safe string support), and one in 2022 (because of 5.0 deprecations).
Indeed, there are plenty of things in `opam update` that can be improved without having to redefine the problem (the slowness of the present implementation is extremely painful on Windows, so it's very much "on the radar" to be improved).
Hello, I have a prototype here: https://github.com/UnixJunkie/opmer. Be careful, it is not yet super fast and it might very well contain bugs.
Here is a usage example with two opam-repository checkouts: