UnixJunkie opened 6 years ago
Nice! This would be for the https backend, I guess?
Well, the point is to compare any two repositories quickly. For example, if $HOME/old/opam-repository.merkle and $HOME/new/opam-repository.merkle are the same, then there is no need to update the repository.
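The idea can be approximated in plain shell (a hypothetical illustration, not how opmer is implemented: the paths and the `fingerprint` helper are made up, and a real Merkle tree additionally lets you locate *which* subtrees differ):

```shell
#!/bin/sh
# Hypothetical sketch: fingerprint a repository tree by hashing the
# sorted list of per-file content hashes. Equal fingerprints mean the
# trees are identical, so no update is needed.
fingerprint() {
  (cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum) \
    | sha256sum | cut -d' ' -f1
}

# Demo with two tiny fake checkouts (stand-ins for real opam repositories).
old=$(mktemp -d); new=$(mktemp -d)
mkdir -p "$old/packages/foo" "$new/packages/foo"
echo 'version: "1.0"' > "$old/packages/foo/opam"
echo 'version: "1.0"' > "$new/packages/foo/opam"

if [ "$(fingerprint "$old")" = "$(fingerprint "$new")" ]; then
  echo "repositories identical: skip update"
else
  echo "repositories differ: update needed"
fi
```

Here the two trees have identical contents, so it prints "repositories identical: skip update".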
Ok, some context to explain my question: currently, update proceeds this way:
① fetch new version → ② generate diff → ③ validate the diff (for signed repos) → ④ apply the diff
When using git, it should already make ① efficient, and we use it to obtain ② for free too. HTTP, which is still the default, is on the other hand quite inefficient, so your work would be very helpful there. Opam 1 first downloaded an index file to check for changes, but I found this quite complex with no actual gain, so opam now just downloads a full tar.gz of the repo.
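The git case can be seen with a self-contained toy (the paths and repository contents below are made up for illustration): an incremental `git fetch` covers ①, and the diff for ② falls out of git itself.

```shell
#!/bin/sh
# Toy demo: an upstream repo gains a file; the clone fetches only the
# delta (step 1) and gets the diff essentially for free (step 2).
upstream=$(mktemp -d); work=$(mktemp -d)

git init -q "$upstream"
git -C "$upstream" -c user.email=a@b -c user.name=a \
    commit -q --allow-empty -m init
git clone -q "$upstream" "$work/repo"

echo 'version: "1.0"' > "$upstream/foo.opam"     # upstream change
git -C "$upstream" add foo.opam
git -C "$upstream" -c user.email=a@b -c user.name=a commit -qm "add foo"

git -C "$work/repo" fetch -q origin              # step 1: efficient fetch
git -C "$work/repo" diff --name-status HEAD..FETCH_HEAD   # step 2: the diff
```

The final command lists `foo.opam` as added, without ever re-downloading the whole repository.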
Another operation that is very time-consuming (at least on my machine, which is not SSD-based) is building the index of packages, which means reading the whole repository tree. This is done only after `opam update`, and then cached to a marshalled file (`.opam/repos/state.cache`); it may be possible to avoid a full rescan there.
I'd like to ping this issue. `opam update` is an extremely slow operation for me, and has always been so.
@aantron - using 2.0.8 or 2.1.0~beta4?
I'm no longer working on this prototype, though I think the idea is pretty nice for accelerating diffing against a remote repository.
@dra27 Trying just now, opam 2.0.8 took 3 minutes 48 seconds to do `update`, and opam 2.1.0~beta4 took 3 minutes 1 second. Not sure how much of the difference is due to changes and how much due to noise, as I did not rerun.
How many opam remotes / pins do you have?
I ran the commands with `--root`, after a fresh `init` of that root, so, I think, one remote and zero pins.
EDIT: each command with a separate root.
In case it can still interfere, I have one remote in the default `.opam`. The active switch in `.opam` at the time had no pins, and there was no local switch.
Could you post the log of `opam update --debug -vvv` somewhere and maybe tell which command takes the longest?
Just now, I ran

```
rm -rf ./opam21
fish -c "date; and ./opam-2.1.0-beta4-x86_64-linux init -na --root ./opam21 --disable-sandboxing --bare --debug -vvv; and date" > log-init 2>&1
```

which took 1 minute 58 seconds and produced this file: `log-init`.
Then I ran

```
fish -c "date; and ./opam-2.1.0-beta4-x86_64-linux update --root ./opam21 --debug -vvv; and date" > log-update 2>&1
```

which took 2 minutes 46 seconds and produced this file: `log-update`. It looks like `/usr/bin/diff` took the most time.
My system is

```
Linux MSI 4.4.0-18362-Microsoft #1049-Microsoft Thu Aug 14 12:01:00 PST 2020 x86_64 x86_64 x86_64 GNU/Linux
```
It is WSL 1. The `opam21` directory and its parent, my home directory, are inside the WSL Linux filesystem (not outside, on NTFS). The Linux filesystem under WSL, as I understand it, has native performance.
The hardware is a 2020 model ultrabook (MSI GS66) with a fast SSD, etc.
On the whole, this is consistent with my experience of `opam update` being very slow (taking minutes) across multiple machines and OSs. I can't rule out that, for example, recent performance improvements are being masked by my switching from macOS to Windows+WSL at around the same time, but from my uneducated point of view as a user not familiar with the details, this is not a new issue. It is also slow in Linux VMs I run, though that might, again, be a problem of the VMs.
EDIT: and I should add, across all versions of opam I have used.
Hm, a while ago we implemented an optimisation where we store the repositories as tar archives and decompress them to /tmp before using them; it is normally much faster because we only have to read one sequential 4 MB file instead of scanning 10000 small files across 3 layers of directories.
That is, of course, assuming that the OS is doing caching, and that the short lifetime of the directory below /tmp means it can be read entirely from RAM. It seems that's the part that doesn't hold in your case, so the optimisation doesn't work (and even makes things slightly worse). From a very quick search, this is what I found about WSL1 (https://news.ycombinator.com/item?id=25154300):
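The round-trip described above can be sketched like this (hypothetical paths and layout; the real structure of opam's repository archives may differ):

```shell
#!/bin/sh
# Sketch of the optimisation: keep the repository as one tar.gz (a single
# sequential read), and unpack it under /tmp only when it must be scanned.
repo=$(mktemp -d)
mkdir -p "$repo/packages/foo.1.0"
echo 'version: "1.0"' > "$repo/packages/foo.1.0/opam"

archive="$repo.tar.gz"
tar -czf "$archive" -C "$repo" packages     # store: one sequential file

scratch=$(mktemp -d -p /tmp)
tar -xzf "$archive" -C "$scratch"           # use: unpack to /tmp...
ls -R "$scratch" > /dev/null                # ...then scan as usual
```

The win depends entirely on the unpacked `/tmp` copy living in the page cache, which is the assumption that apparently fails on WSL 1.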
> Example: Linux local filesystem performance mostly derives from the dentry cache. That cache keeps a parsed representation of whatever the filesystem knows appears on disk. The dentry cache is crucial to pretty much any filesystem system call that does not already involve an open file, and IIRC many that also do. Problem is, that same cache in WSL must be subverted because Linux is not the only thing that can mutate the NTFS filesystem - any Windows program could as well. This one fundamentally unfixable problem alone is probably 80% the reason WSL1 IO perf sucked - because the design absolutely required it.
> Problem is, that same cache in WSL must be subverted because Linux is not the only thing that can mutate the NTFS filesystem
This makes me unsure what the comment is referring to. There are two filesystems visible from WSL 1 — some kind of Linux filesystem that is not (at least, originally was not) visible from Windows, and NTFS, which you can access under some slightly awkward paths. It's not clear from the comment whether the person is referring to the latter (NTFS), or still referring to the former.
At least as of two years ago, there was a huge difference in performance between WSL 1's Linux filesystem and WSL accessing NTFS. If the commenter is indeed referring to the latter, then the comment is irrelevant for this issue: neither my home directory nor `/tmp` is in NTFS in this sense.
The underlying storage for the Linux filesystem is undoubtedly somewhere in NTFS ultimately, but there at least was a clear difference between how WSL treated those files and the files treated as directly in NTFS. I assume that WSL made assumptions about the files it considered as being part of the Linux filesystem that allowed this massive relative increase in performance — again, it's not clear if the comment is referring to these specific assumptions. Is the comment claiming that even that increased performance was poor, and making technical statements about it, or commenting on poor performance when accessing NTFS as NTFS?
So, I'm really not sure what is being said in this comment.
Since then, WSL files have become visible from Windows under some obscure paths, along with a network mount. The performance of both filesystems has increased, especially massively so for accessing the "real" NTFS.
The impression I get from many of the commenters in that thread is that they tried WSL 1 back when it was slow for their use case, switched away from it, and retain their impression of WSL from that time. Likewise, the technical claims may be out of date, since there has clearly been a lot of optimization over the years. Some commenters say they only tried again with WSL 2; since that came out around last summer, their last experience with WSL 1 would have been well before then. It seems a lot of the thread consists of people responding to these stale impressions.
In summary, I don't think much can be learned from the thread without interacting with it and asking for clarification.
Nowadays, I routinely work on WSL 1, in NTFS, using both opam and npm, and I have no complaints about the performance on those workloads.
I agree, that's the first link I found, but a random comment on HN is not a reliable source of information ^^
Whatever the OS, etc., whenever I have tried to optimise this, the major cost came from scanning a large number of files across a hierarchy, an area where filesystems themselves show big performance differences (ext4 is not very good at it). Here are some quick benchmarks on what I have at hand, for scanning a full repo (`ls -R`) after dropping the caches:
- very old HDD (SATA 2.5), NTFS: 53s
- old HDD (SATA 3.1), ext4: 47s
- SSD (SATA 3.1), ext4: 2.7s
- SSD (NVMe PCIe4), NTFS/fuseblk: 1.8s
- SSD (NVMe PCIe4), f2fs: 0.76s
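For anyone wanting to reproduce numbers like these, a cold-cache scan can be approximated as follows (a sketch: the `REPO` variable is a placeholder, and dropping the kernel caches is Linux-specific and needs root, so it is skipped otherwise):

```shell
#!/bin/sh
# Cold-cache recursive scan of a repository checkout.
repo=${REPO:-.}                       # placeholder: an opam-repository path
if [ "$(id -u)" -eq 0 ]; then
  sync
  echo 3 > /proc/sys/vm/drop_caches   # drop page + dentry caches (root only)
fi
time ls -R "$repo" > /dev/null
```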
Basically, with an SSD, the optimisation that uses `tar` still helps but doesn't make that big a difference anymore (reading a 4 MB file in any of the cases above is negligible, and un-tarring in RAM before scanning the files should be negligible as well).
In any case, is it possible on WSL1 to try setting /tmp as an in-RAM fs? That would be worth trying...
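For reference, the usual Linux incantation would be the following (whether WSL 1 actually backs tmpfs with RAM is exactly the open question here; the size is an arbitrary example):

```
# one-off, as root:
mount -t tmpfs -o size=512m tmpfs /tmp

# or persistently, via /etc/fstab:
tmpfs  /tmp  tmpfs  size=512m  0  0
```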
VolFs (`/`) and DrvFs (`/mnt/`) are indeed on a par these days in WSL 1, and still both rubbish. opam 2.0.5 just took 1:38 to init and 1:30 to update using `~/.opam`, and 1:59 and 1:54 respectively to do the same in `/mnt/c/Users/DRA/.opam`.
WSL 2, on the other hand, takes 11 seconds on the same machine to do init and update in `~/.opam`, but of course suffers hugely with the 9p mount of `/mnt`, taking 7:12 to init and 15:14 to update.
However, this has nothing to do with the diffing method; the simple fact is that there are too many files. There are clear reasons for using WSL 1 vs WSL 2, but for opam you want to be using WSL 2, or (another) VM. The long-term solution will be (optional) integration of ocaml-tar so that the tarballs are never extracted.
There is a small risk that using ocaml-tar might make things slower on Linux machines.
`opam update` is also very slow on my OpenBSD laptop. It typically takes 3-4 minutes.
> Could you post the log of `opam update --debug -vvv` somewhere and maybe tell which command takes the longest?
Here is the debug output from `opam update`. This run took about 3:45 total. The slowest commands were:

- `gtar`: 60s
- `rmdir`: 60s
- `diff`: 45s

Has it already been considered to use .zip instead of .tar.gz? Zip files permit random access, which could avoid much of the `gtar` and `rmdir` overhead.
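For comparison, tar can already stream a single member to stdout, but it has to scan the archive sequentially to find it; a zip's central directory would let a reader seek straight to the member, which is the point being made. A small sketch with made-up paths:

```shell
#!/bin/sh
# Build a tiny archive and read one member without unpacking everything.
work=$(mktemp -d)
mkdir -p "$work/packages/foo.1.0"
echo 'version: "1.0"' > "$work/packages/foo.1.0/opam"
tar -czf "$work/repo.tar.gz" -C "$work" packages

# -O streams the member to stdout; tar still walks the archive to find it.
tar -xzOf "$work/repo.tar.gz" packages/foo.1.0/opam
```

The last command prints `version: "1.0"` without extracting anything to disk.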
Are you running with softupdates on your OpenBSD laptop @mndrix? I find it makes a huge difference to opam performance due to all the small files.
> Are you running with softupdates on your OpenBSD laptop @mndrix?
I wasn't. Thanks for suggesting it. For anyone who finds this later: mounting /tmp with `softdep` improved `opam update` performance by about 30%. Now it takes around 2-3 minutes instead of 3-4.
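For anyone else on OpenBSD, enabling it is a mount option on the FFS partition (the device name below is a made-up example; check your own `mount` output):

```
# /etc/fstab entry for /tmp with soft dependencies enabled:
/dev/sd0e  /tmp  ffs  rw,nodev,nosuid,softdep  1  2
```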
The slowest commands on my final test run after `softdep` were:

- `diff`: 60s
- `gtar`: 20s
- `rmdir`: 10s

From time to time I get the idea that maybe we should slice our main opam repository so it only contains packages not older than two years. It should improve the performance of opam update (I didn't measure it, but I strongly believe so)....
@Kakadu this would be pretty lame if we could only install packages that are less than two years old. An opam update option to consider only packages that appeared after a given date, maybe. But crippling the whole opam-repository would be terrible.
Date is an absurd pruning criterion.
Some software is mostly "done" and stable. Could we dispel the myth that only software with PR churn and an active issue tracker is worth using? Some people do work well :-)
To give an idea, a package like `xmlm` had a release in 2013, another one in 2017 (safe string support), and one in 2022 (because of 5.0 deprecations).
Indeed, there are plenty of things in `opam update` that can be improved without having to redefine the problem (the slowness of the present implementation is extremely painful on Windows, so it's very much "on the radar" to be improved).
Hello, I have a prototype here: https://github.com/UnixJunkie/opmer. Be careful, it is not yet super fast and it might very well contain bugs.
Here is a usage example with two opam-repository checkouts: