mwleeds opened 3 years ago
Hmm, not much I can think to do better there since you have to compress the objects sooner or later. The amount of time looking up symbols in the dynamic linker seems odd, though.
One thing you could do is turn the compression level way down. I've never actually done this, but I believe in the destination repo you would put this in the config:
[archive]
zlib-level=1
Whereas the default is 6. If you're only using archive mode for the handling of metadata and not for the size savings, then this might be a win since you'd presumably spend less time in zlib. See https://github.com/ostreedev/ostree/blob/master/src/libostree/ostree-repo.c#L2964, but I'm not totally sure that's used when pulling objects.
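To make the tradeoff concrete, here's a minimal sketch (plain Python zlib, not ostree code) comparing level 1 against the default level 6 — level 1 trades some output size for less CPU, which is the whole point of the suggestion above:

```python
import zlib

# A compressible stand-in for ostree object data.
data = b"ostree object payload " * 5000

fast = zlib.compress(data, 1)     # zlib-level=1: least CPU, larger output
default = zlib.compress(data, 6)  # zlib-level=6: the default

# Both round-trip losslessly; only CPU cost and output size differ.
assert zlib.decompress(fast) == data
assert zlib.decompress(default) == data

print("raw:", len(data), "level 1:", len(fast), "level 6:", len(default))
```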
Interesting idea, but it doesn't seem to make a measurable difference, at least for runtime/org.videolan.VLC.Plugin.makemkv/x86_64/3-19.08. With the lower compression level the repo on the drive ended up being 13 MB instead of 12 MB, but the copy took slightly more time (13.920s vs 13.767s). I deleted the repo and re-initialized it with ostree --repo=/media/mwleeds/updates/.ostree/repo init --mode=archive before each trial.
Sysprof tells a slightly different story than perf: most of the time is spent in kernel space. It would possibly be worth looking into why __fxstatat accounts for so much.
In this case the command was: /usr/bin/ostree --repo=/var/lib/flatpak/repo create-usb /media/mwleeds/updates/ org.flathub.Stable app/org.gnome.Builder/x86_64/stable
@wjt pointed out that the Sysprof info is probably incomplete due to -fomit-frame-pointer being used to compile Fedora.
At hughsie's suggestion, I tried a few different combinations of filesystems and repo modes on the destination drive (the source repo being in bare mode). I deleted and recreated the destination repo in each case and used ostree --repo=/var/lib/flatpak/repo create-usb /run/media/e/updates org.flathub.Stable app/org.libreoffice.LibreOffice/x86_64/stable.
| mode | FAT32 | ext4 |
|---|---|---|
| archive | 14m56s | 1m45s |
| bare | errored out with "fchown: Operation not permitted" | errored out with "fchown: Operation not permitted" |
| bare-user | ostree init errored out with "fsetxattr(user.ostreemeta): Operation not permitted" | 2m40s |
| bare-user-only | errored out with "Writing content object: Operation not permitted" | 2m46s |
The performance difference in archive mode was remarkable, so I ran it again to be sure and got 14m12s and 1m22s the second time, on FAT32 and ext4 respectively. I also unmounted the filesystem soon after the command completed, to be sure the data was actually copied to it and not sitting in a cache. I guess the question here becomes "Why is create-usb so slow when using FAT?"
Rough. My guess is that something in the kernel fat filesystem is not optimized for the pattern ostree is using. I don't know if you have this kind of control in the real scenario, but you could try formatting the filesystem with exfat. Then make sure exfat-fuse isn't installed so you get the kernel's native exfat filesystem.
Also, I think you can simplify your testing a bit by taking the USB out of the setup. You can just create some loopback filesystems. Like so:
truncate -s 2G fat.img
/sbin/mkfs.fat fat.img
fat_loop=$(sudo losetup --show -f fat.img)
mkdir fat
sudo mount -o uid=$(id -u),gid=$(id -g) $fat_loop fat
truncate -s 2G exfat.img
/sbin/mkfs.exfat exfat.img
exfat_loop=$(sudo losetup --show -f exfat.img)
mkdir exfat
sudo mount -o uid=$(id -u),gid=$(id -g) $exfat_loop exfat
truncate -s 2G ext4.img
/sbin/mkfs.ext4 ext4.img
ext4_loop=$(sudo losetup --show -f ext4.img)
mkdir ext4
sudo mount $ext4_loop ext4
sudo chown $(id -u):$(id -g) ext4
When you're done, umount each one and losetup -d each loop device. You can also consider sync -f for ensuring the data is written without unmounting.
> My guess is that something in the kernel fat filesystem is not optimized for the pattern ostree is using. I don't know if you have this kind of control in the real scenario, but you could try formatting the filesystem with exfat. Then make sure exfat-fuse isn't installed so you get the kernel's native exfat filesystem.

Perhaps that's worth trying, but unless we're going to require the stick be re-formatted before allowing a copy to it, that won't help for the (majority of?) USB sticks in the wild that are already FAT-formatted.

> Also, I think you can simplify your testing a bit by taking the USB out of the setup. You can just create some loopback filesystems.
Thanks, that might be useful, but I think we want to stay as close to the real-world scenario as possible when gathering data.
I gathered some backtraces with create-usb running under gdb on each filesystem in case they're illuminating, but I haven't really dug into them yet: create-usb-ext4-backtraces.log create-usb-fat-backtraces.log
@cgwalters let me know if I'm wrong here, but I think the way the data gets copied is via glnx_regfile_copy_bytes. That first tries a reflink, then copy_file_range, then sendfile, and finally just a regular read/write loop.
It would be interesting to know which one of these paths fat ends up in. My guess is that it has to fall back to a read/write loop due to lack of support in the filesystem. Not that anything can be done there, but my guess is that the actual copying of data is where this is taking the most time. On the other hand, if your sysprof profile is right, then maybe fstatat is just excessively slow on fat.
Someone on StackOverflow suggested the async,flush mount options would help, but it is already mounted with flush by default, and when I manually remount it with async that option doesn't show up in the mount output, so I guess it's ignored. It makes sense that it would help for the data to be written to the drive asynchronously with a buffer, but I haven't been able to verify that yet.
@mwleeds could you try it under strace again but with -T -ttt to get timestamps and the time spent in each syscall? Also probably -f, since pull (and glib in general) is multi-threaded.
I wrote a script to test what I was thinking about in https://gist.github.com/dbnicholson/ef93d0cb2b262212566f5e71f04a6502. Here's a strace summary comparison of fat, exfat and ext4:
fat summary:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
25.52 8.188703 306 26740 3663 futex
25.40 8.151031 303 26815 poll
15.19 4.874937 108 45113 39691 newfstatat
7.53 2.415129 144 16742 5420 openat
4.95 1.588341 87 18131 write
4.58 1.469256 285 5144 renameat2
4.34 1.392292 81 17165 read
2.80 0.898118 79 11326 close
2.60 0.834847 79 10553 10028 mkdirat
2.56 0.822568 82 9934 lseek
1.30 0.418655 81 5147 renameat
1.29 0.414929 80 5147 fchmod
1.24 0.396960 82 4790 flistxattr
0.15 0.048280 30 1606 fstat
0.13 0.042539 40 1052 getdents64
0.10 0.030876 120 257 fsync
0.06 0.018448 9224 2 fstatfs
0.05 0.015797 64 246 copy_file_range
0.05 0.014989 59 251 248 ioctl
0.04 0.012151 11 1021 mprotect
0.03 0.008154 75 108 llistxattr
0.02 0.007578 70 108 readlinkat
0.02 0.007388 6 1058 fcntl
0.01 0.003794 3794 1 syncfs
0.01 0.003520 13 263 2 unlinkat
0.00 0.001152 3 321 mmap
0.00 0.000831 63 13 clone
0.00 0.000700 28 25 munmap
0.00 0.000539 49 11 brk
0.00 0.000472 33 14 set_robust_list
0.00 0.000434 39 11 prctl
0.00 0.000371 33 11 madvise
0.00 0.000355 32 11 sched_setattr
0.00 0.000326 29 11 gettid
0.00 0.000229 114 2 fdatasync
0.00 0.000022 7 3 fallocate
0.00 0.000018 0 21 8 stat
0.00 0.000009 9 1 getrandom
0.00 0.000005 5 1 utimensat
0.00 0.000002 2 1 getuid
0.00 0.000000 0 5 rt_sigaction
0.00 0.000000 0 3 rt_sigprocmask
0.00 0.000000 0 3 3 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 sysinfo
0.00 0.000000 0 2 2 statfs
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 3 faccessat
0.00 0.000000 0 6 eventfd2
0.00 0.000000 0 1 prlimit64
0.00 0.000000 0 1 sched_getattr
------ ----------- ----------- --------- --------- ----------------
100.00 32.084745 153 209205 59065 total
exfat summary:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
31.09 4.243226 149 28351 3844 futex
26.60 3.630377 136 26547 poll
12.63 1.723387 38 45113 39691 newfstatat
5.90 0.805443 48 16748 5422 openat
5.33 0.727306 40 17860 write
4.61 0.629284 37 16900 read
2.73 0.372877 32 11330 close
2.69 0.367069 36 9934 lseek
1.95 0.266417 51 5144 renameat2
1.92 0.262108 24 10553 10028 mkdirat
1.37 0.186924 36 5147 fchmod
1.27 0.173778 36 4790 flistxattr
0.78 0.106253 20 5147 renameat
0.38 0.052010 49 1052 getdents64
0.34 0.045983 178 257 fsync
0.15 0.020269 12 1606 fstat
0.05 0.006552 26 246 copy_file_range
0.05 0.006474 6 1014 mprotect
0.04 0.005337 21 253 250 ioctl
0.03 0.003829 3 1058 fcntl
0.03 0.003782 35 108 readlinkat
0.02 0.003132 29 108 llistxattr
0.02 0.002901 10 265 2 unlinkat
0.02 0.002337 7 321 mmap
0.02 0.002168 2168 1 syncfs
0.00 0.000333 12 26 munmap
0.00 0.000159 14 11 brk
0.00 0.000106 17 6 eventfd2
0.00 0.000097 7 13 clone
0.00 0.000081 7 11 madvise
0.00 0.000039 1 21 8 stat
0.00 0.000029 2 11 prctl
0.00 0.000024 8 3 rt_sigprocmask
0.00 0.000011 3 3 faccessat
0.00 0.000011 0 14 set_robust_list
0.00 0.000010 10 1 sched_getattr
0.00 0.000009 0 11 sched_setattr
0.00 0.000008 0 11 gettid
0.00 0.000003 3 1 getrandom
0.00 0.000002 0 5 rt_sigaction
0.00 0.000000 0 3 3 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 2 fdatasync
0.00 0.000000 0 1 sysinfo
0.00 0.000000 0 1 getuid
0.00 0.000000 0 2 2 statfs
0.00 0.000000 0 2 fstatfs
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 utimensat
0.00 0.000000 0 3 3 fallocate
0.00 0.000000 0 1 prlimit64
------ ----------- ----------- --------- --------- ----------------
100.00 13.650145 64 210020 59253 total
ext4 summary:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
31.48 2.913297 102 28370 3811 futex
27.25 2.521613 95 26523 poll
12.56 1.162287 25 45113 39691 newfstatat
5.35 0.495051 27 17868 write
4.39 0.406229 24 16874 read
3.39 0.313524 27 11591 271 openat
2.92 0.269855 23 11324 close
2.81 0.260116 25 10295 fchmod
2.72 0.251473 25 9934 lseek
2.09 0.193444 18 10553 10028 mkdirat
1.68 0.155772 30 5147 linkat
1.37 0.127162 26 4790 flistxattr
1.16 0.107166 20 5147 renameat
0.18 0.016350 10 1606 fstat
0.17 0.015664 14 1052 getdents64
0.12 0.011223 43 257 fsync
0.07 0.006587 6 1058 fcntl
0.04 0.003740 3 1019 mprotect
0.04 0.003655 14 246 copy_file_range
0.04 0.003583 14 250 247 ioctl
0.03 0.002857 2857 1 syncfs
0.03 0.002639 24 108 readlinkat
0.03 0.002635 8 321 mmap
0.03 0.002543 23 108 llistxattr
0.03 0.002388 9 261 2 unlinkat
0.01 0.001196 108 11 madvise
0.01 0.000573 20 28 munmap
0.00 0.000333 25 13 clone
0.00 0.000307 27 11 prctl
0.00 0.000290 26 11 brk
0.00 0.000287 143 2 fdatasync
0.00 0.000256 23 11 gettid
0.00 0.000256 18 14 set_robust_list
0.00 0.000170 15 11 sched_setattr
0.00 0.000162 162 1 execve
0.00 0.000078 26 3 fallocate
0.00 0.000045 2 21 8 stat
0.00 0.000032 10 3 rt_sigprocmask
0.00 0.000029 4 6 eventfd2
0.00 0.000024 24 1 utimensat
0.00 0.000014 4 3 faccessat
0.00 0.000007 2 3 3 access
0.00 0.000004 4 1 arch_prctl
0.00 0.000003 3 1 sched_getattr
0.00 0.000002 0 5 rt_sigaction
0.00 0.000000 0 1 sysinfo
0.00 0.000000 0 1 getuid
0.00 0.000000 0 2 2 statfs
0.00 0.000000 0 2 fstatfs
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 prlimit64
0.00 0.000000 0 1 getrandom
------ ----------- ----------- --------- --------- ----------------
100.00 9.254921 44 209985 54063 total
Executive summary: everything is much slower on fat.
Well, I think one thing we can say is there's a lot of wasted time in fstatat, considering that 88% of the calls end in errors (presumably ENOENT).
I don't know why create-usb on exfat works for you. For me I get this error:
e@endless:~$ time ostree --repo=/var/lib/flatpak/repo create-usb /run/media/e/CA69-5059/ org.flathub.Stable app/org.libreoffice.LibreOffice/x86_64/stable
error: Importing a04334a21363a6f337aeae6efe55af43b5e42e9b108346acba9be6e790425fac.commit: renameat: Function not implemented
real 0m0.121s
user 0m0.074s
sys 0m0.017s
Just wanted to reproduce your results.
What kernel are you running? Is the USB mounted with fuse?
> What kernel are you running?
$ uname -a
Linux endless 5.8.0-14-generic #15+dev113.1535365beos3.9.2 SMP Sun Jan 31 00:28:42 UTC 2021 x86_64 GNU/Linux
> Is the USB mounted with fuse?
Not sure how to check, whatever happens by default?
Hmm, I was running 5.10 from master. Run findmnt /run/media/e/CA69-5059 or just mount and look for the path you want. If it says exfat in the FSTYPE column, it's the kernel's exfat module. If it says fuseblk, then it's exfat-fuse.
> Run findmnt /run/media/e/CA69-5059 or just mount and look for the path you want. If it says exfat in the FSTYPE column, it's the kernel's exfat module. If it says fuseblk, then it's exfat-fuse.

It says fuseblk.
> Someone on StackOverflow suggested the async,flush mount options would help, but it is already mounted with flush by default, and when I manually remount it with async that option doesn't show up in the mount output so I guess it's ignored.

I think async is just omitted from the options list since it's the default. And it seems there's about a 20-28% performance improvement when async is used:
e@endless:~$ sudo mount -o remount,sync,rw,nosuid,nodev,relatime,uid=1000,gid=1000,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,showexec,utf8,flush,errors=remount-ro,uhelper=udisks2 /dev/sdb1 /run/media/e/3376-D9B3
e@endless:~$ rm -rf /run/media/e/3376-D9B3/.ostree/
e@endless:~$ time (ostree --repo=/var/lib/flatpak/repo create-usb /run/media/e/3376-D9B3/ org.flathub.Stable app/org.libreoffice.LibreOffice/x86_64/stable && umount /run/media/e/3376-D9B3)
real 23m56.483s
user 1m4.541s
sys 0m33.017s
e@endless:~$ sudo mount -o remount,async,rw,nosuid,nodev,relatime,uid=1000,gid=1000,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,showexec,utf8,flush,errors=remount-ro,uhelper=udisks2 /dev/sdb1 /run/media/e/3376-D9B3
e@endless:~$ rm -rf /run/media/e/3376-D9B3/.ostree/
e@endless:~$ time (ostree --repo=/var/lib/flatpak/repo create-usb /run/media/e/3376-D9B3/ org.flathub.Stable app/org.libreoffice.LibreOffice/x86_64/stable && umount /run/media/e/3376-D9B3)
real 17m12.590s
user 1m1.364s
sys 0m7.077s
I repeated both runs and got 21m43s and 17m27s for sync and async respectively.
It would be neat if ostree create-usb could try remounting the drive with async, but it would probably make more sense for eos-updater-prepare-volume to do it, since that is already running as root (at the cost of other users of ostree create-usb not getting this performance benefit).
I believe that the majority of the fstatat calls are in the pull code, to determine whether an object is needed or not. First it checks whether the object is in the objects directory, then whether it's in the transaction's staging directory. The optimization that immediately comes to mind is to enumerate all the objects into a set first and then make the determinations from that. The staging directory is locked exclusively, so there's no chance of concurrent writers. The objects directory isn't, but then you just run the risk of duplicating a pull of an object with a concurrent process.
Another thought is whether faccessat might be cheaper than fstatat if all you care about is existence. I don't know if that's actually a win or not, but it could be interesting to see if it makes a difference at thousands of calls.
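The faccessat idea is easy to sanity-check with a hypothetical micro-benchmark (plain Python, not ostree code): both probes answer "does this object exist?", one via stat and one via access, against paths that mostly don't exist, mimicking the ~88% ENOENT rate in the strace summaries:

```python
import os
import tempfile
import timeit


def probe_stat(paths):
    """Existence check via os.stat() (newfstatat under the hood)."""
    hits = 0
    for p in paths:
        try:
            os.stat(p)
            hits += 1
        except FileNotFoundError:
            pass
    return hits


def probe_access(paths):
    """Existence check via os.access(F_OK) (faccessat under the hood)."""
    return sum(os.access(p, os.F_OK) for p in paths)


# Probe 256 paths that do not exist and compare the two syscalls' cost.
with tempfile.TemporaryDirectory() as repo:
    paths = [os.path.join(repo, f"{i:02x}", "obj.filez") for i in range(256)]
    assert probe_stat(paths) == probe_access(paths) == 0
    print("stat:  ", timeit.timeit(lambda: probe_stat(paths), number=100))
    print("access:", timeit.timeit(lambda: probe_access(paths), number=100))
```

Whether access wins on FAT specifically would need to be measured on the FAT mount itself.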
> Run findmnt /run/media/e/CA69-5059 or just mount and look for the path you want. If it says exfat in the FSTYPE column, it's the kernel's exfat module. If it says fuseblk, then it's exfat-fuse.
Oh yeah it works for me too on master.
e@endless:~$ mount | grep exfat
/dev/sdb1 on /run/media/e/C612-3A01 type exfat (rw,nosuid,nodev,relatime,uid=1000,gid=1000,fmask=0022,dmask=0022,iocharset=utf8,errors=remount-ro,uhelper=udisks2)
e@endless:~$ rm -rf /run/media/e/C612-3A01/.ostree/
e@endless:~$ time (ostree --repo=/var/lib/flatpak/repo create-usb /run/media/e/C612-3A01/ org.flathub.Stable app/org.libreoffice.LibreOffice/x86_64/stable && umount /run/media/e/C612-3A01)
real 3m40.622s
user 1m1.350s
sys 0m4.870s
So, I was only half right on the fstatats. It does happen during the pull via ostree_repo_has_object, but then it also happens when writing out the objects via _ostree_repo_has_loose_object. The former is just a wrapper around the latter that also checks parent repos. If you're in a transaction, these check first in the staging directory and then in the objects directory.
During the pull, before the next needed object is pulled, it's checked to see if it's in the repo. That makes sense since you don't want to pull something you already have. During the writing out phase it checks again to skip a potentially expensive write that might not be needed. Again, that makes sense as you don't want to overwrite something with the same content. Still, that's 4 stats per new object.
Since the application holds a shared lock on the repo during the pull, no prune can be run (that would require an exclusive lock). That locking actually didn't exist when a lot of the pull/write code was written. If you could enumerate all the objects at the beginning of the pull, you could skip all of the stats, with the potential that you might duplicate work with a concurrent application doing a pull or other commit. But it should be safe to do, since you wouldn't decide to skip an object that might then disappear.
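A sketch of what that up-front enumeration could look like (hypothetical Python, assuming ostree's loose-object layout of objects/<2 hex chars>/<remaining hex>.<type>; the real fix would live in the C pull code):

```python
import os


def loose_object_set(objects_dir):
    """Enumerate every loose object in one readdir sweep.

    Returns a set of (checksum, objtype) pairs, so the per-object
    "do we already have this?" question becomes an in-memory lookup
    instead of a stat() syscall.
    """
    have = set()
    try:
        prefixes = os.scandir(objects_dir)
    except FileNotFoundError:
        return have
    with prefixes:
        for prefix in prefixes:
            # Loose objects live under two-hex-char prefix directories.
            if not prefix.is_dir() or len(prefix.name) != 2:
                continue
            with os.scandir(prefix.path) as objs:
                for obj in objs:
                    rest, dot, objtype = obj.name.partition(".")
                    if dot:
                        have.add((prefix.name + rest, objtype))
    return have


# Membership tests are then pure in-memory lookups, e.g.:
#   ("<64-hex checksum>", "commit") in have
```

Two getdents sweeps per prefix directory replace the tens of thousands of newfstatat calls in the summaries above, at the cost of the staleness window already discussed.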
The other thing in my strace summary was a lot of time blocking in futex. That tells me there's a lot of lock contention, and when I was browsing through the code it really seemed like txn_lock is taken a lot to protect the transaction stats. These are all integers that could easily be handled locklessly with the g_atomic_* family, I think. Like this kind of change:
diff --git a/src/libostree/ostree-repo-commit.c b/src/libostree/ostree-repo-commit.c
index 9d5c6d3b..4c521e49 100644
--- a/src/libostree/ostree-repo-commit.c
+++ b/src/libostree/ostree-repo-commit.c
@@ -1453,10 +1451,8 @@ write_metadata_object (OstreeRepo *self,
}
/* Update the stats, note we both wrote one and add to total */
- g_mutex_lock (&self->txn_lock);
- self->txn.stats.metadata_objects_written++;
- self->txn.stats.metadata_objects_total++;
- g_mutex_unlock (&self->txn_lock);
+ g_atomic_int_inc (&self->txn.stats.metadata_objects_written);
+ g_atomic_int_inc (&self->txn.stats.metadata_objects_total);
if (out_csum)
*out_csum = ostree_checksum_to_bytes (actual_checksum);
Are atomic integers expensive? Maybe, but I'd wager that it would be worth the reduced contention on a mutex guarding like 10 different fields.
On the Endless phabricator instance, I broke up the task of improving the performance of eos-updater-prepare-volume (which uses flatpak create-usb and ostree create-usb) into these subtasks:
- fstatat() calls in ostree pull code to improve create-usb performance
- sync mount option for drives used with "Copy to USB" feature
- faccessat() in ostree create-usb

(Sorry, the ticket numbers are useless to those outside Endless.)
I took a look at using atomic ints for the transaction stats and it was in the noise. Still might be useful since it's silly to guard so many integer fields with a single mutex.
One thing I did confirm is that this isn't a create-usb issue but really a pull-local issue. I ran create-usb and a similar pull-local and the profiles were almost identical. Which makes sense - all of the work is pulling the objects from one repo to the other. About the only other thing that happens is setting up the target repo. That would just make any optimization work simpler.
> I took a look at using atomic ints for the transaction stats and it was in the noise.
Hmm, yes that could help, though by far the best thing here would be to gather the stats per-thread and then merge them at the end (or periodically). This is how things like the kernel per-interface packet counter stats work. It's a map-reduce model instead of a database model.
Unfortunately implementing that will be tricky because we've basically committed to supporting arbitrary threads created externally in the repo API. We would need to have something like ostree_repo_import_object_with_stats() that passed back that data to the caller, and the pull API to use it... messy.
Eventually io-uring is going to propagate far enough that we can hard-depend on it. For cases like this it will likely be a huge win, since things like our fstatat() calls can become asynchronous. For now we could probably threadpool those.
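The per-thread map-reduce stats model described above can be sketched like this (illustrative Python, not a proposal for the actual C API): each thread writes to a private counter with no lock on the hot path, and the totals are summed once at the end.

```python
import threading
from collections import Counter


class PerThreadStats:
    """Map-reduce stats: each thread increments its own Counter (no
    lock on the hot path); merge() sums them after the workers finish."""

    def __init__(self):
        self._local = threading.local()
        self._counters = []
        self._register_lock = threading.Lock()  # taken once per thread

    def incr(self, key, n=1):
        counter = getattr(self._local, "counter", None)
        if counter is None:
            # First increment from this thread: register its counter.
            counter = self._local.counter = Counter()
            with self._register_lock:
                self._counters.append(counter)
        counter[key] += n

    def merge(self):
        """Reduce step. Only exact once the writing threads have
        finished; before that it's an approximate snapshot."""
        total = Counter()
        for counter in self._counters:
            total.update(counter)
        return total
```

With four worker threads each bumping "metadata_objects_written" a thousand times, merge() reports 4000, and the registration lock is taken only once per thread rather than once per object.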
Following on from the mailing list discussion, I'm opening an issue here so I can upload a few artifacts.
The discussion is about how the performance of ostree create-usb could be improved, as it is fairly slow. I tried running it under perf using this command:
rm -rf /media/mwleeds/updates/.ostree && perf record -F 99 --call-graph dwarf ostree --repo=/var/lib/flatpak/repo create-usb /media/mwleeds/updates/ org.flathub.Stable runtime/org.videolan.VLC.Plugin.makemkv/x86_64/3-19.08
Here's perf report:
It seems much of the time is spent in libz.so doing compression. In this case the source repository has mode bare-user-only and the destination one has mode archive-z2. It's interesting that in the PR which made create-usb use archive mode in the destination repo (#1733) I said it was a performance improvement (in that case I think the source repo would have been mode bare, since that's what it is on Endless OS).
I also tried running create-usb under strace, both with and without --disable-fsync:
strace.log
strace-disable-fsync.log
That confirmed that --disable-fsync avoids about 115 fsync() calls, as expected (but delaying fsync() to the end of the copy doesn't seem to make a notable difference for performance). Oddly, the log also shows accesses for several refs other than the one specified, such as for org.gimp.GIMP.Manual. So that may merit some investigation.