vasi / squashfuse

FUSE filesystem to mount squashfs archives

High-level squashfuse optimization opportunities #73

Open · haampie opened this issue 2 years ago

haampie commented 2 years ago

I've documented some performance issues with squashfuse here: https://github.com/haampie/squashfs-mount/. I'm observing about a 1.5x increase in compile time of LLVM when mounting compilers from a squashfs file using squashfuse compared to just using (lib)mount.

Is this overhead expected?
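For context, the two setups being compared can be sketched as follows. This is a hedged illustration in C using util-linux's libmount, not code taken from squashfs-mount; the archive name file.squashfs and the mountpoint /x are placeholders.

/* Roughly the kernel-driver path: attach the archive to a loop device and
 * mount it with the in-kernel squashfs driver, via libmount.
 * Build with: gcc mount_sketch.c -lmount   (needs root to mount) */
#include <libmount/libmount.h>
#include <stdio.h>

int main(void)
{
    struct libmnt_context *cxt = mnt_new_context();
    if (!cxt)
        return 1;

    /* Equivalent of: mount -t squashfs -o loop,ro file.squashfs /x */
    mnt_context_set_fstype(cxt, "squashfs");
    mnt_context_set_source(cxt, "file.squashfs");
    mnt_context_set_target(cxt, "/x");
    mnt_context_set_options(cxt, "loop,ro");

    int rc = mnt_context_mount(cxt);   /* 0 on success */
    if (rc != 0)
        fprintf(stderr, "mount failed (rc=%d)\n", rc);

    mnt_free_context(cxt);
    return rc != 0;
}

The FUSE route is simply squashfuse file.squashfs /x (or squashfuse_ll file.squashfs /x), where every filesystem request is forwarded to a userspace process instead of being handled by the kernel's squashfs driver.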

vasi commented 2 years ago

You would definitely expect overhead from FUSE over an in-kernel driver, yes.

haampie commented 2 years ago

Okay, then I'll stick to the kernel version for now. Thanks!

haampie commented 1 year ago

FWIW: when using squashfuse_ll instead of squashfuse I get a 10x speedup for du -sh mountpoint/.

Perf tells me that squashfuse spends the vast majority of its time decompressing, whereas squashfuse_ll spends only about 5% of its time there.

Is there any reason to keep the high-level version if the low-level version performs so much better?

vasi commented 1 year ago

The high-level version uses a simpler FUSE API that is more widely available; a number of platforms (Minix, NetBSD) only support the high-level API. Unfortunately, the high-level API doesn't map one-to-one(-ish) onto kernel VFS operations. Instead it talks to a library layer that manages things like inode allocation, which makes it inherently slower. Anything that touches many different inodes, like du, should be hit particularly hard.
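To make that concrete, here is a hedged, signature-level sketch of the two libfuse APIs (FUSE 2.x style). The callback names and bodies below are invented for illustration and are not squashfuse's actual code. In the high-level API every request arrives as a path string and libfuse's internal name/inode table does the translation; in the low-level API the filesystem chooses its own fuse_ino_t (squashfuse_ll can hand back the squashfs inode directly), so there is no per-request path bookkeeping.

/* Compile check: gcc -D_FILE_OFFSET_BITS=64 -c api_sketch.c $(pkg-config --cflags fuse) */
#define FUSE_USE_VERSION 26
#include <fuse.h>          /* high-level API: path-based callbacks  */
#include <fuse_lowlevel.h> /* low-level API: inode-based callbacks  */
#include <sys/stat.h>
#include <string.h>
#include <errno.h>

/* High-level: every request carries a full path; libfuse resolves the
 * kernel's inode numbers to paths behind the scenes, so metadata-heavy
 * workloads (du) pay for that bookkeeping on every call. */
static int hl_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    st->st_mode = S_IFDIR | 0755;
    st->st_nlink = 2;
    return 0;
}

/* Low-level: the filesystem reports inodes (fuse_ino_t) of its own choosing,
 * and later requests refer to them directly, with no path translation. */
static void ll_lookup(fuse_req_t req, fuse_ino_t parent, const char *name)
{
    (void)parent; (void)name;
    fuse_reply_err(req, ENOENT);          /* sketch: nothing to look up */
}

static void ll_getattr(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
{
    (void)fi;
    struct stat st = { .st_ino = ino, .st_mode = S_IFDIR | 0755, .st_nlink = 2 };
    fuse_reply_attr(req, &st, 1.0);       /* 1-second attribute timeout */
}

static const struct fuse_operations hl_ops = { .getattr = hl_getattr };
static const struct fuse_lowlevel_ops ll_ops = { .lookup = ll_lookup,
                                                 .getattr = ll_getattr };

A metadata walk like du issues a getattr for every inode; in the high-level case each of those goes through the per-path translation layer, which is consistent with the timings reported below.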

If squashfuse_ll works better for you, then I recommend sticking with it! But I'd like to keep squashfuse available on other platforms, and there's no harm in leaving the high-level version around.

Let me rename this ticket to something about optimizing high-level squashfuse, since that seems to be where this has landed. Please explain how you did your testing and share the results you got; then anybody who wants to spend time optimizing high-level squashfuse can use this ticket as a starting point.

haampie commented 1 year ago

Using squashfuse, the timing is consistently:

$ time du -sh /x
43G /x

real    0m12.548s
user    0m0.040s
sys 0m0.592s

$ time du -sh /x
43G /x

real    0m12.450s
user    0m0.024s
sys 0m0.569s

$ time du -sh /x
43G /x

real    0m12.397s
user    0m0.059s
sys 0m0.526s

There are no caching effects between runs.

squashfuse_ll is about 13x faster on the first run and about 45x faster on the second and later runs:

$ squashfuse_ll file.squashfs /x

$ time du -sh /x
42G /x

real    0m0.902s
user    0m0.040s
sys 0m0.405s

$ time du -sh /x
42G /x

real    0m0.275s
user    0m0.018s
sys 0m0.167s

$ time du -sh /x
42G /x

real    0m0.269s
user    0m0.005s
sys 0m0.176s

A plain kernel mount is fastest:

$ time du -sh /x
42G /x

real    0m0.527s
user    0m0.020s
sys 0m0.497s

$ time du -sh /x
42G /x

real    0m0.108s
user    0m0.032s
sys 0m0.075s

$ time du -sh /x
42G /x

real    0m0.109s
user    0m0.028s
sys 0m0.080s

Perf shows that squashfuse spends the bulk of its time decompressing:

# Children      Self  Command     Shared Object       Symbol                                                   
# ........  ........  ..........  ..................  .........................................................
#
    44.07%    44.06%  squashfuse  libzstd.so.1.5.2    [.] ZSTD_decompressBlock_internal.part.13
            |          
             --44.04%--ZSTD_decompressBlock_internal.part.13

    16.41%    16.41%  squashfuse  libzstd.so.1.5.2    [.] _HUF_decompress4X1_usingDTable_internal_bmi2_asm_loop
            |          
             --16.40%--_HUF_decompress4X1_usingDTable_internal_bmi2_asm_loop