openscopeproject / ZipROFS

FUSE file system with transparent access to zip files as if they were folders.
MIT License

Is zipROFS slow? #9

Open christophgil opened 1 year ago

christophgil commented 1 year ago

ZipROFS is exactly what we need.

We have tons of mass spectrometry data; each record is a zip containing several files. However, they are very large, some several gigabytes.

I measured the time to read a 600 MB file:
time nocache cat my-file >/dev/null

Mounting a single zip with fuse-zip, this takes 3 s; with ZipROFS it takes 6 min.

Is there a trick to make ZipROFS faster?

I tried blksize=8096, but it has no effect.

I downloaded the latest ZipROFS. I installed fusepy:

pip install fusepy -U
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: fusepy in ./.local/lib/python3.10/site-packages (3.0.1)

Many thanks for this nice software.

JuniorJPDJ commented 1 year ago

Try also ratarmount

qu1ck commented 1 year ago

I wrote this with a "lots of zip files with lots of small files" use case in mind, and it performs well enough for me. I haven't run any benchmarks on large files, but a difference of two orders of magnitude does not seem right.

I'll investigate.

qu1ck commented 1 year ago

I tested on a zip file with two random files of 512 MB each. dd can achieve 30 MB/s reads:

$ ls -lah
total 0
dr--r--r-- 1 quick quick 1.1G Jan 31 12:10 .
drwxr-xr-x 2 quick quick 4.0K Jan 31 12:10 ..
-r-xr-xr-x 1 quick quick 512M Jan 31 12:09 random2.data
-r-xr-xr-x 1 quick quick 512M Jan 31 12:08 random.data

$ dd if=./random2.data  of=/dev/null bs=1M status=progress
510656512 bytes (511 MB, 487 MiB) copied, 16 s, 31.9 MB/s
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 16.8555 s, 31.9 MB/s

$ nocache dd if=./random2.data  of=/dev/null bs=1M status=progress
507510784 bytes (508 MB, 484 MiB) copied, 16 s, 31.7 MB/s
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 17.0205 s, 31.5 MB/s

cat is significantly slower on the same file, at ~3 MB/s on average. It starts at ~10 MB/s and by the end of the file slows to a crawl of < 1 MB/s.

Looking at the debug logs, the reason becomes obvious: cat does out-of-order reads somehow. Instead of reading each chunk strictly one after the other like dd, it sometimes reads ahead, then seeks back to a previous chunk, and keeps jumping around like that. This destroys zip read performance, because every backward seek requires re-reading the compressed stream from the start of the entry; that's just how zip works.

It's possible that it's not cat's "fault": there may be some reordering of syscalls happening in FUSE or fusepy, or Python itself may not be handling the calls in order.

Either way, I suggest using a program with strict sequential reads, like dd, to copy the file to a ramdisk or another more suitable location before using it in something that needs fast random-access performance. 30 MB/s is not particularly fast, but it's a lot better than 3 MB/s.

Better performance for sequential reads would need a reimplementation in C/C++/Rust.
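
To see the same effect outside FUSE, here is a minimal sketch (archive and member names are hypothetical) comparing a strictly sequential read against one that occasionally seeks backwards; on a compressed member, each backward seek forces zipfile to decompress from the start of the entry again:

import time, zipfile

CHUNK = 128 * 1024  # roughly the read size FUSE uses

def read_sequential(zf, name):
    with zf.open(name) as f:
        while f.read(CHUNK):
            pass

def read_with_back_seeks(zf, name):
    # mimic the out-of-order pattern seen in the debug logs:
    # every fourth chunk, seek back one chunk and re-read it
    with zf.open(name) as f:
        pos = count = 0
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            pos += len(data)
            count += 1
            if count % 4 == 0:
                pos -= CHUNK
                f.seek(pos)  # backward seek: re-decompresses from the start of the entry

with zipfile.ZipFile("random.zip") as zf:
    for fn in (read_sequential, read_with_back_seeks):
        start = time.time()
        fn(zf, "random.data")
        print(fn.__name__, round(time.time() - start, 1), "s")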

christophgil commented 1 year ago

Cool, thank you.


christophgil commented 1 year ago

You are right. I tried checksum computation (rhash --printf='%C' --crc32) and the performance is much better: 14 s vs 3 s for ZipROFS and fuse-zip respectively. This would be sufficient. I need to check how fast our programs read the data; maybe ZipROFS is the way to go. I wonder whether one could increase the block size or tune other parameters.


qu1ck commented 1 year ago

I found that it was actually FUSE reordering the reads, because async reads are enabled by default when they are supported by the kernel.

It took some monkeypatching to disable this, since fusepy does not currently support passing the requested capability flags. But with the latest commit I was able to bring cat performance up to the same level; I get ~28 MB/s now.

Async reads can still be enabled by passing the async option to ziprofs.
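
For example (the exact option spelling may differ; see ziprofs.py --help):

python3 ziprofs.py ~/zips ~/mnt                 # async reads disabled (new default)
python3 ziprofs.py ~/zips ~/mnt -o async        # opt back in to async reads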

Try the latest version.

JuniorJPDJ commented 1 year ago

You may need my zipfile patches applied when accessing uncompressed zip files. Those make random reads muuuch better: https://github.com/python/cpython/pull/27737

qu1ck commented 1 year ago

That will help too, but looking at the underlying library sources there is just too much memory copying going on to achieve anything close to native speeds. I found at least two places where the read buffer is copied (once in ZipExtFile, once in fusepy).

To get somewhat close to native read speeds this needs to be reimplemented in a language that is not afraid of pointers.

christophgil commented 1 year ago

Yes indeed, the biggest file in our zips is uncompressed. Thanks JuniorJPDJ.

A first test indicates that it works with our software, a Windows exe that uses closed-source DLLs for file loading and runs via Wine. It only accepts folders ending in ".d". I solved this by creating empty folders and symlinking the required files into them.

I tried to understand the Python code. I want to change it so that files ending in ".d.Zip" give rise to folders ending in ".d", but I could not yet figure out how. I haven't done Python for 20 years; however, your code looks beautiful and understandable.

Thank you for this great file system which will make things so much easier for us.


qu1ck commented 1 year ago

Changing the name that a folder corresponding to a zip file gets in the mounted tree is not trivial. The assumption that any path containing *.zip as one of its elements is a path within a zip, and not a real path in the underlying filesystem, is baked in pretty deep.

If you want to change that, look at the get_zip_path() function. Currently it returns the prefix that is the path to the zip file if the argument is a path within a zip file, and None otherwise.

You would need to change the logic of get_zip_path() and every caller of it, because previously the zip file path was a strict prefix of the virtual path, but with the modified names it no longer is. For example, "/path/to/some.d/file.txt" may actually be file.txt inside "/path/to/some.d.zip".
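
To make the shape of that change concrete, here is a rough sketch (names and details are illustrative, not the actual ZipROFS code):

import os

def get_zip_path(path):
    # current idea: the zip path is a strict prefix of the virtual path
    parts = path.split(os.sep)
    for i in range(1, len(parts) + 1):
        prefix = os.sep.join(parts[:i])
        if prefix.lower().endswith(".zip") and os.path.isfile(prefix):
            return prefix
    return None

def get_zip_path_dotd(path):
    # ".d" variant: a virtual component "foo.d" may stand for "foo.d.zip"
    # on disk, so the returned real path is no longer a prefix of `path`
    # and every caller has to translate between the two forms
    parts = path.split(os.sep)
    for i in range(1, len(parts) + 1):
        prefix = os.sep.join(parts[:i])
        if prefix.endswith(".d") and os.path.isfile(prefix + ".zip"):
            return prefix + ".zip"
    return None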

christophgil commented 1 year ago

Thank you - this is what I was already assuming.

Either I leave it the way it is and live with the symlink creation, or I make a magic prefix like my_file.d.Zip -> ZiP_myfile.d and use ZiP.*.d as the recognition pattern. That would work, wouldn't it? Cheers C


qu1ck commented 1 year ago

Yes, it would, but the complexity is about the same: you still have to add translation logic from real paths to virtual paths and vice versa.

Maybe what would be easier in your case is to drop the .zip extension from your files and just name your archives whatever.d. As long as they are still in zip format, I think it would be a one-line change in ZipROFS to treat them as zips: just change the condition here https://github.com/openscopeproject/ZipROFS/blob/dev/ziprofs.py#L110

to be if path[-2:] == ".d" and ...
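
For illustration only (the real line and the surrounding names in ziprofs.py may look different), the idea is simply to swap the extension test:

# before: recognize archives by their ".zip" extension
# after:  recognize archives named "whatever.d" instead
if path[-2:] == ".d" and zipfile.is_zipfile(full_path):   # full_path is hypothetical here
    ...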

christophgil commented 1 year ago

It seems the patching was successful: it is now possible to alter the folder names.

Implementation: starting from JuniorJPDJ's fork (https://github.com/JuniorJPDJ/ZipROFS), I introduced two methods that can be customized to the user's needs: def zipfilename_real_to_virtual(name: str) and def zippath_virtual_to_real(path: str).
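
A minimal sketch of what these two hooks can look like for the "*.d.Zip" naming scheme (simplified; the version in my fork may differ in detail):

import os

def zipfilename_real_to_virtual(name: str) -> str:
    # "sample.d.Zip" on disk is shown as the folder "sample.d" in the mount
    return name[:-4] if name.lower().endswith(".d.zip") else name

def zippath_virtual_to_real(path: str) -> str:
    # inverse mapping, simplified to the last path component:
    # a virtual "sample.d" maps back to "sample.d.Zip" if that file exists
    head, tail = os.path.split(path)
    real = os.path.join(head, tail + ".Zip")
    if tail.endswith(".d") and os.path.isfile(real):
        return real
    return path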

After testing on real data, I will upload my version. Thanks so much for your great help.


JuniorJPDJ commented 1 year ago

Don't use my fork of ZipROFS - it is out of date and already merged into branches here. Try using Python 3.12, or take the zipfile.py library from it and copy it next to your software.

christophgil commented 1 year ago

Ok, thanks.


christophgil commented 1 year ago

I am running Python 3.12.0a4+ from GitHub. The respective zip entries are not compressed. In our data analysis, ZipROFS is about 5 times slower than mounting with fuse-zip.

Minor bug: in the function def getattr(self, path, fh=None), after except KeyError: there should be a return of, say, an empty dict. Returning None was not working; return {} works. Otherwise non-existing zip entries are reported as existing files.
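
(For reference, the usual fusepy pattern for a missing path is to raise FuseOSError(ENOENT) from getattr rather than returning a dict; a minimal sketch of that idiom, in which the lookup helper and attribute fields are illustrative, not the actual ZipROFS code:)

from errno import ENOENT
from fuse import FuseOSError, Operations

class SketchFS(Operations):
    def getattr(self, path, fh=None):
        try:
            info = self._lookup(path)          # hypothetical lookup helper
        except KeyError:
            raise FuseOSError(ENOENT)          # reports "No such file or directory"
        return {'st_mode': info.st_mode, 'st_size': info.st_size}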

There is a performance issue: the function read(self, path, size, offset, fh) returns byte strings. In my case the size is 128 KB, and each call creates a new bytes object that eventually needs to be garbage collected.

Suggestion: take a reusable byte buffer as a function argument and return the number of bytes read.

In the near future we will start computations on 1 TB to 2 TB of data. The current performance is too slow.

Do you see any chance to improve it? Where is the performance lost?

Best C


qu1ck commented 1 year ago

Minor Bug: In the function def getattr(self, path, fh=None): After except KeyError: There should be a return of maybe an empty hash-map. "None" was not working. return {} works. Otherwise non-existing zip entries are reported as existing files.

Can you provide a specific example that results in an incorrect representation? I think the confusion here is that the zip file format can have files in subdirectories without those subdirectories existing as explicit entries in the archive. For example, there can be a file entry with path /a/b inside the zip without a directory entry /a; that's valid, and ziprofs handles such cases correctly.
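
A quick way to see that case (file names here are just an example):

import zipfile

with zipfile.ZipFile("implicit.zip", "w") as zf:
    zf.writestr("a/b", b"hello")   # file entry only; no explicit "a/" entry

with zipfile.ZipFile("implicit.zip") as zf:
    print(zf.namelist())           # ['a/b'] - the directory "a" is only implied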

The function read(self, path, size, offset, fh)

Take a reusable byte buffer as function argument and return the number of bytes read.

Unfortunately that's impossible with the current ZipFile API and fusepy. On the one hand, ZipFile doesn't support reading into a caller-supplied buffer, only read(), which creates its own buffer every time. On the other hand, fusepy doesn't give you a real Pythonic buffer to write into, only a C pointer that ctypes knows how to memmove() into, but it's not a real Python object.

So there is no way to reuse a buffer here as far as I can tell.

Like I said in the comment above, the main performance penalties here come from copying memory around multiple times, and there is no way around that without implementing your own ctypes-based zip library. At that point it makes more sense to just rewrite ZipROFS in C with libzip or libarchive and native FUSE.

qu1ck commented 1 year ago

In the near future we will start computations with 1TB to 2TB of data.

Sorry but what's the point of having such large files in uncompressed zip archives? How large is each file individually?

There is one thing that can be done to improve performance dramatically if you have RAM to spare: cache the file fully in memory, and then reads and seeks will be blazing fast. To avoid dynamic memory allocations, the buffer can be static, pre-allocated to the largest file you expect to have. This should work pretty well if your read pattern doesn't involve many file handles being open at the same time.
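
A rough sketch of that caching idea (for simplicity it allocates per open instead of reusing a static pre-allocated buffer; names are illustrative, not ZipROFS internals):

import zipfile

class CachedMember:
    """Read a zip member fully into memory once; serve reads as slices."""
    def __init__(self, zf: zipfile.ZipFile, name: str):
        with zf.open(name) as f:
            self._data = f.read()   # one sequential pass, no backward seeks

    def read(self, size: int, offset: int) -> bytes:
        return self._data[offset:offset + size]

# usage sketch:
# zf = zipfile.ZipFile("record.zip")
# member = CachedMember(zf, "big_uncompressed.data")
# chunk = member.read(128 * 1024, 0)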

christophgil commented 1 year ago

Thanks for your fast response!

#!/usr/bin/env bash

D=~/test/ziprofs_bug
M=~/test/ziprofs_mnt
mkdir -p $D $M
cd $D
echo 'hello world' > file.txt
zip file.zip file.txt
umount $M
python3 /local/filesystem/ZipROFS-dev/ziprofs.py $D $M -o cachesize=2048
tree $M
[[ -e $M/file.zip/file.txt ]] && echo "This should exist"
[[ -e $M/file.zip/file.txtXXXXXXXXXX ]] && echo "This should NOT exist"

######## OUTPUT #########
/home/x/test/ziprofs_mnt
├── file.txt
└── file.zip
    └── file.txt

1 directory, 2 files
This should exist
This should NOT exist

Any path like file.zip/file.txtXXXXXXXXXX is reported as a directory. I can even cd into it. Cheers C


qu1ck commented 1 year ago

Thanks, fixed in 4618612

qu1ck commented 1 year ago

I have some good news:

$ nocache dd if=benchmark/mnt/random.zip/random2.data of=/dev/null bs=1M status=progress
503316480 bytes (503 MB, 480 MiB) copied, 3 s, 168 MB/s
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 3.29076 s, 163 MB/s

After digging deep with ctypes and the like, trying to optimize memory movements, I discovered by accident that a lot of time was spent in LoggingMixIn calling repr() on byte buffers even when logging was not enabled. 9f6f0c0 fixes that; ZipROFS is now 5 times faster.
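
For context, LoggingMixIn computes repr() of the operation arguments and return value before handing them to the logger, so huge read buffers were being repr'd on every call; the fix amounts to skipping that work when debug logging is off. A sketch of the idea (not necessarily the code in the actual commit):

import logging
from fuse import LoggingMixIn

class LazyLoggingMixIn(LoggingMixIn):
    def __call__(self, op, path, *args):
        # avoid repr() of multi-megabyte read buffers when logging is off
        if not self.log.isEnabledFor(logging.DEBUG):
            return getattr(self, op)(path, *args)
        return super().__call__(op, path, *args)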

christophgil commented 1 year ago

I set it up, and it runs with our software when I use Python 3.11.1. It does not work with the latest Python from GitHub; I get an EOFError.

Cool, from the command line it is pretty fast, only 2 or 3 times slower than mount_zip or fuse-zip. Maybe this is attributable to the object creation in read(). I guess one could fix this with some C code.

I found the reason why our particular software is extremely slow. It is closed-source software running in Wine. For each data record read by the software, the ziprofs function getattr(self, path, fh=None) is called very often (here 409,000 times), always for the same non-existing path. Normally a data record is read within 30 s, but with ziprofs it takes several minutes. There is apparently an error in the software, which we cannot fix. The file system correctly reports the existence of files from bash.

Throwing that many exceptions is probably very expensive and accounts for the long runtime. The method dispatch in fuse.py's fuse_operations() is probably expensive too. So I probably need to detect the two non-existing file names early in fuse.py.

Where is the best place to tell early that a file does not exist? getattr? create? I would just check whether the path ends with this non-existing entry and then return something to indicate that the file does not exist.

By the way, I reported a wrong reference to self from a static function in fuse.py on GitHub. Is such a wrong reference detected by static code analyzers? Sorry, I am new to Python and more experienced with compiled rather than dynamic languages. I know R, which has some concepts in common with Python. Best C


qu1ck commented 1 year ago

Cool, from command line it is pretty fast. Only 2 or 3 times slower than mount_zip or zip-fuse. Maybe this is attributed to the object creation in read( ). I guess one can fix this with some C code.

I hacked together a way to use read1() directly for minimal buffer creation and then memmove() it straight into the FUSE read buffer without any additional allocations. The difference was unnoticeable, so it's probably not worth digging into sequential read speeds anymore.

Throwing that many exceptions is probably very expensive and accounts for the long runtime.

Yes, that is definitely a bottleneck if your app does a lot of such calls. I'll see if this can be improved but it will require more fusepy hacks.

Where is the best place to tell early that a file does not exist? getattr? create?

If you want to avoid exception throwing/catching overhead then in FUSE._wrapper()

b.t.w. I reported a wrong reference to self from a static function in fuse.py in github. Is such wrong reference detected by static code analyzers?

It is. Honestly, fusepy does not seem to be maintained: it still uses the old FUSE 2, and there are a bunch of stale pull requests on their GitHub.

christophgil commented 1 year ago

I committed my changes to my fork - if you like them, you may merge in my additions. They are marked with @CG.

I have not yet tracked down the error with the current GitHub version of Python.

I will try to get an improved version of the DLL that does the frequent accesses to non-existing files.

Best

C


christophgil commented 1 year ago

Hi,

Our software, which accesses the files inside the zip, has an option to use multiple threads. Interestingly, I can observe this in the read() function of ziprofs. With only one thread, the next offset is the previous offset plus the size, in which case the seek call may not be required since the file position already equals the offset. With many threads, however, the offsets seem irregular; only when I also print the thread id does it make sense. By playing with ZipROFS, one can learn quite a lot about file systems in general.

Again, thanks for this nice software. C


christophgil commented 1 year ago

It seems to be a threading problem. When I chose one thread in our software (instead of the previous 40; we have 128 CPU cores), it ran through.


EarBiteR commented 4 months ago

If I switch to a directory (in the ziprofs mount) with hundreds of zips, it takes a long time to show the contents; it seems to me that it is reading the zip structure of every file, or doing something similar. Should it only do that after the zip has been opened? I don't see a way to tune for this, so I'm asking. It seems similar to issue #2, but this is a local disk and not a network. It also happens in Nemo (the Cinnamon desktop file manager); I can sit back and watch it updating the file counts on the folders every few minutes. Once cached, it's fast!

Thanks :-)

qu1ck commented 4 months ago

If your shell does any sort of prefetching, like checking attributes (Nemo definitely does this), then ziprofs will have to read the zip files; it is expected to be slow with lots of files.

I suppose there could be an option to just trust the ".zip" extension and not check whether the file is actually a valid archive; it would speed up some things at the cost of false positives on bad files.
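
In rough Python terms, the trade-off looks like this (ziprofs' actual validation code may differ):

import zipfile

def treat_as_zip(path: str, trust_extension: bool) -> bool:
    if trust_extension:
        # cheap: no extra I/O while listing a directory of hundreds of zips,
        # but a corrupt or misnamed file only fails later, when it is opened
        return path.lower().endswith(".zip")
    # safer: actually open the file and verify the zip structure,
    # which is what makes browsing large folders of zips slow
    return path.lower().endswith(".zip") and zipfile.is_zipfile(path)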

EarBiteR commented 4 months ago

I looked to see if there was a way to disable prefetching for bash but didn't find one. I would be fine with trusting that a .zip is a zip. Since it's a read-only FS, there is no harm if it's a bad file.

qu1ck commented 4 months ago

Bash itself is not doing anything, but if you have any prompt-modifying extensions or something like that, that would be the place to look. Also, by "switch" do you mean that simply cd'ing to that dir is slow? Or is ls slow? Because the latter definitely reads the files.

qu1ck commented 4 months ago

I added a "nozipcheck" mount option; use that if you want to skip the full file read when possible.

EarBiteR commented 4 months ago

Using -o nozipcheck it's BLAZING FAST now when opening a folder of zips!!! It's instantaneous moving between directories in Zenity and Nemo.

THANK YOU so much!!

It's sad that this functionality is not built into Linux filesystems. But qu1ck, your work makes it possible!