openscopeproject / ZipROFS

FUSE file system with transparent access to zip files as if they were folders.
MIT License

It somehow reads whole file from ZIP even if only small part needed #1

Closed JuniorJPDJ closed 3 years ago

JuniorJPDJ commented 3 years ago

It's similar to this one: https://github.com/cybernoid/archivemount/issues/15#issuecomment-682332707

qu1ck commented 3 years ago

My code attempts to only read what is needed. https://github.com/openscopeproject/ZipROFS/blob/dev/ziprofs.py#L130

If that's not the behavior you are seeing, then it's an issue within the ZipFile lib.
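The chunk-skipping approach can be sketched roughly like this (a minimal standalone sketch, not the actual ziprofs.py code; `read_range` and `CHUNK` are hypothetical names):

```python
import zipfile

CHUNK = 64 * 1024  # skip granularity (hypothetical constant)

def read_range(zf: zipfile.ZipFile, name: str, offset: int, size: int) -> bytes:
    """Read `size` bytes starting at `offset` from a zip member, discarding
    the leading data in fixed-size chunks instead of buffering the whole
    prefix in memory at once."""
    with zf.open(name) as f:
        remaining = offset
        while remaining > 0:
            skipped = f.read(min(CHUNK, remaining))  # decompressed, then discarded
            if not skipped:
                break  # offset is past EOF
            remaining -= len(skipped)
        return f.read(size)
```

The peak extra memory is one chunk rather than the full prefix, but the prefix still has to be decompressed on every call.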

JuniorJPDJ commented 3 years ago

Could you please check whether the file is stored (put in the zip without compression) and use seek instead of reading that amount of data, if possible? Reading it all could be a big killer on RAM usage with large files. Also, this re-reads the whole already-read part of the file even when it isn't needed: e.g. if you read 1M of the file and then the next 2M chunk, it first loads the first 1M, then loads 1M + 2M just to cut off the first 1M and return the 2M. Kernel-level caching could possibly avoid re-reading that part, but relying on it doesn't seem like a good idea either.
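The wasteful pattern being described, reading `offset + size` bytes on every call and slicing off the prefix, looks roughly like this (a hypothetical illustration, not ZipROFS code):

```python
import zipfile

def naive_read(zf: zipfile.ZipFile, name: str, offset: int, size: int) -> bytes:
    """Reopen the member and decompress offset + size bytes on every call,
    then slice: reading 1M and then the next 2M decompresses 1M + 3M in
    total, and the whole prefix sits in memory before the slice."""
    with zf.open(name) as f:
        return f.read(offset + size)[offset:]
```

Each call costs O(offset + size) in both time and memory, which is what the chunked skipping and caching below are meant to avoid.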

qu1ck commented 3 years ago

I added seek calls, and better read performance (in terms of memory usage) when seek is not available, but ultimately a second level of caching is needed to keep opened ZipExtFile objects around and avoid repeated reads.

JuniorJPDJ commented 3 years ago

There's a problem with the seekability of files inside Python's zipfile. Every ZipExtFile reports itself as seekable, but only with uncompressed members does seeking work as it should; I've already had bad adventures with it. Seek on compressed files is inaccurate: the actual in-file position differs from the reported one.

JuniorJPDJ commented 3 years ago

PS. Your new solution is much better; I personally like it :D There are still a few problems, but they're fixable ;) An additional level of caching plus a `cached_f.tell() == read_offset` check would be the next good step for the special cases, but it's already much better than it was :D

qu1ck commented 3 years ago

I don't depend on the reported position. Looking at the ZipExtFile seek implementation, it does the same thing I do with skipping chunks, so it will be reliable at least on the first pass.

JuniorJPDJ commented 3 years ago

I just tried to write the test I'd already attempted a few days ago, and this time it worked. I need to find the zip files I used when it failed, because this time everything passed:

```sh
$ zip -r0 ../stored.zip .
$ zip -r9 ../compressed.zip .
$ mv ../stored.zip ../compressed.zip .
```

```python
import zipfile

seek = 50000
read = 16
fname = '16 - Z Ostatniej Chwili.flac'

czf = zipfile.ZipFile('compressed.zip', 'r')
szf = zipfile.ZipFile('stored.zip', 'r')

with czf.open(fname, 'r') as f:
    print(f.seek(seek))
    print(f.read(read))

with szf.open(fname, 'r') as f:
    print(f.seek(seek))
    print(f.read(read))

with open(fname, 'rb') as f:
    print(f.seek(seek))
    print(f.read(read))

czf.close()
szf.close()
```

And it worked as it should:

```
50000
b'\xa9w,\xb4R)\x16\x92E\x13\x89\xd7+^MR'
50000
b'\xa9w,\xb4R)\x16\x92E\x13\x89\xd7+^MR'
50000
b'\xa9w,\xb4R)\x16\x92E\x13\x89\xd7+^MR'
```

EDIT: I found those zips; one of them was corrupted..

JuniorJPDJ commented 3 years ago

Also, bad news for seeking: zipfile doesn't allow just seeking inside a stored file, it always reads up to the needed offset instead of simply seeking.. Uh, I'll need to ask for a fix in the Python stdlib.. Your solution is very good, but I'm still looking forward to the second level of caching; it will make things much faster with the current zipfile implementation ;)

qu1ck commented 3 years ago

I added a second level of caching; reads should be faster now. I can't do much about zipfile not having a seek shortcut for non-compressed archives.
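A second-level cache of the kind discussed here can be sketched as follows (a standalone sketch under my own assumptions, not the actual ZipROFS code; it includes the `tell() == offset` reuse check suggested earlier):

```python
import zipfile
from typing import Dict

class ExtFileCache:
    """Keep one open ZipExtFile per member and reuse it when the next read
    continues exactly where the previous one ended; otherwise reopen and
    seek. Hypothetical sketch; all names are made up for illustration."""

    def __init__(self, zf: zipfile.ZipFile):
        self.zf = zf
        self.open_files: Dict[str, zipfile.ZipExtFile] = {}

    def read(self, name: str, offset: int, size: int) -> bytes:
        f = self.open_files.get(name)
        if f is None or f.tell() != offset:
            if f is not None:
                f.close()
            f = self.zf.open(name)
            f.seek(offset)  # ZipExtFile.seek decompresses forward as needed
            self.open_files[name] = f
        return f.read(size)
```

Sequential reads (the common streaming case) then hit the cached handle and never re-decompress the prefix; only a backward or random seek pays the reopen cost.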

JuniorJPDJ commented 3 years ago

I'll try to patch zipfile myself and push it upstream; then your code should be much faster ;D You did your fix, thank you very much.

JuniorJPDJ commented 3 years ago

There's still some weirdness. If I pipe a file from the zip through pv, it still shows half the speed it should: on archivemount it was 4 MB/s at 4 MB/s network usage, and here it's 2 MB/s at 4 MB/s network usage. It was the same without the ZipExtFile caching. And there's one more issue, but I first need to debug it: VLC can't play audio without loading the whole file, while the same file plays fine outside the zip. When I pipe the file through pv to VLC, it plays fine.

EDIT: Video showing issue: https://pixeldrain.com/u/D7Vyxr1S

JuniorJPDJ commented 3 years ago

Also, for the ZipExtFile cache you could use file handles and the open/release functions.
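The suggestion can be sketched as a handle table independent of fusepy (the actual FUSE wiring via the `Operations` open/read/release callbacks is assumed, not shown; all names here are hypothetical):

```python
import itertools
import zipfile

class HandleTable:
    """Per-open-file state keyed by a FUSE-style file handle (fh):
    open() returns an integer handle, read() looks it up, release()
    drops it. Sketch only; fusepy integration is assumed."""

    def __init__(self):
        self._next_fh = itertools.count(1)
        self._files = {}

    def open(self, zf: zipfile.ZipFile, name: str) -> int:
        fh = next(self._next_fh)
        self._files[fh] = zf.open(name)
        return fh

    def read(self, fh: int, size: int, offset: int) -> bytes:
        f = self._files[fh]
        if f.tell() != offset:  # reuse current position for sequential reads
            f.seek(offset)
        return f.read(size)

    def release(self, fh: int) -> None:
        self._files.pop(fh).close()
```

Tying the cached ZipExtFile to the kernel-supplied fh means its lifetime matches the open/release pair exactly, so no separate eviction policy is needed.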

JuniorJPDJ commented 3 years ago

I still haven't figured out the issue with speed, but I found why it takes a while to load a track in VLC:

```
DEBUG:ziprofs.fs:-> open /home/juniorjpdj/dev/tgmount/tgmount/dir/65 donGURALesko - Totem Lesnych Ludzi (2010).zip/donGURALesko - Totem Lesnych Ludzi (2010)/05. donGURALesko - Dzieci Kosmosu (prod. Matheo) feat. Dj Hen.flac (32770,)
DEBUG:ziprofs.fs:<- open 1
DEBUG:ziprofs.fs:-> read /home/juniorjpdj/dev/tgmount/tgmount/dir/65 donGURALesko - Totem Lesnych Ludzi (2010).zip/donGURALesko - Totem Lesnych Ludzi (2010)/05. donGURALesko - Dzieci Kosmosu (prod. Matheo) feat. Dj Hen.flac (4096, 20246528, 1)
DEBUG:ziprofs.fs:-> read /home/juniorjpdj/dev/tgmount/tgmount/dir/65 donGURALesko - Totem Lesnych Ludzi (2010).zip/donGURALesko - Totem Lesnych Ludzi (2010)/05. donGURALesko - Dzieci Kosmosu (prod. Matheo) feat. Dj Hen.flac (16384, 0, 1)
```

So it's an issue with seeking and VLC: it loads almost the whole file. MPlayer opens the file instantly.

I also removed your second-level caching and added file handles for files. It should have the same functionality but in a simpler way. I'll open a pull request once I'm sure it works.

JuniorJPDJ commented 3 years ago

I debugged the problem with the doubled network usage; now it works as it should (as far as I tested). Sometimes a read was still unanswered when the next read arrived, which made ZipROFS seek while the other read was in progress. A very bad race condition.. I didn't even know that fusepy handles requests concurrently.
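The race can be avoided by making the seek-plus-read pair atomic per open member; fusepy dispatches requests on multiple threads, so without a lock two reads can interleave between `seek()` and `read()` and return bytes from the wrong offset. A minimal sketch (hypothetical names, not the actual fix):

```python
import threading
import zipfile

class LockedMember:
    """Serialize seek+read on a single ZipExtFile so concurrent FUSE read
    callbacks cannot interleave between positioning and reading. Sketch
    under my own assumptions, not ZipROFS code."""

    def __init__(self, zf: zipfile.ZipFile, name: str):
        self._f = zf.open(name)
        self._lock = threading.Lock()

    def read(self, size: int, offset: int) -> bytes:
        with self._lock:  # seek and read must happen as one atomic step
            if self._f.tell() != offset:
                self._f.seek(offset)
            return self._f.read(size)
```

A per-member lock keeps independent files concurrent; as noted below, access to the shared ZipFile instance itself may need its own lock as well.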

JuniorJPDJ commented 3 years ago

Probably we should also block simultaneous access to the whole ZipFile instance, not only the ZipExtFile, but.. later ;) EDIT: Done

JuniorJPDJ commented 3 years ago

> Also, bad news for seeking: zipfile doesn't allow just seeking inside a stored file, it always reads up to the needed offset instead of simply seeking.. Uh, I'll need to ask for a fix in the Python stdlib..

> I'll try to patch zipfile myself and push it upstream; then your code should be much faster ;D

https://gist.github.com/JuniorJPDJ/aaaf134764348052021b4b9ec9046184 I patched it; posting it here in case you'd like to use it ;) I'll try to upstream it later.