JuniorJPDJ closed this issue 3 years ago
My code attempts to read only what is needed: https://github.com/openscopeproject/ZipROFS/blob/dev/ziprofs.py#L130
If that is not the behavior you are seeing, then it's an issue within the ZipFile lib.
Could you please check if the file is stored (put in the zip without compression) and use seek instead of reading that amount of data, if possible? It could be a big RAM killer when used on large files. Also, this re-reads the whole already-read part of the file even when it's not needed: e.g. if you read 1M of the file and then the next 2M chunk, it first loads the first 1M, then loads 1M + 2M just to cut off the first 1M and return the 2M. Kernel-level caching could possibly avoid re-downloading that part, but relying on it doesn't seem like a good idea either.
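The stored-vs-compressed idea can be sketched roughly like this (`read_range` is a hypothetical helper, not the actual ZipROFS code; whether `seek` is a real jump or an internal read-and-discard depends on the Python version):

```python
import zipfile

def read_range(zf: zipfile.ZipFile, name: str, offset: int, size: int) -> bytes:
    # Hypothetical helper, not the actual ZipROFS code.
    info = zf.getinfo(name)
    with zf.open(name) as f:
        if info.compress_type == zipfile.ZIP_STORED and f.seekable():
            # A stored member needs no decompression to reach `offset`
            # (though older zipfile versions still read to skip internally).
            f.seek(offset)
        else:
            # Compressed member: no choice but to decompress up to `offset`.
            f.read(offset)
        return f.read(size)
```

Either branch returns the same bytes; the difference is only how much work is done to reach `offset`.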
I added seek calls and better read performance (in terms of memory usage) when seek is not available, but ultimately a second level of caching is needed to keep opened ZipExtFile objects around and avoid repeated reads.
There's a problem with the seekability of files inside python's zipfile. Every ZipExtFile is seekable, but only with uncompressed files does seeking work like it should; I've already had bad adventures with it. Seek on compressed files is inaccurate: the actual in-file position differs from the reported one.
PS. Your new solution is much better, I personally like it :D There are just a few problems, but they're fixable ;) An additional level of caching plus a `cached_f.tell() == read_offset` check would be the next good step for special cases, but even now it's much better than it was :D
I don't depend on the reported position. Looking at the ZipExtFile seek implementation, it does the same thing I do with skipping chunks, so it will be reliable at least on the first go.
I just tried to write a test, which I'd already done a few days ago, and this time it worked.. I need to find the zip files I used when it failed, because this time it worked..
```shell
$ zip -r0 ../stored.zip .
$ zip -r9 ../compressed.zip .
$ mv ../stored.zip ../compressed.zip .
```
```python
import zipfile

seek = 50000
read = 16
fname = '16 - Z Ostatniej Chwili.flac'

czf = zipfile.ZipFile('compressed.zip', 'r')
szf = zipfile.ZipFile('stored.zip', 'r')

with czf.open(fname, 'r') as f:
    print(f.seek(seek))
    print(f.read(read))

with szf.open(fname, 'r') as f:
    print(f.seek(seek))
    print(f.read(read))

with open(fname, 'rb') as f:
    print(f.seek(seek))
    print(f.read(read))

czf.close()
szf.close()
```
And it worked as it should:
```
50000
b'\xa9w,\xb4R)\x16\x92E\x13\x89\xd7+^MR'
50000
b'\xa9w,\xb4R)\x16\x92E\x13\x89\xd7+^MR'
50000
b'\xa9w,\xb4R)\x16\x92E\x13\x89\xd7+^MR'
```
EDIT: I found those zips; one of them was corrupted..
Also, bad news for seeking: zipfile doesn't allow just seeking inside a stored file, it always reads up to the needed offset instead of just seeking.. Uh, I'll need to ask for a fix in the python stdlib..
Your solution is very good, but I'm still looking forward to second-level caching; it will make things much faster with the current zipfile implementation ;)
I added a second level of caching; reads should be faster now. I can't do much about zipfile not having a seek shortcut for non-compressed archives.
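The second-level cache idea can be sketched like this (`ExtFileCache` is a made-up name, not the actual ZipROFS code): keep one open ZipExtFile per member, so sequential reads continue where the last one stopped instead of reopening and re-decompressing from the start each time.

```python
import zipfile

class ExtFileCache:
    # Sketch of a second-level cache, not the actual ZipROFS implementation.
    def __init__(self, zf: zipfile.ZipFile):
        self.zf = zf
        self.cache = {}  # member name -> open ZipExtFile

    def read(self, name: str, size: int, offset: int) -> bytes:
        f = self.cache.get(name)
        if f is None:
            f = self.cache[name] = self.zf.open(name)
        if f.tell() != offset:   # the cached_f.tell() == read_offset check
            f.seek(offset)       # forward seeks skip by reading; backward seeks re-read
        return f.read(size)
```

For the common sequential-read pattern the `tell()` already matches the requested offset, so no skipping happens at all.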
I'll try to patch zipfile myself and push it upstream, then your code should be much faster ;D You did your fix, thank you very much.
There's still some weirdness. If I pipe a file from the zip through pv, it still shows half the speed it should: on archivemount it was 4 MB/s at 4 MB/s network usage, and here it's 2 MB/s at 4 MB/s network usage. It was the same without extfile caching. And there's one more issue, but I first need to debug it: VLC can't play audio without loading the whole file, whereas it works fine without the zip. When I pipe the file through pv to VLC, it plays fine.
EDIT: Video showing issue: https://pixeldrain.com/u/D7Vyxr1S
Also, for the ZipExtFile cache you could use filehandles and the open/release functions.
I still haven't figured out the speed issue, but I found why it takes a while to load a track in VLC:
```
DEBUG:ziprofs.fs:-> open /home/juniorjpdj/dev/tgmount/tgmount/dir/65 donGURALesko - Totem Lesnych Ludzi (2010).zip/donGURALesko - Totem Lesnych Ludzi (2010)/05. donGURALesko - Dzieci Kosmosu (prod. Matheo) feat. Dj Hen.flac (32770,)
DEBUG:ziprofs.fs:<- open 1
DEBUG:ziprofs.fs:-> read /home/juniorjpdj/dev/tgmount/tgmount/dir/65 donGURALesko - Totem Lesnych Ludzi (2010).zip/donGURALesko - Totem Lesnych Ludzi (2010)/05. donGURALesko - Dzieci Kosmosu (prod. Matheo) feat. Dj Hen.flac (4096, 20246528, 1)
DEBUG:ziprofs.fs:-> read /home/juniorjpdj/dev/tgmount/tgmount/dir/65 donGURALesko - Totem Lesnych Ludzi (2010).zip/donGURALesko - Totem Lesnych Ludzi (2010)/05. donGURALesko - Dzieci Kosmosu (prod. Matheo) feat. Dj Hen.flac (16384, 0, 1)
```
So it's an issue with seeking and VLC: it loads almost the whole file. MPlayer opens the file instantly.
I also removed your second-level caching and added filehandles for files. It should have the same functionality, but in a simpler way. I'll do a pull request when I'm sure it works.
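The filehandle approach can be sketched like this (`ZipHandles` is a hypothetical name, not the real ZipROFS code): `open()` hands out an integer fh bound to one ZipExtFile, `read()` reuses it, and `release()` closes it, mirroring FUSE's open/release lifecycle.

```python
import zipfile

class ZipHandles:
    # Sketch of the filehandle approach, not the actual ZipROFS code.
    def __init__(self, zf: zipfile.ZipFile):
        self.zf = zf
        self.handles = {}   # fh -> open ZipExtFile
        self.next_fh = 1

    def open(self, name: str) -> int:
        fh, self.next_fh = self.next_fh, self.next_fh + 1
        self.handles[fh] = self.zf.open(name)
        return fh

    def read(self, fh: int, size: int, offset: int) -> bytes:
        f = self.handles[fh]
        if f.tell() != offset:
            f.seek(offset)
        return f.read(size)

    def release(self, fh: int) -> None:
        self.handles.pop(fh).close()
```

Tying the cached ZipExtFile to the fh means its lifetime exactly matches the kernel's open/release calls, so there's no separate eviction policy to get wrong.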
I debugged the speed problem; now it works as it should (as far as I tested). Sometimes there were unanswered reads when the next read happened, which made ZipROFS seek while another read was in progress. A very bad race condition.. I didn't even know that fusepy handles concurrency.
Probably we should also block simultaneous access to the whole ZipFile instance, not only the ZipExtFile, but.. later ;)
EDIT: Done
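The race fix amounts to making the seek+read pair atomic per open file; a minimal sketch (`LockedRead` is a made-up name, not the actual ZipROFS code):

```python
import threading

class LockedRead:
    # Sketch of the race fix, not the actual ZipROFS code: FUSE may issue
    # concurrent read() calls, so the seek and the read on a shared
    # ZipExtFile must happen under one lock, otherwise one thread can
    # seek while another thread is mid-read.
    def __init__(self, extfile):
        self.f = extfile
        self.lock = threading.Lock()

    def read_at(self, size: int, offset: int) -> bytes:
        with self.lock:
            if self.f.tell() != offset:
                self.f.seek(offset)
            return self.f.read(size)
```

A per-ZipExtFile lock serializes readers of one member; serializing the whole ZipFile instance (as suggested above) would additionally protect the shared underlying archive file object.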
> Also, bad news for seeking: zipfile doesn't allow just seeking inside a stored file, it always reads up to the needed offset instead of just seeking.. Uh, I'll need to ask for a fix in the python stdlib.. I'll try to patch zipfile myself and push it upstream, then your code should be much faster ;D
I patched it: https://gist.github.com/JuniorJPDJ/aaaf134764348052021b4b9ec9046184. Posting it here in case you'd like to use it ;) I'll try to upstream it later.
It's similar to this one: https://github.com/cybernoid/archivemount/issues/15#issuecomment-682332707