mxmlnkn / ratarmount

Access large archives as a filesystem efficiently, e.g., TAR, RAR, ZIP, GZ, BZ2, XZ, ZSTD archives
MIT License

Error reading from Python #114

Closed peteruhrig closed 1 year ago

peteruhrig commented 1 year ago

I'm using the Python sample code from the ratarmountcore page. The index that it's meant to read was created by the command-line version of ratarmount without any special parameters, using ratarmount /path/to/file.tar.bz2 test. On the first try it warned me that the recursive parameter differed, so I changed it from True to False in the Python code. But now I get this message:

python test_ratarmountcore.py
[Warning] The arguments used for creating the found index differ from the arguments
[Warning] given for mounting the archive now. In order to apply these changes,
[Warning] recreate the index using the --recreate-index option!
[Warning] gzipSeekPointSpacing: index: 16777216, current: 4194304
Contents of /:
Traceback (most recent call last):
  File "/home/hpc/xxxxx/test_ratarmountcore.py", line 8, in <module>
    with archive.open(info) as file:
  File "/home/hpc/xxxxx/ratarmountvenv/lib/python3.9/site-packages/ratarmountcore/SQLiteIndexedTar.py", line 1087, in open
    assert fileInfo.userdata
AttributeError: 'NoneType' object has no attribute 'userdata'

What I find particularly strange about this is that the error message says something about gzip, but the archive is using bzip2. Any pointers are welcome.

mxmlnkn commented 1 year ago

First off, these are only warnings, so they can be ignored, especially if you know that the differing option, like the gzip seek point spacing, is irrelevant. The problem is that the default for the command-line interface is 16 MiB while the default argument for ratarmountcore is 4 MiB. I see two fixes that should be implemented:
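To spell out the mismatch: the warning compares the CLI default spacing against the library default, so passing the CLI value explicitly makes them agree. A minimal sketch, assuming ratarmountcore is installed (the archive path is a placeholder):

```python
CLI_SEEK_POINT_SPACING = 16 * 1024 * 1024  # 16777216 B, the CLI default
LIB_SEEK_POINT_SPACING = 4 * 1024 * 1024   # 4194304 B, ratarmountcore's default

def open_like_cli(path):
    """Open an archive with the same gzip seek point spacing the CLI uses,
    so a CLI-created index is accepted without the warning."""
    import ratarmountcore as rmc  # assumes the ratarmountcore package is installed
    return rmc.open(path, recursive=False,
                    gzipSeekPointSpacing=CLI_SEEK_POINT_SPACING)
```

The two constants match the numbers printed in the warning above (index: 16777216, current: 4194304).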

peteruhrig commented 1 year ago

Thanks! The message goes away when I set gzipSeekPointSpacing=16777216 in the call to rmc.open. I'm still not sure it is using the existing index, because it takes nearly 30 seconds to simply display the one directory entry in "/". Should I open a new issue for that?

mxmlnkn commented 1 year ago

Does the assertion error (assert fileInfo.userdata) also disappear? I overlooked it in my initial reply. That looks more problematic.

As to whether the index has been loaded, you can increase the verbosity with something like -d 3 or SQLiteIndexedTar(..., printDebug=3). If you are using the SQLiteIndexedTar class directly, you might have to specify the index location via the indexFilePath or indexFolders parameter. Are you doing that? When calling ratarmountcore.open, which is an alias for ratarmountcore.factory.openMountSource, the options are forwarded all the way to ratarmountcore.SQLiteIndexedTar.SQLiteIndexedTar.__init__, which has some documentation for each option. The writeIndex option might also have to be set to True. I guess some better defaults are in order for the use-as-a-library case; I have mostly concentrated on the CLI until now. Does specifying these options solve your issue?
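For reference, a sketch of constructing SQLiteIndexedTar directly with the options mentioned above. The parameter names follow this thread; the archive path and index location are placeholders:

```python
def open_with_explicit_index(archive_path, index_path):
    """Sketch: pass the index location and writing behavior explicitly
    instead of relying on the library defaults."""
    from ratarmountcore import SQLiteIndexedTar  # assumes the package is installed
    return SQLiteIndexedTar(
        archive_path,
        indexFilePath=index_path,  # where to look for / store the index
        writeIndex=True,           # persist the index for subsequent runs
        printDebug=3,              # verbose output reports whether the index loaded
    )
```

With printDebug=3, a message like "Successfully loaded offset dictionary" confirms the existing index was picked up rather than rebuilt.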

peteruhrig commented 1 year ago

The assertion error was due to me not understanding the code correctly and trying to open a directory as a file...

The index is loaded, but it still takes very long. The code I'm using is very simple:

import ratarmountcore as rmc

archive = rmc.open("/path/to/file.tar.bz2", recursive=False, gzipSeekPointSpacing=16777216, printDebug = 3)
print(archive.listDir("/"))

(BTW: The sample code contains the line archive.listDir("/"), which does not print anything; that also took me some time to figure out.) The script runs for roughly 30 seconds:

(ratarmountvenv) user@woody4:~$ time python test_ratarmountcore.py
[Info] Detected compression bz2 for file object: <_io.BufferedReader name='/path/to/file.tar.bz2'>
Successfully loaded offset dictionary from/path/to/file.tar.bz2.index.sqlite
{'samplefolder': FileInfo(size=0, mtime=1649024542.0, mode=16877, linkname='', uid=7777, gid=8888, userdata=[SQLiteIndexedTarUserData(offset=512, offsetheader=0, istar=0, issparse=0)])}

real    0m29.074s
user    0m28.789s
sys     0m0.057s

So it definitely finds the SQLite database. Specifying writeIndex=True does not change anything; it still takes roughly 30 seconds, even when run multiple times in a row. Next step: delete the index and run the script again:

(ratarmountvenv) user@woody4:~$ time python test_ratarmountcore.py
[Info] Detected compression bz2 for file object: <_io.BufferedReader name='/path/to/file.tar.bz2'>
Creating new SQLite index database at /path/to/file.tar.bz2.index.sqlite
Creating offset dictionary for /path/to/file.tar.bz2 ...
[Info] Could not create a progress bar because the file size could not be queried.
Resorting files by path ...
Creating offset dictionary for /path/to/file.tar.bz2 took 34.51s
[Info] Could not load bz2 block offset data. Will create it from scratch.
no such table: bzip2blocks
Traceback (most recent call last):
  File "/home/hpc/b105dc/b105dc11/ratarmountvenv/lib/python3.9/site-packages/ratarmountcore/SQLiteIndex.py", line 866, in synchronizeCompressionOffsets
    offsets = dict(db.execute(f"SELECT blockoffset,dataoffset FROM {table_name};"))
sqlite3.OperationalError: no such table: bzip2blocks
Writing out TAR index to /path/to/file.tar.bz2.index.sqlite took 0s and is sized 41353216 B
{'samplefolder': FileInfo(size=0, mtime=1649024542.0, mode=16877, linkname='', uid=7777, gid=8888, userdata=[SQLiteIndexedTarUserData(offset=512, offsetheader=0, istar=0, issparse=0)])}

real    1m2.593s
user    1m1.886s
sys     0m0.335s

Running it again after this gets us back to the 30 seconds.

mxmlnkn commented 1 year ago

Could you please try adding the isGnuIncremental=False argument?

import ratarmountcore as rmc

archive = rmc.open("/path/to/file.tar.bz2", recursive=False, gzipSeekPointSpacing=16777216, printDebug = 3, isGnuIncremental=False)
print(archive.listDir("/"))

If it works, then again, it is a matter of unsuitable defaults.

peteruhrig commented 1 year ago

Yes, that does the trick:

real    0m0.413s
user    0m0.142s
sys     0m0.040s

Thanks for the help! But yes, this is sort of unexpected - I would never have suspected this argument.

mxmlnkn commented 1 year ago

It makes sense once you know about it. By default it is undetermined (None), and the only way currently known to me to check for such an incremental TAR archive is to look for special file members, which can take a while because the TAR archive has to be analyzed for that. To avoid a second pass over the file, the heuristic is limited to roughly the first 1000 files. Then again, if the index has already been created, this check shouldn't matter in the first place. So that's yet another bug to be fixed: