Closed peteruhrig closed 1 year ago
First off, these are only warnings, so they can be ignored, especially if you know that the difference, such as the gzip seek point spacing, is irrelevant. The problem is that the default for the command line interface is 16 MiB while the default argument for ratarmountcore is 4 MiB. I see two fixes that should be implemented:
Thanks! The message goes away when I set gzipSeekPointSpacing=16777216 in the call to rmc.open.
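The mismatch described above (16 MiB for the CLI versus 4 MiB for the library) can be illustrated with a small, self-contained sketch. The helper find_mismatches and the two constant names are hypothetical and are not part of ratarmountcore; the default values are taken from this thread rather than read from the source code.

```python
MiB = 1024 * 1024

# Defaults as described in this thread; treat them as assumptions,
# not as values read from the ratarmount source.
CLI_DEFAULT_SPACING = 16 * MiB
LIBRARY_DEFAULT_SPACING = 4 * MiB


def find_mismatches(stored, requested):
    """Return {name: (stored_value, requested_value)} for keys whose values differ."""
    return {
        key: (stored[key], requested[key])
        for key in stored.keys() & requested.keys()
        if stored[key] != requested[key]
    }


# Arguments the index was created with (CLI) versus those passed from Python.
index_args = {"recursive": False, "gzipSeekPointSpacing": CLI_DEFAULT_SPACING}
library_args = {"recursive": False, "gzipSeekPointSpacing": LIBRARY_DEFAULT_SPACING}

print(find_mismatches(index_args, library_args))
# → {'gzipSeekPointSpacing': (16777216, 4194304)}
```

This also shows where the value 16777216 comes from: it is simply 16 MiB expressed in bytes.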
I'm still not sure it is using the existing index, because it takes nearly 30 seconds to simply display the one directory entry in "/". Should I open a new issue for that?
Is the assertion error (assert fileInfo.userdata) also disappearing? I overlooked that one for my initial reply. That looks more problematic.
As to whether the index has been loaded or not, you can increase the verbosity with something like -d 3 or SQLiteIndexedTar( ..., printDebug = 3). If you are using the SQLiteIndexedTar class directly, you might have to specify the index location via the indexFilePath or indexFolders parameter. Are you doing that? When calling ratarmountcore.open, which is an alias for ratarmountcore.factory.openMountSource, the options are forwarded all the way to ratarmountcore.SQLiteIndexedTar.SQLiteIndexedTar.__init__, which has some documentation for the options. The writeIndex option also might have to be set to true. I guess some better defaults might be in order for the use-as-a-library use case. I mostly concentrated on the CLI until now. Does specifying these options solve your issue?
The assertion error was due to me not understanding the code correctly and trying to open a directory as a file...
The index is loaded, but it still takes very long. The code I'm using is very simple:
import ratarmountcore as rmc
archive = rmc.open("/path/to/file.tar.bz2", recursive=False, gzipSeekPointSpacing=16777216, printDebug = 3)
print(archive.listDir("/"))
(BTW: The sample code contains the line archive.listDir("/"), which does not print anything. This also took me some time to figure out.)
The script runs for roughly 30 seconds:
(ratarmountvenv) user@woody4:~$ time python test_ratarmountcore.py
[Info] Detected compression bz2 for file object: <_io.BufferedReader name='/path/to/file.tar.bz2'>
Successfully loaded offset dictionary from/path/to/file.tar.bz2.index.sqlite
{'samplefolder': FileInfo(size=0, mtime=1649024542.0, mode=16877, linkname='', uid=7777, gid=8888, userdata=[SQLiteIndexedTarUserData(offset=512, offsetheader=0, istar=0, issparse=0)])}
real 0m29.074s
user 0m28.789s
sys 0m0.057s
So it definitely finds the SQLite database. Specifying writeIndex = True does not change anything. It still takes roughly 30 seconds, even when run multiple times in a row.
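To narrow down where the 30 seconds go, the two steps of the script can be timed separately from inside Python instead of timing the whole process. The following is only a sketch: time_call is a hypothetical helper, and the ratarmountcore calls are left as comments so that the block runs without the library installed.

```python
import time


def time_call(label, func, *args, **kwargs):
    """Call func(*args, **kwargs), print the elapsed wall-clock time, return the result."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result


# Stand-in workload so the sketch runs on its own.
total = time_call("sum", sum, range(1_000_000))

# With ratarmountcore available, opening and listing could be timed separately:
# import ratarmountcore as rmc
# archive = time_call("open", rmc.open, "/path/to/file.tar.bz2", recursive=False)
# entries = time_call("listDir", archive.listDir, "/")
```

If nearly all of the time is reported for the open step, the delay lies in opening the archive rather than in the directory lookup.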
Next step: delete the index and run the script again:
(ratarmountvenv) user@woody4:~$ time python test_ratarmountcore.py
[Info] Detected compression bz2 for file object: <_io.BufferedReader name='/path/to/file.tar.bz2'>
Creating new SQLite index database at /path/to/file.tar.bz2.index.sqlite
Creating offset dictionary for /path/to/file.tar.bz2 ...
[Info] Could not create a progress bar because the file size could not be queried.
Resorting files by path ...
Creating offset dictionary for /path/to/file.tar.bz2 took 34.51s
[Info] Could not load bz2 block offset data. Will create it from scratch.
no such table: bzip2blocks
Traceback (most recent call last):
File "/home/hpc/b105dc/b105dc11/ratarmountvenv/lib/python3.9/site-packages/ratarmountcore/SQLiteIndex.py", line 866, in synchronizeCompressionOffsets
offsets = dict(db.execute(f"SELECT blockoffset,dataoffset FROM {table_name};"))
sqlite3.OperationalError: no such table: bzip2blocks
Writing out TAR index to /path/to/file.tar.bz2.index.sqlite took 0s and is sized 41353216 B
{'samplefolder': FileInfo(size=0, mtime=1649024542.0, mode=16877, linkname='', uid=7777, gid=8888, userdata=[SQLiteIndexedTarUserData(offset=512, offsetheader=0, istar=0, issparse=0)])}
real 1m2.593s
user 1m1.886s
sys 0m0.335s
Running it again after this gets us back to the 30 seconds.
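The "no such table: bzip2blocks" message above can be checked independently of ratarmount, because the index is an ordinary SQLite file. The sketch below lists the tables in such a file using only the standard library. The function names list_tables and has_table are hypothetical, and nothing here assumes a particular index schema beyond the table name that appears in the traceback.

```python
import pathlib
import sqlite3


def list_tables(index_path):
    """Return the names of all tables in the SQLite file, sorted alphabetically."""
    # Open read-only so that a mistyped path does not create an empty database.
    uri = pathlib.Path(index_path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


def has_table(index_path, table_name):
    """Return True if the SQLite file contains a table with the given name."""
    return table_name in list_tables(index_path)


# Example usage with the index path from the log output above:
# print(has_table("/path/to/file.tar.bz2.index.sqlite", "bzip2blocks"))
```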
Could you please try adding the isGnuIncremental=False argument?
import ratarmountcore as rmc
archive = rmc.open("/path/to/file.tar.bz2", recursive=False, gzipSeekPointSpacing=16777216, printDebug = 3, isGnuIncremental=False)
print(archive.listDir("/"))
If it works, then again, it is a matter of unsuitable defaults.
Reduce nMaxToTry inside SQLiteIndexedTar._isGnuIncremental even further for (bz2) compressed files. I think I might have fine-tuned the limit only for gzip, although in my case it "only" takes 3s. I'm not sure why it is so much slower on your system, maybe fewer cores, or maybe a different problem altogether. It is already "only" 1000 for compressed files...
Make isGnuIncremental = False the default? Seems like a sufficiently rarely used feature, although I know of at least one user for whom I added this.
[...] --incremental for listing, so I think it is fine to change the default to False.
Yes, that does the trick:
real 0m0.413s
user 0m0.142s
sys 0m0.040s
Thanks for the help! But yes, this is sort of unexpected - I would never have suspected this argument.
It makes sense when you know about it. By default it is undetermined (None), and the only way currently known to me to check for such an incremental TAR archive is by looking for special file members, which can take a while because it has to analyze the TAR archive for that. In order to avoid a second run over the file, it limits this heuristic to ~1k files. Then again, if the index has been created, this check shouldn't matter in the first place. So that's yet another bug to be fixed:
[...] --incremental or --listed-incremental=/dev/null. For updating, a .snar file is needed and has to be passed to --listed-incremental=, but even for extracting or listing, either --incremental or --listed-incremental has to be specified. That probably means there is no automatic detection for incremental files!
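The bounded heuristic described above (scan at most about 1k members and look for a special member) can be sketched with the standard tarfile module. This is not ratarmount's implementation. It assumes that the special members are GNU "dumpdir" entries, which GNU tar writes with the type flag 'D'. The names looks_gnu_incremental, build_archive and max_members are hypothetical.

```python
import io
import tarfile

GNU_DUMPDIR_TYPE = b"D"  # GNU tar type flag for incremental "dumpdir" entries


def looks_gnu_incremental(file_object, max_members=1000):
    """Return True if a dumpdir entry appears among the first max_members members."""
    with tarfile.open(fileobj=file_object, mode="r:*") as archive:
        for index, member in enumerate(archive):
            if member.type == GNU_DUMPDIR_TYPE:
                return True
            # Stop early so that large archives are not scanned completely.
            if index + 1 >= max_members:
                break
    return False


def build_archive(member_types):
    """Create a small in-memory TAR with one member per given type flag (demo only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for number, member_type in enumerate(member_types):
            payload = b"example\n"
            info = tarfile.TarInfo(name=f"entry{number}")
            info.type = member_type
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


print(looks_gnu_incremental(build_archive([tarfile.REGTYPE, GNU_DUMPDIR_TYPE])))
print(looks_gnu_incremental(build_archive([tarfile.REGTYPE, tarfile.REGTYPE])))
```

For a compressed archive, every member that is inspected has to be decompressed first, which is why the cost of such a scan grows with the limit and why passing isGnuIncremental=False avoids it entirely.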
I'm using the Python sample code from the ratarmountcore page. The index that it's meant to read was created by the command line version of ratarmount without any special parameters, using ratarmount /path/to/file.tar.bz2 test. The first try told me that the parameters are different in terms of recursive, so I changed that from True to False in the Python code. But now I get this message:
What I find particularly strange about this is that the error message says something about gzip, but the archive is using bzip2. Any pointers are welcome.