mxmlnkn / ratarmount

Access large archives as a filesystem efficiently, e.g., TAR, RAR, ZIP, GZ, BZ2, XZ, ZSTD archives
MIT License
697 stars 36 forks

SQLite database is created in memory #85

Open · Vadiml1024 opened this issue 2 years ago

Vadiml1024 commented 2 years ago

When using the -r option, I'm getting log messages about SQLite databases created in :memory:, even when --index-folder is specified. Is that normal?

Btw, there is a separate DB for each .tar file. Wouldn't it be more efficient to have only one DB?

mxmlnkn commented 2 years ago

Yeah, this is normal, at least for now. The problem is under which name to store the data for recursive archives. I think this issue might be a duplicate of #79.

Vadiml1024 commented 2 years ago

What about storing all .tar indices in single DB?


mxmlnkn commented 2 years ago

What about storing all .tar indices in single DB?

Regarding the additional databases created for recursive archives, I think I already answered your question above.

Do you mean when using the union mounting feature like so: ratarmount file1.tar file2.tar mountpoint? In this case, I think it is better to have one DB per archive in order to increase reusability when, e.g., trying to mount only file1.tar or when trying to add another archive to the union mount: ratarmount file1.tar file2.tar file3.tar mountpoint.

What is your use case?

Vadiml1024 commented 2 years ago

Maybe I expressed myself incorrectly. Currently, ratarmount creates a separate SQLite database for each .tar file. I thought it might be more efficient to have ONE database that contains the data for all archives simultaneously. Of course, this would require significant modifications to the existing codebase, but nothing too complicated, I think. The idea would be to assign a virtual_inode_number to each archive and include it as a key field in all tables of this unified DB. The advantage of this approach is that it could easily be adapted to other SQL-based databases, which is useful when mounting directories with a LOT of really BIG archives.

I'm talking about disks with several TB of data and archives of hundreds of GB with more than 100K files inside. This is actually my use case. Thanks to your advice, I've implemented a kind of hybrid between guestmount and ratarmount: I use libguestfs to mount .iso, .img, .ova, and .vmdk files, then I create a temp dir containing mount points (with the help of mount --bind) for those files, and then I launch ratarmount -r -l to mount this temp dir. Given that the disk images contain big archives with archives inside, and that ratarmount uses :memory: to index archives inside archives, the memory consumption is pretty impressive, hence my ideas on reorganizing the DB.
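To make the idea concrete, here is a minimal sqlite3 sketch of what I have in mind; the table and column names are invented for illustration and are not ratarmount's actual schema:

```python
import sqlite3

# Minimal sketch of a single shared index DB (illustrative schema only,
# not ratarmount's actual one). Every archive gets a virtual inode number,
# and every file row is keyed by (archive_id, path).
db = sqlite3.connect("unified-index.sqlite")
db.executescript(
    """
    CREATE TABLE IF NOT EXISTS archives (
        archive_id INTEGER PRIMARY KEY,   -- the "virtual inode number"
        path       TEXT UNIQUE            -- path of the archive (outer or nested)
    );
    CREATE TABLE IF NOT EXISTS files (
        archive_id INTEGER REFERENCES archives(archive_id),
        path       TEXT,                  -- path of the member inside its archive
        offset     INTEGER,               -- byte offset of the member's data
        size       INTEGER,
        PRIMARY KEY (archive_id, path)
    );
    """
)

def archive_id(archive_path: str) -> int:
    """Return the virtual inode number for an archive, creating it if needed."""
    db.execute("INSERT OR IGNORE INTO archives (path) VALUES (?)", (archive_path,))
    row = db.execute(
        "SELECT archive_id FROM archives WHERE path = ?", (archive_path,)
    ).fetchone()
    return row[0]

# Example: register a nested archive and one of its members.
aid = archive_id("outer.tar/inner.tar.gz")
db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
           (aid, "docs/readme.txt", 512, 1024))
db.commit()
```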

mxmlnkn commented 2 years ago

Ah, I see. I think this really is closely related to #79 then, but it goes one step further and would also combine data from "sibling" archives (those in the same bind-mounted folder), not just "descendants" (recursively nested archives).

But maybe your problem will also disappear when using the new --recursion-depth 1 argument from #84. Of course, this is only an option if you don't want to mount recursively deeper than that. If it still creates indexes in memory, that might be because it can't find a suitable writable location, and --index-folders might help. I noticed that you already tried that out... Now I kinda understand your problem.

When using the -r option, I'm getting log messages about SQLite databases created in :memory:, even when --index-folder is specified. Is that normal?

Could you paste one of those warnings? I'm beginning to doubt that it is normal. Also, what is the compression chain? It should only try to use an in-memory database in circumstances like mounting a compressed tar that is inside another archive.

Vadiml1024 commented 2 years ago

Could you paste one of those warnings? I'm beginning to doubt that it is normal. Also, what is the compression chain? It should only try to use an in-memory database in circumstances like mounting a compressed tar that is inside another archive.

That is precisely my case... So it seems to be expected behavior.

mxmlnkn commented 2 years ago

Could you paste one of those warnings? I'm beginning to doubt that it is normal. Also, what is the compression chain? It should only try to use an in-memory database in circumstances like mounting a compressed tar that is inside another archive.

That is precisely my case... So it seems to be expected behavior.

Unfortunately, yes. I'll try to fix it, but it might take a while :/. PRs are welcome ...

My basic idea to fix this is outlined in #79. The downwards-compatible version would simply add tables for each recursively contained TAR. Simply adding the entries to the existing table won't work because there would not be enough information: the table basically just stores names and offsets. It would additionally need to save something like the path to the recursive archive, but that seems like a waste of space and might also be harder to implement, because, as it is implemented now, multiple SQLiteIndexedTar instances are created, one for each recursive TAR, and they basically don't know of each other.
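Roughly, the downwards-compatible variant could look like this (a sketch only; the table-naming scheme is invented for illustration and is not the actual index layout):

```python
import hashlib
import sqlite3

# Sketch only: one "files_<...>" table per recursively contained TAR inside
# the same index file. The naming scheme below is made up for illustration.
db = sqlite3.connect("outer.tar.index.sqlite")

def table_for_nested_archive(inner_path: str) -> str:
    # SQLite table names cannot contain arbitrary characters, so hash the path.
    return "files_" + hashlib.sha1(inner_path.encode()).hexdigest()[:16]

table = table_for_nested_archive("folder/inner.tar.gz")
db.execute(f"CREATE TABLE IF NOT EXISTS {table} "
           "(path TEXT PRIMARY KEY, offset INTEGER, size INTEGER)")
db.execute(f"INSERT OR REPLACE INTO {table} VALUES (?, ?, ?)",
           ("docs/readme.txt", 512, 1024))
db.commit()
```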

Well, instead of this large amount of work, it might be simpler to support writing indexes of in-memory file objects out to the index folder. The only problem is how to generate a stable name. E.g., using a hash of the whole archive would make a good name, but it would be too costly to calculate. A hash over the metadata might work though, as that data has to be read anyway and should be orders of magnitude smaller than the file contents. And the file contents don't matter anyway.

But, in order to speed up loading with the indexes, I wouldn't be able to check all the metadata, only, say, the first 1000 entries. I'm already doing something similar to detect TARs that have been appended to. It would still only be a heuristic, nothing 100% stable :/.
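As a sketch, the metadata hash could be computed roughly like this, using Python's tarfile module; the 1000-entry cutoff and the exact fields hashed are only illustrative:

```python
import hashlib
import tarfile

def index_name_from_metadata(fileobj, max_members: int = 1000) -> str:
    """Derive a (heuristic) stable index file name from the archive's metadata.

    Only the first `max_members` headers are hashed, so appended-to or
    truncated archives may collide; this mirrors the heuristic nature
    described above. `fileobj` can be an in-memory file object (e.g. BytesIO).
    """
    hasher = hashlib.sha256()
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:  # streaming mode
        for i, member in enumerate(tar):
            if i >= max_members:
                break
            hasher.update(member.name.encode())
            hasher.update(member.size.to_bytes(8, "little"))
            hasher.update(int(member.mtime).to_bytes(8, "little"))
    return f"recursive-{hasher.hexdigest()[:32]}.index.sqlite"
```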

Currently, if the index cannot be placed directly beside the archive, it is placed in the home folder with a kind of cleaned-up path to the archive as its name. I might simply store the indexes for those recursive TARs with their inner path appended to the path of the outer TAR. That should be unique enough. If the path becomes too long to use as a file name, I could simply hash it.
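For example (sketch only; the length cutoff and the naming scheme are invented for illustration):

```python
import hashlib
import os

def recursive_index_name(outer_archive: str, inner_path: str, max_len: int = 200) -> str:
    """Sketch: name the nested archive's index after <outer path>/<inner path>,
    falling back to a hash when the combined name would be too long."""
    combined = f"{outer_archive}/{inner_path}"
    cleaned = combined.replace(os.sep, "_")
    if len(cleaned) > max_len:
        cleaned = hashlib.sha256(combined.encode()).hexdigest()
    return cleaned + ".index.sqlite"

# e.g. recursive_index_name("/data/outer.tar", "nested/inner.tar.gz")
#      -> "_data_outer.tar_nested_inner.tar.gz.index.sqlite"
```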

Hmm, thinking about it, I might be able to implement the second idea soonish.

Vadiml1024 commented 2 years ago

I'm not really a DB expert, but I suspect that table creation is an expensive operation. Maybe the approach of adding a table that maps hash(pathname) to virtual_inode_number, and then adding this virtual_inode_number as part of the key in the existing tables, would be more efficient?

mxmlnkn commented 2 years ago

How many archives inside the outer archive are we talking about?

Vadiml1024 commented 2 years ago

The biggest one I've encountered contained more than 300K.

mxmlnkn commented 2 years ago

That is quite a lot and indeed might need more brainstorming and benchmarking :/.

This might also trigger performance problems at other locations in the code, for example inside the AutoMountLayer class, which mounts those archives recursively, has to keep a map of all mounted recursive locations, and has to look them up each time FUSE requests a file.
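Conceptually, that per-request lookup looks something like this simplified sketch (not the actual AutoMountLayer code):

```python
# Simplified sketch of the kind of lookup AutoMountLayer has to do per FUSE
# request: find the innermost mounted (nested) archive whose mount point is a
# prefix of the requested path. Not the actual implementation.
mounted = {
    "/": "outer archive handle",
    "/nested/inner.tar": "inner archive handle",
}

def find_mount(path: str):
    """Walk up the path components until a mount point matches."""
    candidate = path
    while candidate:
        if candidate in mounted:
            return mounted[candidate], path[len(candidate):].lstrip("/")
        candidate = candidate.rsplit("/", 1)[0]
    return mounted["/"], path.lstrip("/")

# find_mount("/nested/inner.tar/docs/readme.txt")
# -> ("inner archive handle", "docs/readme.txt")
```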

Vadiml1024 commented 2 years ago

Given that self.mounted is a dict in AutoMountLayer, I think the lookup is not too time-consuming.
