[Feature Request] Parse bags from archives (zip/tar)

We use bag-database to help manage our ROS bags, but we also want to manage additional log data that lives outside the bag, such as text logs or third-party binary logs from vehicle-related drivers. We like the Bag Database for organizing our bags and giving us a snapshot of each collected bag (the bag is our primary source of data), but we would like to have access to all of our data in a single archive.

I'd like to explore the option of adding support for extracting bags from an archive, such as a .zip or .tar/.tar.gz file. Ideally, BagDb would watch/scan for archives and, when one is found, scan its table of contents (ToC) for any included bags. If a bag is found, it would extract and index the bag from the zip/tar. From the user's perspective, we don't want to present the extracted bag in the UI; rather, we want to reference the bag from the archive as though the archive is the bag. Selecting the archive would display the same summary information, etc. Downloading the bag would download the archive (not the internal bag). However, actions like playing video or analyzing tracks/paths, topics, etc. would internally extract the bag to a temporary location and stream from that location. After some configurable window of time with no access to the bag (say, 60 minutes), we'd remove the temporary file to keep things clean.
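To make the ToC scan concrete, here is a minimal sketch for the .zip case using only the JDK's `java.util.zip`. The class name `ArchiveBagScanner` and the rule "an entry whose name ends in `.bag` is a bag" are assumptions for illustration, not part of the Bag Database code base. A .tar/.tar.gz archive has no central directory, so listing its contents would mean reading through the stream instead.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper; not an existing Bag Database class.
final class ArchiveBagScanner {
    /** Lists entries whose names end in ".bag" by reading only the zip's table of contents. */
    static List<String> findBagEntries(Path zipPath) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            return zip.stream()
                    .filter(e -> !e.isDirectory())
                    .map(ZipEntry::getName)
                    // Assumption: bags are recognized purely by file extension.
                    .filter(name -> name.toLowerCase().endsWith(".bag"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Nothing is extracted here; the returned entry names could then be used to decide whether the archive is worth indexing at all.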

I noticed in the source for bag/storage/s3/S3BagWrapperImpl.java that there is a TODO for minimizing redundant access to a file (for video, etc.). I think this would also apply here. I don't know whether there is a sufficient mechanism for essentially nesting one storage type under another (i.e., S3 storage could use Zip storage).
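One way to picture both the temporary extraction and the "avoid redundant access" idea is a small cache of extracted files that is swept for idle entries. The sketch below is only an illustration: `TempBagCache`, its methods, and the `archive!entry` key format are all hypothetical, and the current time is passed in explicitly so the behavior is easy to test. It ignores the race between a reader and the sweeper, which a real implementation would need to handle.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache of bags extracted to temporary files.
final class TempBagCache {
    private static final class Entry {
        final Path file;
        volatile long lastAccessMillis;

        Entry(Path file, long now) {
            this.file = file;
            this.lastAccessMillis = now;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long idleMillis;

    TempBagCache(Duration maxIdle) {
        this.idleMillis = maxIdle.toMillis();
    }

    /** Returns the temp file for key, calling extractor only when nothing is cached yet. */
    Path get(String key, Function<String, Path> extractor, long nowMillis) {
        Entry e = entries.computeIfAbsent(key, k -> new Entry(extractor.apply(k), nowMillis));
        e.lastAccessMillis = nowMillis;
        return e.file;
    }

    /** Deletes temp files that have been idle for the whole window; returns how many were removed. */
    int evictIdle(long nowMillis) {
        int removed = 0;
        for (Iterator<Entry> it = entries.values().iterator(); it.hasNext();) {
            Entry e = it.next();
            if (nowMillis - e.lastAccessMillis >= idleMillis) {
                it.remove();
                try {
                    Files.deleteIfExists(e.file);
                } catch (IOException ex) {
                    throw new UncheckedIOException(ex);
                }
                removed++;
            }
        }
        return removed;
    }
}
```

With a 60-minute window, `evictIdle` could be called from a scheduled task every few minutes; repeated video or topic requests for the same bag would then reuse one extracted file rather than extracting it again.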

I'm relatively new to ROS (~9 months), but I'm a seasoned veteran of Java/Web and am happy to contribute. I would probably need some guidance in understanding how best to leverage the existing storage framework. If this is totally outside the storage framework's current design, that is fine too; at least then I can start looking for alternate solutions to our problem.

We like BagDB and use it daily! Thank you to all for creating/maintaining it!

Being able to store miscellaneous files together with a bag would definitely be useful, and I'd be interested in adding support for this, but I suspect it'll be a good amount of work.

The problem with dealing with compressed archives is that random access to specific data inside a compressed archive is generally not possible; you have to decompress the entire archive to get to the files inside. The Bag DB needs random access in order to be able to do things like display images or stream videos from bag files, or to be able to integrate with tools like Webviz that can index to arbitrary positions within a bag file.
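To illustrate the cost, the sketch below reads a few bytes at an offset inside a compressed zip entry using only the JDK. A zip's central directory does let you locate an individual entry without reading the others (a .tar.gz is a single compressed stream and offers no such shortcut), but within a deflated entry the only way to reach a given offset is to inflate and discard everything before it. `ZipEntryReader` and `readAt` are hypothetical names used for illustration.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper showing why "seeking" inside a compressed entry is expensive.
final class ZipEntryReader {
    /** Reads up to length bytes starting at offset within the named entry. */
    static byte[] readAt(Path zipPath, String entryName, long offset, int length) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No such entry: " + entryName);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                // There is no real seek: skipping decompresses and throws away
                // every byte in front of the requested offset.
                long remaining = offset;
                while (remaining > 0) {
                    long skipped = in.skip(remaining);
                    if (skipped > 0) {
                        remaining -= skipped;
                    } else if (in.read() >= 0) {
                        remaining--;
                    } else {
                        throw new EOFException("Offset is past the end of " + entryName);
                    }
                }
                byte[] buf = new byte[length];
                int total = 0;
                while (total < length) {
                    int n = in.read(buf, total, length - total);
                    if (n < 0) {
                        break; // entry ended before length bytes were available
                    }
                    total += n;
                }
                return Arrays.copyOf(buf, total);
            }
        }
    }
}
```

The work done by a call grows with `offset`, so jumping to a message near the end of a multi-GB bag would cost roughly as much as decompressing the whole bag.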

As you suggested, one possibility is allowing users to upload a compressed archive, then decompressing and extracting it on the server side on demand. That could be useful, but the biggest issue is that it could take a very long time to extract large bags. I've worked with bags that were dozens of GB in size, and on a slower system (NAS devices often do not have very powerful CPUs) it could potentially take hours to decompress one, which isn't really feasible if a user wants to quickly display a bag in Webviz. Even for smaller bags, having to wait a few minutes is annoying.

I would lean more toward just storing everything unarchived; users could potentially still upload compressed archives, but the Bag DB could decompress them before writing them to its storage. Then it could easily provide random access to any of the associated files, and it could re-compress them on the fly if a user wanted to download everything. There are some other potential issues to think through here (you don't necessarily have to answer these questions, I'm just writing them down so I can think about them):
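A minimal sketch of the "decompress before writing to storage" step for an uploaded .zip might look like the following. `ArchiveUnpacker` and the idea of one target directory per upload are assumptions for illustration, not existing Bag Database behavior. The check that each entry stays inside the target directory matters because entry names in an uploaded archive are untrusted input.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper; unpacks an uploaded zip stream into a bundle directory.
final class ArchiveUnpacker {
    /** Extracts every file entry under bundleDir and returns the paths that were written. */
    static List<Path> unpack(InputStream upload, Path bundleDir) throws IOException {
        Path root = bundleDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        List<Path> written = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(upload)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path target = root.resolve(entry.getName()).normalize();
                // Reject names such as "../x" or absolute paths that would land outside root.
                if (!target.startsWith(root)) {
                    throw new IOException("Entry escapes bundle directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    Files.createDirectories(target.getParent());
                    // Copies only the current entry; the zip stream reports EOF at the entry's end.
                    Files.copy(zin, target, StandardCopyOption.REPLACE_EXISTING);
                    written.add(target);
                }
            }
        }
        return written;
    }
}
```

Because the upload is read as a stream, the archive itself never has to be written to disk; only the extracted files are stored.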

- If a user copies an archive to the Bag DB's folder instead of uploading it, should we decompress it, add the bag & related files to the database, and then delete the original archive?
  - We could also just leave the original archive, which would make downloading it faster, but then also use >2x as much disk space.
- What if it turns out the archive doesn't contain a bag file? We should probably not delete it but also somehow mark it in the database so we don't try to process it again.
- What if it contains multiple bags? Should they all be added as separate entries, or somehow bundled together? The UI & database schema might need some work to handle representing and storing multiple bag files that are all part of a single bundle.
- What if a file inside the archive has the same name as an existing file? The solution to this is probably to make a new directory for the bundle and unarchive everything in there, but if a user is already using directories to organize their bags, how do we make directories that contain a bundle distinct from directories used for organizing multiple bags? Maybe add a special extension or add a special file inside the directory to mark it?
- How should the SQL database keep track of all of the other files associated with a bag? If we need to search for or store metadata about them, it'd make sense to add another table to store that; but if not, we might be able to avoid pointless database operations and just scan that directory on demand.
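As a rough sketch of two of the ideas above (a special marker file to distinguish bundle directories, and scanning a bundle's directory on demand rather than tracking every file in SQL), something like the following could work. The marker name `.bagbundle` and the class `BundleDirectory` are invented for illustration; nothing like them exists in the Bag Database today.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for directories that hold one bag plus its related files.
final class BundleDirectory {
    // Assumption: an empty marker file distinguishes a bundle from an ordinary folder of bags.
    static final String MARKER = ".bagbundle";

    /** True if dir is a directory that contains the marker file. */
    static boolean isBundle(Path dir) {
        return Files.isDirectory(dir) && Files.isRegularFile(dir.resolve(MARKER));
    }

    /** Scans the bundle on demand and returns non-bag files, relative to the bundle root. */
    static List<Path> associatedFiles(Path bundleDir) throws IOException {
        try (Stream<Path> walk = Files.walk(bundleDir)) {
            return walk.filter(Files::isRegularFile)
                    .filter(p -> !p.getFileName().toString().equals(MARKER))
                    .filter(p -> !p.getFileName().toString().toLowerCase().endsWith(".bag"))
                    .map(bundleDir::relativize)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

The trade-off is the one raised in the last question: an on-demand scan needs no schema changes, but it cannot support searching across associated files the way a dedicated table could.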

There's already some support for being able to read in separate YAML files that contain metadata about bag files, and it might be possible to fold in that functionality as a subset of this. It might also be useful to be able to update associated files based on output from running a bag through a script in a Docker container.

This could also be useful as a stepping stone for working toward ROS2 bag support, since the default ROS2 bag storage mechanism treats bags as directories that contain a metadata file and a SQLite database, which is very inconvenient under the Bag DB's current paradigm of having one file per bag.

swri-robotics / bag-database

[Feature Request] Parse bags from archives (zip/tar) #200