swri-robotics / bag-database

A server that catalogs bag files and provides a web-based UI for accessing them.

[Feature Request] Parse bags from archives (zip/tar) #200

Open kylemallory opened 1 year ago

kylemallory commented 1 year ago

We use bag-database to help manage our ROS bags, but we also want to manage additional log data stored outside the bag, such as text logs or third-party binary logs from vehicle-related drivers. We like the Bag Database for organizing our bags and giving us a snapshot of each collected bag (the bag is our primary source of data), but we would like to have access to all of our data in a single archive.

I'd like to explore the option of adding support for extracting bags from an archive, such as a .zip or .tar/.tar.gz file. Ideally, BagDb would watch/scan for archives and, when one is found, scan the archive's ToC for any included bags. If it finds a bag, it would extract and index that bag from the zip/tar. From the user's perspective, we don't want to present the extracted bag in the UI; rather, we want to reference the bag from the archive as though the archive is the bag. Selecting the archive would display the same summary information, etc. Downloading the bag would download the archive (not the internal bag). However, actions like playing video or analyzing tracks/paths, topics, etc., would internally extract the bag to a temporary location and stream from that location. After some configurable window of time with no access to the bag (say, 60 minutes), we'd remove the temporary file to keep things clean.
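To make the scan-then-extract idea concrete, here is a minimal sketch for the zip case only, using the JDK's `java.util.zip` classes. The class name `ArchiveBagScanner`, its method names, and the `bagdb-` temp-file prefix are hypothetical and not part of the Bag Database code base; tar/tar.gz handling and the idle-timeout cleanup are not covered.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper for illustration; not an existing Bag Database class. */
class ArchiveBagScanner {

    /** Lists the names of all ".bag" entries by reading the zip's table of contents. */
    static List<String> findBagEntries(Path archive) throws IOException {
        List<String> bags = new ArrayList<>();
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory() && entry.getName().toLowerCase().endsWith(".bag")) {
                    bags.add(entry.getName());
                }
            }
        }
        return bags;
    }

    /**
     * Extracts a single entry to a fresh temporary file and returns its path.
     * The entry name is never used to build the output path, so a malicious
     * name such as "../../etc/x.bag" cannot escape the temp directory.
     */
    static Path extractToTemp(Path archive, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No such entry in archive: " + entryName);
            }
            Path temp = Files.createTempFile("bagdb-", ".bag");
            try (InputStream in = zip.getInputStream(entry)) {
                Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
            }
            return temp;
        }
    }
}
```

Listing entries only reads the archive's directory, so it should be cheap even for large archives; the extraction step is the part whose cost grows with the size of the bag.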

I noticed in the source for bag/storage/s3/S3BagWrapperImpl.java that there is a TODO for minimizing redundant access to a file (for video, etc.). I think this would also apply here. I don't know whether there is a sufficient mechanism for essentially nesting one storage type under another (i.e., S3 storage could use Zip storage).

I'm relatively new to ROS (~9 months), but I'm a seasoned veteran of Java/Web and am happy to contribute. I would probably need some guidance in understanding how best to leverage the existing storage framework. If this is totally outside the storage framework's current design, that is fine too; at least then I can start looking for alternate solutions to our problem.

We like BagDB and use it daily! Thank you to all for creating/maintaining it!

pjreed commented 1 year ago

Being able to store miscellaneous files together with a bag would definitely be useful, and I'd be interested in adding support for this, but I suspect it'll be a good amount of work.

The problem with dealing with compressed archives is that random access to specific data inside a compressed archive is generally not possible; you have to decompress the entire archive to get to the files inside. The Bag DB needs random access in order to be able to do things like display images or stream videos from bag files, or to be able to integrate with tools like Webviz that can index to arbitrary positions within a bag file.

As you suggested, one possibility is allowing users to upload a compressed archive, then decompressing and extracting it on the server side on demand. That could be useful, but the biggest issue is that it could take a very long time to extract large bags. I've worked with bags that were dozens of GB in size, and on a slower system (NASes often do not have very powerful CPUs) it could potentially take hours to decompress one, which isn't really feasible if a user wants to quickly display a bag in Webviz. Even for smaller bags, having to wait a few minutes is annoying.

I would lean more toward just storing everything unarchived; users could potentially still upload compressed archives, but the Bag DB could decompress them before writing them to its storage. Then it could easily provide random access to any of the associated files, and it could re-compress them on the fly if a user wanted to download everything. There are some other potential issues to think through here (you don't necessarily have to answer these questions, I'm just writing them down so I can think about them):

There's already some support for being able to read in separate YAML files that contain metadata about bag files, and it might be possible to fold in that functionality as a subset of this. It might also be useful to be able to update associated files based on output from running a bag through a script in a Docker container.

This could also be useful as a stepping stone for working toward ROS2 bag support, since the default ROS2 bag storage mechanism treats bags as directories that contain a metadata file and a SQLite database, which is very inconvenient under the Bag DB's current paradigm of having one file per bag.
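As a rough illustration of the "store everything unarchived, re-compress on download" idea above, the following sketch streams a directory of associated files into a zip as it is written to an output stream. It is only a sketch under assumptions: the class name `BagBundleZipper`, the method `writeZip`, and the notion of one directory per bag are hypothetical and do not describe how the Bag DB stores files today.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper for illustration; not an existing Bag Database class. */
class BagBundleZipper {

    /**
     * Writes every regular file under bundleDir into a zip on the target stream.
     * Entries are compressed as they are written, so no temporary archive is
     * created on disk. Closing the zip stream also closes the target stream.
     */
    static void writeZip(Path bundleDir, OutputStream target) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(bundleDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (ZipOutputStream zip = new ZipOutputStream(target)) {
            for (Path file : files) {
                // Zip entry names use forward slashes regardless of platform.
                String name = bundleDir.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(file, zip);
                zip.closeEntry();
            }
        }
    }
}
```

In a web handler, `target` would be the HTTP response's output stream. The same directory-per-bag layout would also fit a ROS2 bag, which is a directory containing a metadata file and a SQLite database.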