ketiltrout / getdata

The GetData Project is the reference implementation of the Dirfile Standards, a filesystem-based, column-oriented database format for time-ordered binary data.
http://getdata.sourceforge.net/
GNU Lesser General Public License v2.1

Add read-only support for zipped Dirfiles #1

Open mpetroff opened 3 years ago

mpetroff commented 3 years ago

This PR adds read-only support for Dirfiles stored in uncompressed Zip files. Development of the patch was motivated by a need to reduce the total file count for FLAC-encoded Dirfiles, to alleviate the backup and data transfer overheads that result from having a very large number of small files. CLASS has been using these changes for more than a year at this point. The PR is identical to the patch attached to my 2020-02-28 post to the getdata-devel mailing list, except without the documentation (since it isn't part of this Git repository). The original version of the patch dates back to 2018.

Documentation

Separate from the Dirfile encoding scheme, GetData will read Dirfiles contained in uncompressed Zip files. This functionality is meant for reading archival data, so writing to these Zip files is not supported. Using the Info-ZIP zip utility, a Zip file can be created by running zip -r0 ../dirfile.zip * from within the root of an existing Dirfile. All encoding schemes are supported by this functionality except for the two encoding schemes that already use Zip files, zzip and zzslim. The encoding scheme must be specified using the /ENCODING directive, even if the Dirfile is unencoded. For /INCLUDE directives and LINTERP field look up table files, only relative paths are supported and only without ./ and ../ syntax.
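
As a rough illustration of what zip -r0 produces, the following sketch builds an equivalent Store-only archive with Python's standard zipfile module. The function name zip_dirfile_stored and the paths in the usage note are hypothetical; none of this is part of GetData or of the patch.

```python
import os
import zipfile


def zip_dirfile_stored(dirfile_root, zip_path):
    """Pack every file under dirfile_root into zip_path without compression.

    zip_path should live outside dirfile_root (as with ../dirfile.zip),
    so that the archive does not end up containing itself.
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for folder, _subdirs, files in os.walk(dirfile_root):
            for name in sorted(files):
                full = os.path.join(folder, name)
                # Member names are relative to the Dirfile root, so the
                # top-level format file sits at the top of the archive.
                arcname = os.path.relpath(full, dirfile_root)
                zf.write(full, arcname, compress_type=zipfile.ZIP_STORED)
    return zip_path
```

Calling zip_dirfile_stored("mydirfile", "mydirfile.zip") from the parent directory would then play the same role as the zip command above.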

Although Zip files are most commonly created using Deflate compression, the Zip standard (ISO/IEC 21320-1) also supports Store compression, i.e., no compression at all. GetData's Zip file support requires Store compression for all data files, although either Store compression or Deflate compression can be used for any format files or any LINTERP field look up table files. With Store compression, a Zip file effectively concatenates a Dirfile's individual files together into a single file. Since a Zip file contains an offset table, unlike a tarball, random reads are supported without the need to load the entire file from disk.
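
To make the random-read point concrete, here is a minimal sketch, using only Python's standard library, of reading a byte range from a Store-compressed member without reading the rest of the archive. It illustrates the Zip layout only and is not how the patch itself is implemented; read_stored_range is a hypothetical helper name.

```python
import struct
import zipfile


def read_stored_range(zip_path, member, start, length):
    """Read up to `length` bytes at offset `start` of a stored member."""
    with zipfile.ZipFile(zip_path) as zf:
        # The central directory (the "offset table") gives the member's
        # position in the archive without scanning the file data.
        info = zf.getinfo(member)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError(f"{member} is not stored uncompressed")
    with open(zip_path, "rb") as fh:
        # The local file header is 30 fixed bytes followed by the member
        # name and an optional extra field; the raw data comes right after.
        fh.seek(info.header_offset)
        header = fh.read(30)
        name_len, extra_len = struct.unpack("<HH", header[26:30])
        data_start = info.header_offset + 30 + name_len + extra_len
        fh.seek(data_start + start)
        # Clamp the read so it never runs past the end of the member.
        return fh.read(min(length, max(info.file_size - start, 0)))
```

Because a stored member's bytes appear verbatim in the archive, a seek plus a read is enough; a Deflate-compressed member would have to be decompressed from its beginning to reach the same offset.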

Documentation patch

Index: html/dirfile.html.in
===================================================================
--- html/dirfile.html.in    (revision 1175)
+++ html/dirfile.html.in    (working copy)
@@ -1222,6 +1222,30 @@
       example isn't strictly necessary, since <i>z.r</i> could be used wherever
       <i>re_z</i> would be.)

+      <h2><a name="zippeddirfiles">Zipped Dirfiles</a></h2>
+      <p>Separate from the Dirfile encoding scheme, GetData will read Dirfiles
+      contained in uncompressed Zip files. This functionality is meant for
+      reading archival data, so writing to these Zip files is not supported.
+      Using the Info-ZIP <span class="syntax">zip</span> utility, a Zip file can
+      be created by running <span class="syntax">zip -r0 ../dirfile.zip *</span>
+      from within the root of an existing Dirfile. All encoding schemes are
+      supported by this functionality except for the two encoding schemes that
+      already use Zip files, <b>zzip</b> and <b>zzslim</b>. The encoding scheme
+      must be specified using the /ENCODING directive, even if the Dirfile is
+      unencoded. For /INCLUDE directives and LINTERP field look up table files,
+      only relative paths are supported and only without
+      <span class="syntax">./</span> and <span class="syntax">../</span> syntax.
+      <p>Although Zip files are most commonly created using <i>Deflate</i>
+      compression, the Zip standard (ISO/IEC 21320-1) also supports <i>Store</i>
+      compression, i.e., no compression at all. GetData's Zip file support
+      requires <i>Store</i> compression for all data files, although either
+      <i>Store</i> compression or <i>Deflate</i> compression can be used for any
+      <b>format</b> files or any LINTERP field look up table files. With
+      <i>Store</i> compression, a Zip file effectively concatenates a Dirfile's
+      individual files together into a single file. Since a Zip file contains an
+      offset table, unlike a tarball, random reads are supported without the
+      need to load the entire file from disk.
+
       <h2><a name="versions">History</a></h2>
       <p>The latest version of the Dirfile Standards is Version 10.
       <div class="inset">

ketiltrout commented 3 years ago

@mpetroff This looks reasonable, but I need to take a closer look. I'll probably have a few, mostly stylistic, issues.

Could you remove all the libtool wrappers from the PR (like test/sie_get_little_zip) and instead list them in the .gitignore file?

mpetroff commented 3 years ago

> Could you remove all the libtool wrappers from the PR (like test/sie_get_little_zip) and instead list them in the .gitignore file?

Those were added accidentally when I converted the existing patch into the Git commit. I just removed them and added them to the .gitignore file. I squashed and force-pushed this change to remove the files from the branch history.

ketiltrout commented 3 years ago

@mpetroff I'm working through this, and hope to have it ready some time next week. Among other things, I've added a ./configure option to enable/disable the feature and improved error propagation.

Would you prefer I push the changes to your fork or copy your branch here and do it locally?

Also, would you have time to test the changes?

mpetroff commented 3 years ago

Pushing to my fork's branch is fine. It might take me a week or two to get to it, but I'll have time to test the changes. Thanks for working on getting this merged.

ketiltrout commented 2 years ago

Just to keep you up to date: I was hoping to get this finished before releasing v0.11.0, but I think it needs more work, and it has pointed me to some things that need fixing within the encoding framework. It will have to wait for the next release, which I'm hoping won't be too long from now. (Pushing GetData-0.11.0 out the door has laid bare some things that really do need some work.)