IO compression and alternative serialization stores

bmcfee commented 9 years ago

JAMS files can get pretty large in plain text. It would be nice to support compression of some kind, e.g., by allowing a gzip file handle instead of a filename. This will probably take a little refactoring to do properly, but I see no downside in directly supporting jams.gz (or jamz, if you will) as a format.
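To make the gzip idea concrete, here is a minimal sketch using only the standard library. The helper names `save_json_gz` and `load_json_gz` and the toy document are made up for illustration; they are not part of the JAMS API.

```python
import gzip
import json

# Hypothetical stand-in for a JAMS document; real JAMS objects are richer.
doc = {"annotations": [{"time": 0.5 * i, "value": "beat"} for i in range(100)]}


def save_json_gz(obj, path):
    """Serialize obj as JSON into a gzip-compressed file."""
    # In text mode ("wt"), gzip.open returns a file-like object that
    # json.dump can write to exactly as it would to a plain file.
    with gzip.open(path, "wt", encoding="utf-8") as fdesc:
        json.dump(obj, fdesc)


def load_json_gz(path):
    """Read a gzip-compressed JSON file back into Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as fdesc:
        return json.load(fdesc)
```

Because the gzip handle behaves like an ordinary text file, the JSON encoding step itself would not need to change.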

While we're at it, what are folks' opinions on generalizing the backend from JSON? I can imagine use cases where pickle or BSON might be preferable. Ideally, this would all be transparent to the user, and all load/save operations would work out of the box.

Arguments against doing this:

- Supporting multiple serialization backends might break interoperability
- Loss of plaintext interpretability

Arguments in favor:

- Greater flexibility for users
- Binary formats would be more efficient (on disk) than JSON

In all cases, we'd still use JSON schema validation on the decoded object, so functionally nothing would change.
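A rough illustration of why validation would be unaffected: whichever backend is used, decoding yields the same Python dictionary, and that dictionary is what gets validated. In this sketch `check_minimal` is a hypothetical stand-in for real JSON schema validation, and the two-field document is only illustrative.

```python
import json
import pickle

# Illustrative document; the field names are assumptions for this example.
doc = {"file_metadata": {"duration": 30.0}, "annotations": []}


def roundtrip_json(obj):
    """Encode to JSON text and decode again."""
    return json.loads(json.dumps(obj))


def roundtrip_pickle(obj):
    """Encode to pickle bytes and decode again."""
    return pickle.loads(pickle.dumps(obj))


def check_minimal(obj):
    """Hypothetical stand-in for schema validation of the decoded object."""
    return (isinstance(obj.get("file_metadata"), dict)
            and isinstance(obj.get("annotations"), list))
```

Both round trips return an object equal to `doc`, so the same check passes on either result.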

So... thoughts?

bmcfee commented 9 years ago

Just thinking this one through a little more: I think it can be implemented in a pretty clean way by writing a custom context manager for file IO. We'd simply replace all calls to open with the custom IO routine, which would select the backend codec (open, gzip.open, or a no-op if given an already-open file handle) and do the right thing.

Adding extra backends in the future will then be simple, since everything is confined to the custom manager.
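One way such a context manager could look is sketched below. The name `smart_open` and the extension table `OPENERS` are assumptions made for this example, not the code that was eventually committed.

```python
import contextlib
import gzip
import os

# Hypothetical mapping from file extension to opener; not taken from JAMS.
OPENERS = {
    ".jams": open,
    ".json": open,
    ".jamz": gzip.open,
    ".gz": gzip.open,
}


@contextlib.contextmanager
def smart_open(name_or_fdesc, mode="r"):
    """Yield a text-mode file object for a path or an already-open handle.

    mode is expected to be "r" or "w".
    """
    if hasattr(name_or_fdesc, "read") or hasattr(name_or_fdesc, "write"):
        # Already file-like: pass it through (the "no-op" case) and
        # leave closing it to the caller.
        yield name_or_fdesc
        return

    ext = os.path.splitext(name_or_fdesc)[1].lower()
    if ext not in OPENERS:
        raise ValueError("Unknown file extension: {!r}".format(ext))

    # Append "t" so that open and gzip.open both run in text mode.
    with OPENERS[ext](name_or_fdesc, mode + "t", encoding="utf-8") as fdesc:
        yield fdesc
```

With this in place, a call such as `with smart_open("track.jamz", "w") as fdesc: json.dump(data, fdesc)` reads the same for plain and compressed files, and supporting another codec would only mean adding an entry to the table.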

bmcfee commented 9 years ago

The recent string of commits adds support for jamz format, and gets us some pretty good efficiency improvements:

[~/git/jams/jams/tests/fixtures]$ ls -l
total 44K
-rw-rw-r-- 1 bmcfee bmcfee 5.3K May 18 15:07 invalid.jams
-rw-rw-r-- 1 bmcfee bmcfee 5.3K May 18 14:43 valid.jams
-rw-rw-r-- 1 bmcfee bmcfee  590 Jun 17 14:14 valid.jamz

Compared to the raw source material for valid.jams (248 + 34 = 282 bytes, listed below), the 590-byte valid.jamz incurs about 300 bytes of overhead:

[~/data/SMC_MIREX]$ ls -l */SMC_001*
-rw-r--r-- 1 bmcfee bmcfee 248 Dec 22  2011 SMC_MIREX_Annotations/SMC_001_2_1_1_a.txt
-rw-r--r-- 1 bmcfee bmcfee  34 Oct 23  2011 SMC_MIREX_Tags/SMC_001.tag

but that's only a rough estimate, since there's a bit more information in the jams file (paths, track duration, etc.)

We also now have direct IO to an open file descriptor, so a web service can write a JAMS object directly to a stream without going through the filesystem.
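As a sketch of what writing to a stream can look like, the snippet below sends JSON to in-memory buffers, which stand in for a web response body. The function names are hypothetical and a plain dictionary stands in for a JAMS object.

```python
import gzip
import io
import json


def dump_to_stream(obj, stream):
    """Write obj as JSON to any object with a text-mode write() method."""
    json.dump(obj, stream)


def dump_gz_to_stream(obj, stream):
    """Write obj as gzip-compressed JSON to a binary stream."""
    # Closing the GzipFile writes the gzip trailer but leaves `stream`
    # itself open, so the caller keeps control of it.
    with gzip.GzipFile(fileobj=stream, mode="wb") as gz:
        gz.write(json.dumps(obj).encode("utf-8"))


text_buf = io.StringIO()   # stands in for a text response stream
dump_to_stream({"annotations": []}, text_buf)

bin_buf = io.BytesIO()     # stands in for a binary response stream
dump_gz_to_stream({"annotations": []}, bin_buf)
```

Neither call touches the filesystem; the serialized bytes are available from the buffers via `getvalue()`.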

BSON or pickle backends, should we implement them, would have to be handled separately, since they serialize to binary rather than to JSON text.

This probably needs a few more test cases, but otherwise I think it's good to go.

urinieto commented 9 years ago

Nice!

marl / jams

IO compression and alternative serialization stores #39