Closed bmcfee closed 9 years ago
Just thinking this one through a little more: I think it can be implemented in a pretty clean way by writing a custom context manager for file IO. We'd simply replace all calls to `open` with the custom IO routine, which would intelligently select the backend codec (`open`, `gzip.open`, or no-op if provided a file handle), and do the right thing.
Adding extra backends in the future will then be simple, since everything is confined to the custom manager.
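For concreteness, here is a minimal sketch of what such a context manager might look like. The name `open_any`, the `fmt` argument, and the extension-to-codec mapping are illustrative assumptions, not the code that was actually committed to jams.

```python
import contextlib
import gzip
import os


@contextlib.contextmanager
def open_any(path_or_file, mode="r", fmt=None):
    """Yield a text-mode file object for a path or an already-open handle.

    Hypothetical helper: `mode` is expected to be plain "r" or "w".
    """
    if hasattr(path_or_file, "read") or hasattr(path_or_file, "write"):
        # Already a file-like object: no-op, and closing is left to the caller.
        yield path_or_file
        return

    if fmt is None:
        # Infer the codec from the file extension (assumed convention).
        ext = os.path.splitext(path_or_file)[1].lower()
        fmt = "jamz" if ext in (".jamz", ".gz") else "jams"

    # New backends would only need an extra entry in this table.
    openers = {"jams": open, "jamz": gzip.open}
    if fmt not in openers:
        raise ValueError("Unknown format: {}".format(fmt))

    # gzip.open defaults to binary mode, so request text mode explicitly.
    with openers[fmt](path_or_file, mode + "t") as fdesc:
        yield fdesc
```

With something like this, a loader could call `json.load` on whatever `open_any` yields, without caring whether the input was a plain file, a gzipped file, or a handle supplied by the caller.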
The recent string of commits adds support for the `jamz` format, and gets us some pretty good efficiency improvements:
[~/git/jams/jams/tests/fixtures]$ ls -l
total 44K
-rw-rw-r-- 1 bmcfee bmcfee 5.3K May 18 15:07 invalid.jams
-rw-rw-r-- 1 bmcfee bmcfee 5.3K May 18 14:43 valid.jams
-rw-rw-r-- 1 bmcfee bmcfee 590 Jun 17 14:14 valid.jamz
Compared to the raw source material for `valid.jams`, we incur about 300 bytes of overhead:
[~/data/SMC_MIREX]$ ls -l */SMC_001*
-rw-r--r-- 1 bmcfee bmcfee 248 Dec 22 2011 SMC_MIREX_Annotations/SMC_001_2_1_1_a.txt
-rw-r--r-- 1 bmcfee bmcfee 34 Oct 23 2011 SMC_MIREX_Tags/SMC_001.tag
but that's only a rough estimate, since there's a bit more information in the jams file (paths, track duration, etc.)
We also now have direct I/O to an open file descriptor, so a web service can write a jams object directly to the stream without going through the filesystem.
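As a rough illustration of the stream use case, the sketch below serializes a dictionary straight into an in-memory binary stream, optionally gzip-compressed. The function `dump_to_stream` and the payload dictionary are hypothetical stand-ins; this uses only the standard library and is not the jams API.

```python
import gzip
import io
import json


def dump_to_stream(obj, stream, compress=False):
    """Serialize obj as JSON into an already-open binary stream (hypothetical helper)."""
    payload = json.dumps(obj).encode("utf-8")
    if compress:
        # GzipFile wraps the existing stream; closing it flushes the gzip
        # trailer but leaves the underlying stream open for the caller.
        with gzip.GzipFile(fileobj=stream, mode="wb") as gz:
            gz.write(payload)
    else:
        stream.write(payload)


# BytesIO stands in for a socket or HTTP response body here.
buf = io.BytesIO()
dump_to_stream({"file_metadata": {"duration": 30.0}}, buf, compress=True)
```

Nothing touches the filesystem in this path, which is the property a web service would care about.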
`bson` or `pickle` backends, should we implement them, must be handled separately since they provide different serialization.
This probably needs a few more test cases, but otherwise I think it's good to go.
Nice!
JAMS files can get pretty large in plain text. It would be nice to support compression of some kind, e.g., by allowing a gzip file handle instead of a filename. This will probably take a little bit of refactoring to do properly, but I see no downside in directly supporting `jams.gz` (or `jamz`, if you will) as a format.
While we're at it, what are folks' opinions on generalizing the backend from JSON? I can imagine use cases where pickle or bson might be preferable. Ideally, this would all be transparent to the user, and all load/save operations would work out of the box.
Arguments against doing this:
Arguments in favor:
In all cases, we'd still use json schema validation, so functionally nothing would change.
So... thoughts?