ezyang opened 9 years ago
Unfortunately, when we picked the serialisation format 8 years ago we were not so careful, and the format has not changed since then. Backwards compatibility is the main concern for not fixing silly things like this issue.
If we do a revision of the API in the future that requires code changes in user code, then we could take the opportunity to also fix things like this.
This is indeed an annoying issue for code that needs to serialize a lot of Floats/Doubles (scientific code, for example). There are other suboptimal instances too: serializing a list forces its spine, and an empty list is serialized to 8(!) bytes.
Another option is to introduce another type class with the same API, with a well-documented serialization format that fixes the problems with the current format, and encourage people to switch to it.
If you have a lot of Floats/Doubles, you should use an array of some kind, wrap it in a newtype and write your own more efficient Binary instance.
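A minimal sketch of that suggestion, assuming the `binary` package's `Put`/`Get` primitives and the bit-cast helpers from `GHC.Float` (available since base 4.10); the name `DoubleArray` is illustrative, and a real program would wrap an unboxed vector rather than the plain list used here to keep the sketch self-contained:

```haskell
module Main where

import Control.Monad (replicateM)
import Data.Binary (Binary (..), decode, encode)
import Data.Binary.Get (getWord64be)
import Data.Binary.Put (putWord64be)
import GHC.Float (castDoubleToWord64, castWord64ToDouble)

-- Hypothetical wrapper; use e.g. an unboxed vector in practice.
newtype DoubleArray = DoubleArray [Double]
  deriving (Eq, Show)

instance Binary DoubleArray where
  put (DoubleArray xs) = do
    put (length xs)                              -- length prefix
    mapM_ (putWord64be . castDoubleToWord64) xs  -- 8 bytes per Double, no Integer detour
  get = do
    n <- get                                     -- read the length prefix back
    DoubleArray <$> replicateM n (castWord64ToDouble <$> getWord64be)

main :: IO ()
main = do
  let arr = DoubleArray [1.5, -2.25, 3.141592653589793]
  print (decode (encode arr) == arr)  -- prints True: bit-exact round trip
```

Because the bit-cast is exact, this round-trips every `Double` (including negative zero) in exactly 8 bytes per element.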
If your goal is just to serialize some data, you should be using @dcoutts 's upcoming CBOR work. The wire format is significantly better, and the implementation is way faster. That work may be included in the binary package, where it would be exposed as a new type class plus functions to serialize/deserialize.
A vector of doubles is not so bad. It's data structures containing doubles that are problematic: automatically derived instances will use the instance for Double. In the end I gave up and just gzipped everything after encoding.
Glad to hear.
Is there an estimate of when this will be fixed? If the CBOR approach will not be ready in the foreseeable future, then we should begin a deprecation cycle. First, introduce a better alternative encoding as an opt-in, and deprecate the current encoding by recommending that everyone use the new one. Then, after a time, make the new encoding the default and leave the old format supported as an opt-in.
See also #64 - that should also be fixed in the new encoding.
Note that people are publicly advocating abandoning this library for these reasons.
My understanding is that the CBOR work is rather close to being ready although @dcoutts may be able to say more.
IMO, it's 'done' in the sense that it literally compiles and you can use it. It's not done in the sense that it's nowhere near feature parity with binary, including things like Generic support, having documentation (it has zero), more tests, more instances, catching edge cases like this, and tightening the performance (which is already factors better than binary, at least). Given the nastiness of this issue, it would probably be better to err on the side of caution and get all this right first before it bites us again.
In other words: it's 80% done. So we only have to do the remaining 80% of the work.
(Edit: I say this since I've been working on improving this, but it's been slow and obviously I've been away the past few weeks)
Then I think it would be a good idea to create tickets for the missing features. It would be clear what's missing to make the library release-ready, and maybe someone will even submit a patch.
As in #79, is there a road-map for a new version of binary? The breaking change is indeed a valid concern, but this is exactly what the PVP was designed for, isn't it?
Data.Conduit.Serialization.Binary also makes use of this. I had a buffer full of 8-byte reals (floats with double precision) and it kept complaining about "not enough bytes" when I was sure its length was a round multiple of 8!
If your goal is just to serialize some data, you should be using @dcoutts 's upcoming CBOR work.
Are you talking about this repository? https://github.com/well-typed/binary-serialise-cbor I couldn't find anything on dcoutts' personal repositories.
Yes, that's the correct one.
Is somebody deciding whether or how to fix this problem? Not round-tripping properly is a bug, not a feature, for a serialization library, and this issue, if not fixed, should at least come with a proper warning in the documentation so as not to give users unpleasant surprises.
Also, since binary is shipped with GHC, it is not simple to just put a fix in a GitHub fork and use it in one's project. At least it is not simple to do in a stack project.
The fix for this bug is very simple, but is there inertia/analysis-paralysis about backward-compatibility? That could be the main issue but there is no discussion about this.
Indeed, the backwards-compatibility concerns do not help matters. Specifically, the documentation is not at all clear on what compatibility contract the provided instances follow, and the nature of the serialisation format makes it essentially impossible to change the existing instances without breaking existing serialised data.
I think our two options here are either:
a. decide that compatibility isn't important, perform a major version bump, and fix this and any other known format issues, or
b. expose newtype Ieee{Float,Double} wrappers which have sensible instances.
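A minimal sketch of what Option b's wrapper could look like, assuming the `binary` package's big-endian word primitives and the bit-cast helpers from `GHC.Float` (base >= 4.10); the name `IeeeDouble` and the choice of big-endian are illustrative assumptions:

```haskell
module Main where

import Data.Binary (Binary (..), decode, encode)
import Data.Binary.Get (getWord64be)
import Data.Binary.Put (putWord64be)
import GHC.Float (castDoubleToWord64, castWord64ToDouble)

-- Opt-in wrapper: always encodes as the raw 8-byte big-endian IEEE-754 bits.
newtype IeeeDouble = IeeeDouble Double
  deriving (Eq, Show)

instance Binary IeeeDouble where
  put (IeeeDouble d) = putWord64be (castDoubleToWord64 d)
  get = IeeeDouble . castWord64ToDouble <$> getWord64be

main :: IO ()
main = do
  -- The bit-cast preserves cases the decodeFloat-based encoding mishandles,
  -- e.g. the sign of negative zero:
  let IeeeDouble d = decode (encode (IeeeDouble (-0.0)))
  print (isNegativeZero d)  -- prints True
```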
I personally am fine with Option a and prefer it over both Option b and doing nothing. But if there is a concern that existing serialised data could get deserialised wrongly by existing user code whose too-relaxed binary upper bounds pick up the corrected Binary instances, here is an Option c that can partially address that concern, albeit in a more inconvenient way:
1. Move the Binary instances of Float & Double out of Data.Binary into a new module (e.g. Data.Binary.Instances.Moribund).
2. Add corrected Binary instances of Float & Double in another new module (e.g. Data.Binary.Instances.Incoming).
3. Eventually move the corrected instances into Data.Binary, and then make a major-version-bump release.

Meanwhile, document a policy about how the encoding may (or may not) change w.r.t. version-number changes, and document every encoding change in the change log of every release, announce the stages of the plan, and include in the Haddocks a clear warning not to use both the old and corrected instances.
Option b means that people have to use the newtype Ieee wrapper everywhere, because no unwrapped Float or Double is safe from the old encoding, since Binary instances for records containing Float or Double can be generically derived and end up using the old encoding.
Both Option b and doing nothing look to me a bit like the opposite of avoiding success at all costs: they emphasize backward compatibility with the past, in an unclear number of use cases, over correctness in perpetuity.
Apparently, binary represents a Double as a tuple of (Integer, Int)? This means that Doubles suffer a 3x or more size explosion, when really you could just record the IEEE floating-point bits with the proper endianness. This would also fix #64.
Backwards compatibility might be a concern for fixing this, however.
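For concreteness, that (Integer, Int) tuple is just what `decodeFloat` returns. A quick sketch (assuming the `binary` and `bytestring` packages) to see the representation and compare the encoded size against the 8 bytes a raw IEEE-754 encoding would need:

```haskell
module Main where

import Data.Binary (encode)
import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
  -- decodeFloat gives the (mantissa :: Integer, exponent :: Int) view
  -- that binary's Double instance serialises.
  print (decodeFloat (pi :: Double))
  -- Encoded size of one Double. 8 bytes would suffice for the raw
  -- IEEE-754 bits; the Integer/Int encoding takes roughly 3x that.
  print (BL.length (encode (pi :: Double)))
```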