kolmodin / binary

Efficient, pure binary serialisation using ByteStrings in Haskell.

Instance for Double/Float is absolutely batty #69

Open ezyang opened 9 years ago

ezyang commented 9 years ago

Apparently, binary represents a Double as a tuple of (Integer, Int)? This means doubles suffer a 3x or larger size explosion, when really you could just record the IEEE floating-point representation with a fixed endianness. This would also fix #64.

Backwards compatibility might be a concern for fixing this, however.
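The size blow-up is easy to check against the stock instance (a quick sketch; the exact pair encoding depends on the binary version, but it is well over 8 bytes):

```haskell
import Data.Binary (encode)
import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
  -- the stock instance encodes the (Integer, Int) pair from decodeFloat
  print (decodeFloat (1.0 :: Double))
  -- so a Double costs well over the 8 bytes of its IEEE-754 representation
  print (BL.length (encode (1.0 :: Double)) > 8)
```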

kolmodin commented 9 years ago

Unfortunately, when we picked the serialisation format 8 years ago we were not so careful, and the format has not changed since then. Backwards compatibility is the main concern keeping us from fixing silly things like this issue.

If we do a revision of the API in the future that requires code changes in user code, then we could take the opportunity to also fix things like this.

Shimuuar commented 9 years ago

This is indeed an annoying issue for code that needs to serialize a lot of Floats/Doubles (scientific code, for example). There are other suboptimal instances, too: serializing a list forces its spine, and an empty list is serialized to 8(!) bytes.
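The empty-list cost is easy to reproduce, since the list instance writes a fixed-size length prefix before the elements:

```haskell
import Data.Binary (encode)
import qualified Data.ByteString.Lazy as BL

main :: IO ()
main =
  -- the list instance writes a 64-bit length prefix,
  -- so even an empty list costs 8 bytes
  print (BL.length (encode ([] :: [Double])))
```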

Another option is to introduce a second type class with the same API but a well-documented serialization format that fixes the problems with the current one, and encourage people to switch to it.

kolmodin commented 9 years ago

If you have a lot of Floats/Doubles, you should use an array of some kind, wrap it in a newtype and write your own more efficient Binary instance.
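A minimal sketch of that workaround. The Doubles name and the plain-list representation are illustrative only (an unboxed vector would be the realistic choice); putDoublebe/getDoublebe are the fixed-size IEEE-754 primitives from Data.Binary.Put and Data.Binary.Get in newer binary releases:

```haskell
import Control.Monad (replicateM)
import Data.Binary (Binary (..), decode, encode)
import Data.Binary.Get (getDoublebe, getWord64be)
import Data.Binary.Put (putDoublebe, putWord64be)

-- Hypothetical wrapper; wrap an unboxed vector instead for real workloads.
newtype Doubles = Doubles [Double]
  deriving (Eq, Show)

instance Binary Doubles where
  put (Doubles xs) = do
    putWord64be (fromIntegral (length xs)) -- 8-byte length prefix
    mapM_ putDoublebe xs                   -- 8 bytes per element, IEEE-754 big-endian
  get = do
    n <- getWord64be
    Doubles <$> replicateM (fromIntegral n) getDoublebe

main :: IO ()
main = do
  let xs = Doubles [1.0, 2.5, -3.75]
  print (decode (encode xs) == xs) -- round-trips exactly
```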

If your goal is just to serialize some data, you should be using @dcoutts 's upcoming CBOR work. The wire format is significantly better, and the implementation is way faster. That work may be included into the binary package, and it'll be exposed as a new type class plus functions to serialize/deserialize.

Shimuuar commented 9 years ago

A vector of doubles is not so bad; it's data structures containing doubles that are problematic, since automatically derived instances will use the instance for Double. In the end I gave up and just gzipped everything after encoding.

Glad to hear.

ygale commented 8 years ago

Is there an estimate of when this will be fixed? If the CBOR approach will not be ready in the foreseeable future, then we should begin a deprecation cycle: first, introduce a better alternative encoding as an opt-in, and deprecate the current encoding by recommending that everyone use the new one; then, after a time, make the new encoding the default and keep the old format supported as an opt-in.

See also #64 - that should also be fixed in the new encoding.

Note that people are publicly advocating abandoning this library for these reasons.

bgamari commented 8 years ago

My understanding is that the CBOR work is rather close to being ready although @dcoutts may be able to say more.

thoughtpolice commented 8 years ago

> My understanding is that the CBOR work is rather close to being ready although @dcoutts may be able to say more.

IMO, it's 'done' in the sense that it literally compiles and you can use it. It's not done in the sense that it's nowhere near feature parity with binary: things like Generic support, documentation (it has zero), more tests, more instances, catching edge cases like this, and tightening up the performance (which is already factors better than binary's, at least). Given the nastiness of this issue, it would probably be better to err on the side of caution and get all of this right first before it bites us again.

In other words: it's 80% done. So we only have to do the remaining 80% of the work.

(Edit: I say this since I've been working on improving this, but it's been slow and obviously I've been away the past few weeks)

Shimuuar commented 8 years ago

Then I think it would be a good idea to create tickets for the missing features. It would make clear what's missing for the library to be release-ready, and maybe someone would even submit a patch.

winterland1989 commented 7 years ago

As in #79, is there a road-map for a new version of binary? The breaking change is indeed a valid concern, but this is exactly what the PVP was designed for, isn't it?

flip111 commented 7 years ago

Data.Conduit.Serialization.Binary also makes use of this. I had a file full of 8-byte reals (double-precision floats) and it kept complaining about "not enough bytes" when I was sure I had a round multiple of 8 bytes!

flip111 commented 7 years ago

> If your goal is just to serialize some data, you should be using @dcoutts 's upcoming CBOR work.

Are you talking about this repository? https://github.com/well-typed/binary-serialise-cbor I couldn't find anything on dcoutts' personal repositories.

thoughtpolice commented 7 years ago

Yes, that's the correct one.

jchia commented 2 years ago

Is somebody deciding whether or how to fix this problem? Not round-tripping properly is a bug, not a feature, for a serialization library, and if this issue is not going to be fixed it should at least come with a prominent warning in the documentation so users are spared unpleasant surprises.

Also, since binary is shipped with GHC, it is not simple to just put a fix in a GitHub fork and use it in one's project. At least it is not simple to do in a stack project.

The fix for this bug is very simple, but is there inertia/analysis-paralysis about backward compatibility? That could be the main issue, but there is no discussion about it.

bgamari commented 2 years ago

> The fix for this bug is very simple, but is there inertia/analysis-paralysis about backward compatibility? That could be the main issue, but there is no discussion about it.

Indeed the backwards compatibility concerns do not help matters. Specifically, the documentation is not at all clear on what compatibility contract the provided instances follow, and the nature of the serialisation format makes it essentially impossible to change the existing instances without breaking existing serialised data.

I think our two options here are either:

a. decide that compatibility isn't important, perform a major version bump, and fix this and any other known format issues, or
b. expose newtype Ieee{Float,Double} wrappers with sensible instances.
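A sketch of what option b could look like. IeeeDouble here is hypothetical, following the naming in the comment above; the encoding is the raw 8-byte big-endian IEEE-754 representation via putDoublebe/getDoublebe:

```haskell
import Data.Binary (Binary (..), decode, encode)
import Data.Binary.Get (getDoublebe)
import Data.Binary.Put (putDoublebe)
import qualified Data.ByteString.Lazy as BL

-- Hypothetical wrapper whose Binary instance is the raw IEEE-754 bits.
newtype IeeeDouble = IeeeDouble Double
  deriving (Eq, Show)

instance Binary IeeeDouble where
  put (IeeeDouble d) = putDoublebe d   -- 8 bytes, big-endian
  get = IeeeDouble <$> getDoublebe

main :: IO ()
main = do
  print (BL.length (encode (IeeeDouble pi)))              -- exactly 8 bytes
  print (decode (encode (IeeeDouble pi)) == IeeeDouble pi) -- bit-exact round trip
```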

jchia commented 2 years ago

I personally am fine with Option a and prefer it over both Option b and doing nothing. But if the concern is that existing serialised data could be deserialised wrongly by existing user code (with too-relaxed binary upper bounds) picking up the corrected Binary instances, here is an Option c that can partially address that concern, albeit in a more inconvenient way:

  1. Move the old Binary instances for Float and Double out of Data.Binary into a new module (e.g. Data.Binary.Instances.Moribund).
  2. Define corrected Binary instances for Float and Double in another new module (e.g. Data.Binary.Instances.Incoming).
  3. Make a major-version-bump release. (Now all existing code that uses these instances will fail to build and will need to be fixed by importing the desired instances. This way, existing, unmodified user code will not accidentally deserialise old data using the new encoding.)
  4. Retire the old instances: after some time, remove the module with the old instances, make the new encoding the default (requiring no import beyond Data.Binary), and make another major-version-bump release.

Meanwhile, document a policy on how the encoding may (or may not) change with respect to version numbers, document every encoding change in the change log of every release, announce the stages of the plan, and include a clear warning in the Haddocks not to mix the old and corrected instances.

Option b means that people would have to use the Ieee newtype wrappers everywhere, because no bare Float or Double would be safe from the old encoding: Binary instances for records containing Float or Double can be generically derived, and those end up using the old encoding.
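The pitfall can be shown with a hypothetical record type: the generically derived instance silently reuses the stock Double encoding for each field, so the bloat sneaks in without any explicit use of the Double instance:

```haskell
{-# LANGUAGE DeriveGeneric #-}

import Data.Binary (Binary, encode)
import qualified Data.ByteString.Lazy as BL
import GHC.Generics (Generic)

-- Hypothetical record; the empty instance uses binary's Generic defaults,
-- which fall back on the stock (Integer, Int) encoding for each Double.
data Point = Point { px :: Double, py :: Double }
  deriving (Generic)

instance Binary Point

main :: IO ()
main =
  -- two IEEE-754 doubles would be 16 bytes; the derived encoding is much larger
  print (BL.length (encode (Point 1.0 2.0)) > 16)
```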

Both Option b and doing nothing look to me a bit like the opposite of avoiding success at all costs: they put backward compatibility for an unclear number of past use cases above correctness in perpetuity.