twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0

make ordered serialization stable across compilations #1664

Closed: fwbrasil closed this 7 years ago

fwbrasil commented 7 years ago

Problem

Ordered serialization for Thrift unions is not stable across scalac executions. This happens because knownDirectSubclasses returns an unordered Set and the concrete Type classes don't implement stable hashCodes, so the iteration order of the subclasses can differ from one compilation to the next.

Solution

Sort the result of the knownDirectSubclasses method by the type name.
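
The idea can be sketched outside the macro as follows. This is a minimal illustration, not the code in this PR: FakeSymbol, StableSubclassOrder, stableOrder and tags are hypothetical names standing in for the compile-time symbols returned by knownDirectSubclasses, and it assumes that a subclass's position in the resulting list is what ends up being used as its tag.

```scala
// Hypothetical stand-in for a compile-time symbol. It does not override
// equals/hashCode, so a hash-based Set of these has no guaranteed order.
final class FakeSymbol(val fullName: String)

object StableSubclassOrder {
  // Sort by fully qualified name so the order no longer depends on
  // hash codes or on the Set implementation.
  def stableOrder(subclasses: Set[FakeSymbol]): List[FakeSymbol] =
    subclasses.toList.sortBy(_.fullName)

  // Assumption: the index in the sorted list is used as the tag.
  def tags(subclasses: Set[FakeSymbol]): Map[String, Int] =
    stableOrder(subclasses).map(_.fullName).zipWithIndex.toMap
}
```

With this ordering, the same set of subclasses maps to the same tags no matter how the Set happens to iterate.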

johnynek commented 7 years ago

Seems like a worthwhile fix to me.

👍

isnotinvain commented 7 years ago

Ideally, we'd probably sort these by something like the Thrift field ID, but I guess it doesn't really matter. I don't think OrderedSerialization makes any promises about schema evolution, right?
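
To make the difference between the two sort keys concrete, here is a small sketch. Member, TagAssignment, byName and byFieldId are hypothetical names used only for illustration, and the sketch assumes the tag is simply the index in the sorted list; it is not how scalding reads Thrift metadata.

```scala
// Hypothetical union member: a type name plus its Thrift field ID.
final case class Member(name: String, fieldId: Int)

object TagAssignment {
  // Tag = position after sorting by type name.
  def byName(ms: Set[Member]): Map[String, Int] =
    ms.toList.sortBy(_.name).map(_.name).zipWithIndex.toMap

  // Tag = position after sorting by Thrift field ID.
  def byFieldId(ms: Set[Member]): Map[String, Int] =
    ms.toList.sortBy(_.fieldId).map(_.name).zipWithIndex.toMap
}
```

Both orderings are stable across compilations of the same schema. They differ only when the schema changes: adding a member whose name sorts before an existing one shifts the existing tags under byName, while a member added with a higher field ID leaves the existing tags unchanged under byFieldId.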

fwbrasil commented 7 years ago

> I don't think OrderedSerialization makes any promises about schema evolution right?

@isnotinvain I'm not sure. Looking at the code, I don't see why this would be a problem for the user who reported the issue, since the ordering is stable within a single compilation. Maybe there's a path somewhere that reuses serialized data from previous executions/versions? I haven't heard back from the user on this, and the problem was affecting only an integration test, so maybe we can drop the backward compatibility if we decide that OrderedSerialization shouldn't support this scenario? cc/ @johnynek

johnynek commented 7 years ago

I think the use case here is ephemeral serialization (between tasks deployed from the same jar). I don't think we have contemplated this as a replacement for any long-term storage.

johnynek commented 7 years ago

👍