mozilla / jydoop

Efficient Hadoop Map-Reduce in Python

Added Lists, Dicts to serialization code #21

Closed tarasglek closed 11 years ago

tarasglek commented 11 years ago

I did not do != 0 comparisons for dicts because that's confusing. I'm not convinced we need that.

I also did not implement byte-level dict comparison. I'm still not sure why you implemented it.

I'm still not quite sure why we need a 1:1 mapping between sorting on raw bytes and sorting on higher-level data structures. The only thing that's important is that things that are equal as higher-level objects remain equal when represented as a bytestream. How they are sorted relative to other keys doesn't seem important. Is there some detail I'm missing?
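To illustrate the equality requirement above, here is a minimal sketch, assuming a JSON-based encoding purely for demonstration (it is not jydoop's serialization format, and `canonical_bytes` is a hypothetical helper). Two dicts that compare equal in Python can be built in different insertion orders, so a serializer has to canonicalize, for example by sorting keys, before equal objects are guaranteed to yield equal bytes.

```python
import json


def canonical_bytes(obj):
    """Hypothetical helper: serialize obj so that equal objects give equal bytes.

    sort_keys=True removes the dependence on dict insertion order;
    fixed separators avoid incidental whitespace differences.
    """
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


# Equal as Python objects, but built in a different key order.
a = {"x": 1, "y": [1, 2]}
b = {"y": [1, 2], "x": 1}

same_objects = (a == b)
same_bytes = (canonical_bytes(a) == canonical_bytes(b))
```

Nothing in this sketch makes the byte order match the object order; it only preserves equality, which is the weaker property the comment argues is sufficient.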

bsmedberg commented 11 years ago

By "byte-level comparison", do you mean WritableComparator? That's actually very important for performance when combining/reducing, because creating the PyObjects is pretty expensive. I tend to think that we should not allow dicts or lists in keys, but we could allow them in values...
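As a rough sketch of the performance argument: if keys are encoded so that byte order matches object order, a comparator can work directly on the serialized bytes and never rebuild the objects. The encoding below is an assumption made up for this example (type tags `0x01`/`0x02`, 64-bit integers only), and `encode_key` and `compare_raw` are hypothetical names, not jydoop or Hadoop APIs.

```python
import struct


def encode_key(value):
    """Hypothetical order-preserving encoding for int and str keys."""
    if isinstance(value, int):
        # Shift the signed 64-bit range into the unsigned range so that
        # big-endian byte order agrees with numeric order.
        # struct.error is raised for values outside the 64-bit range.
        return b"\x01" + struct.pack(">Q", value + (1 << 63))
    if isinstance(value, str):
        # UTF-8 byte order agrees with code point order.
        return b"\x02" + value.encode("utf-8")
    raise TypeError("unsupported key type: %r" % type(value))


def compare_raw(left, right):
    """Compare two encoded keys without deserializing them.

    Returns -1, 0 or 1, like a Java Comparator.
    """
    return (left > right) - (left < right)


numbers = [5, -3, 0, 100, -100]
sorted_by_bytes = sorted(numbers, key=encode_key)
```

A dict has no natural ordering to preserve, which is one reason it fits poorly as a key under a scheme like this.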

bsmedberg commented 11 years ago

I'm going to take this and see if I can separate out a key class that supports comparison but not dicts, and a value class that supports dicts but not comparison.
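The split described here could look roughly like the following. This is only an illustrative Python 3 sketch under assumed names (`Key`, `Value`, `_check_key`); the actual change lives in the linked pull request and may be structured quite differently.

```python
from functools import total_ordering


def _check_key(payload):
    """Reject anything that is not a scalar or a tuple of allowed items."""
    if isinstance(payload, tuple):
        for item in payload:
            _check_key(item)
    elif not isinstance(payload, (int, float, str)):
        raise TypeError("not allowed in a key: %r" % type(payload))


@total_ordering
class Key:
    """Hypothetical key wrapper: comparable, no dicts or lists."""

    def __init__(self, payload):
        _check_key(payload)
        self.payload = payload

    def __eq__(self, other):
        return isinstance(other, Key) and self.payload == other.payload

    def __lt__(self, other):
        if not isinstance(other, Key):
            return NotImplemented
        return self.payload < other.payload

    def __hash__(self):
        return hash(self.payload)


class Value:
    """Hypothetical value wrapper: may hold dicts and lists, defines no ordering."""

    def __init__(self, payload):
        self.payload = payload


keys_in_order = sorted([Key(("b", 2)), Key(("a", 9)), Key(("a", 1))])
ordered_payloads = [k.payload for k in keys_in_order]
```

With this shape, only `Key` needs a byte-level comparator, and `Value` is free to carry arbitrary nested structures.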

bsmedberg commented 11 years ago

https://github.com/tarasglek/jydoop/pull/1 contains the changes which do what I think we want here.