nathanmarz / cascalog

Data processing on Hadoop without the hassle.

Preserve order for grouping fields. #220

Closed funkenblatt closed 10 years ago

funkenblatt commented 10 years ago

Currently, the grouping fields in a query that uses aggregators end up in a scrambled order because they are computed with set operations. This is normally not a problem, but it becomes one when you expect similar groups to be adjacent in the output, e.g. when using a template tap whose template fields are coarser than the group-by fields. In that situation, the template tap may be forced to re-open files that it has previously seen, which can cause "file already exists" errors.
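To illustrate the general problem (this is only a sketch in plain Clojure, not Cascalog's actual planner code, and the field names are made up for the example): deriving the grouping fields with a set operation yields a set with no guaranteed iteration order, whereas filtering the original vector keeps the order in which the fields were declared.

```clojure
(require '[clojure.set :as set])

;; Hypothetical field names, for illustration only.
(def out-fields ["?year" "?month" "?day" "?count"])
(def agg-fields #{"?count"})

;; Set difference returns a set; its iteration order is unspecified,
;; so the grouping fields may come out in any order.
(def grouping-unordered
  (set/difference (set out-fields) agg-fields))

;; Filtering the original vector preserves the declared order.
(def grouping-ordered
  (vec (remove agg-fields out-fields)))
```

Both forms contain the same fields; only the second guarantees that "?year" sorts before "?month" and "?day", which is what keeps coarser groups adjacent.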

sritchie commented 10 years ago

Nice.

tomjack commented 10 years ago

Great!

I think this also makes it possible to reliably group on incomparable values (say, numbers and strings), when there is a comparable value you can put first such that, for a given first grouping value, all later grouping values will be comparable. Example:

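;; Each tuple is [type-tag value]; values are only comparable
;; with each other when they share the same type tag.
(let [tvals [[:long 2]
             [:long 3]
             [:str "foo"]
             [:str "bar"]]]
  (??<- [?type ?val]
    (tvals ?type ?val)
    (:distinct true)))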

Before this change, this would sometimes break, and reordering the grouping vars was futile. I've backported this change for this reason.
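As a rough analogy in plain Clojure (this uses Clojure's own vector comparison, not the comparator Hadoop actually applies to grouping tuples): comparing tuples element by element never reaches the incomparable values as long as a differing type tag comes first, but it fails once the raw values lead.

```clojure
;; Type tag first: tuples with different tags are decided by the tag,
;; so a long is never compared against a string.
(def tag-first [[:str "foo"] [:long 3] [:long 2] [:str "bar"]])
(def sorted-tag-first (vec (sort tag-first)))

;; Value first: the sort has to compare a long with a string directly.
(def value-first [[2 :long] [3 :long] ["foo" :str] ["bar" :str]])
(def value-first-fails?
  (try
    (sort value-first)
    false
    (catch ClassCastException _ true)))
```

With the tag in the leading position the sort succeeds and groups the longs before the strings; with the value leading, it throws a ClassCastException.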