sjyk / sampleclean-async

http://sampleclean.org
Apache License 2.0

Crowd Context #73

Open sjyk opened 9 years ago

sjyk commented 9 years ago

Include the other columns in the task.

sjyk commented 9 years ago

This is actually hard to do, since the current code applies a count distinct first and then runs attrdedup on the result.

thisisdhaas commented 9 years ago

Hm. Could we rewrite the initial count distinct query as a group by?

e.g. SELECT name, first(col1), first(col2), ... FROM t GROUP BY name

This requires Spark SQL to have a `first` aggregate, or some other way of getting a single value out of each group.
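
To make the idea concrete, here is a rough sketch of the group-by rewrite. It uses Python's built-in sqlite3 as a stand-in for Spark SQL, and it is not the sampleclean code. The table `t`, the columns `col1` and `col2`, the sample rows, and the helper `group_with_context` are all made up for illustration. SQLite has no `first` aggregate, so the sketch registers one by hand; whether the Spark SQL version in use offers an equivalent would need to be checked.

```python
import sqlite3


class First:
    """User-defined aggregate: keep the first value seen in each group."""

    def __init__(self):
        self.seen = False
        self.value = None

    def step(self, value):
        # Only the first value passed in for the group is retained.
        if not self.seen:
            self.value = value
            self.seen = True

    def finalize(self):
        return self.value


def group_with_context(rows):
    """Return (name, count, col1, col2) per distinct name (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    # Register the hand-written aggregate under the SQL name `first`.
    conn.create_aggregate("first", 1, First)
    conn.execute("CREATE TABLE t (name TEXT, col1 TEXT, col2 TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT name, COUNT(*), first(col1), first(col2) "
        "FROM t GROUP BY name ORDER BY name"
    ).fetchall()
    conn.close()
    return result


# Made-up sample data: two spellings of the same restaurant name.
sample = [
    ("Cafe Uno", "Berkeley", "italian"),
    ("Cafe Uno", "Berkeley", "pizza"),
    ("cafe uno", "Oakland", "italian"),
]

for row in group_with_context(sample):
    print(row)
```

Each distinct name still comes back once with its count, as in the count distinct query, but now carries one representative value from the other columns that could be shown to the crowd. Which row supplies that value depends on the order in which the engine visits the group, so it should be treated as an arbitrary sample rather than a canonical value.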