twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.48k stars 704 forks source link

Remove fold cases and sumAll from CoGroupable.atMostOneFn since folds and sumAll can return 1+ result from 0 input #1881

Closed dieu closed 5 years ago

dieu commented 5 years ago

Hey,

We found the problem with Join and if something before can produce non-empty iterator for empty-iterator, like fold, foldLeft, crazy Semigroup in sumLeft.

While we are joining, we get an iterator from cascading and try to apply everything that's not yet applied to this value, some of the function can produce non-empty iterator even if don't have anything (aka cascading give us empty iterator).

johnynek commented 5 years ago

this is certainly scary.... can you also add one using mapGroup which should also show the bug, since fold is just calling mapGroup...

dieu commented 5 years ago

so, the problem with fold, we doing fold on empty-iterator (when no data on left), but it's legit that fold on empty produce init as result.

johnynek commented 5 years ago

last comment, does only fold trigger it? We don't often use fold, but we do very often use .sum before a join.

I can't believe we don't have test coverage for this case. That's really unfortunate.

dieu commented 5 years ago

last comment, does only fold trigger it? We don't often use fold, but we do very often use .sum before a join.

writing tests for others...