tresata / spark-sorted

Secondary sort and streaming reduce for Apache Spark
Apache License 2.0
78 stars 17 forks source link

mapStreamByKey fails silent if treated as flatMap #5

Closed jhalliday closed 9 years ago

jhalliday commented 9 years ago

Using the Java API in 0.3.0, I inadvertently implemented a mapStreamByKey function that returned an Iterator over a different (lower, sometimes zero) number of elements to that in the input iterator i.e. I wrote a flatMap where I needed a regular Map. This did not cause an error, but produced an incorrect result, seemingly causing early termination such that many keys were never presented to the iterator. So this issue is in two parts: a) using mapStreamByKey should fail with an appropriate exception if the input and output sizes do not match and b) please can we have an equivalent flatMapStreamByKey function in the API. thx.

koertkuipers commented 9 years ago

hello, thanks for posting this.

mapStreamByKey does not make the assumption for the passed in function that the input and output iterators have to be of same size. and if the output iterator is if size 0 then the key should indeed disappear from the output. is this what you observed?

i agree that the name is somewhat confusing. maybe it should be called flatMapStreamByKey instead.

map

On Tue, Aug 4, 2015 at 11:13 AM, jhalliday notifications@github.com wrote:

Using the Java API in 0.3.0, I inadvertently implemented a mapStreamByKey function that returned an Iterator over a different (lower, sometimes zero) number of elements to that in the input iterator i.e. I wrote a flatMap where I needed a regular Map. This did not cause an error, but produced an incorrect result, seemingly causing early termination such that many keys were never presented to the iterator. So this issue is in two parts: a) using mapStreamByKey should fail with an appropriate exception if the input and output sizes do not match and b) please can we have an equivalent flatMapStreamByKey function in the API. thx.

— Reply to this email directly or view it on GitHub https://github.com/tresata/spark-sorted/issues/5.

jhalliday commented 9 years ago

and if the output iterator is of size 0 then the key should indeed disappear from the output.

it does, but iterators for subsequent keys in the same partition are not then passed to the map function.

koertkuipers commented 9 years ago

ok, that sounds like a serious bug. i will investigate asap.

On Tue, Aug 4, 2015 at 12:18 PM, jhalliday notifications@github.com wrote:

and if the output iterator is of size 0 then the key should indeed disappear from the output.

it does, but iterators for subsequent keys in the same partition are not then passed to the map function.

— Reply to this email directly or view it on GitHub https://github.com/tresata/spark-sorted/issues/5#issuecomment-127663281.

lvsl-deactivated commented 9 years ago

:+1: for renaming.

koertkuipers commented 9 years ago

i can confirm the bug. i hope to get a fix out today. i think i understand what is causing it. thanks for reporting!

i am also considering adding more robust unit tests, maybe use ScalaCheck to generate many random tests, because this iterator stuff is clearly getting the better of me.

koertkuipers commented 9 years ago

i think the logic for the name mapStream was that you map on the iterator (which represents the stream). so for every iterator in its always exactly one iterator out, hence map, not flatMap.

still think it should be renamed?

koertkuipers commented 9 years ago

i believe this is fixed in commit 7a5173008f45a7829e3f6ef614add36fd5da0f33 and release 0.3.1 (c721f791f3bb5117bbaa076ed8dcc14e470eaaac)

thanks again for catching and reporting this