Closed jhalliday closed 9 years ago
hello, thanks for posting this.
mapStreamByKey does not make the assumption for the passed in function that the input and output iterators have to be of same size. and if the output iterator is if size 0 then the key should indeed disappear from the output. is this what you observed?
i agree that the name is somewhat confusing. maybe it should be called flatMapStreamByKey instead.
map
On Tue, Aug 4, 2015 at 11:13 AM, jhalliday notifications@github.com wrote:
Using the Java API in 0.3.0, I inadvertently implemented a mapStreamByKey function that returned an Iterator over a different (lower, sometimes zero) number of elements to that in the input iterator i.e. I wrote a flatMap where I needed a regular Map. This did not cause an error, but produced an incorrect result, seemingly causing early termination such that many keys were never presented to the iterator. So this issue is in two parts: a) using mapStreamByKey should fail with an appropriate exception if the input and output sizes do not match and b) please can we have an equivalent flatMapStreamByKey function in the API. thx.
— Reply to this email directly or view it on GitHub https://github.com/tresata/spark-sorted/issues/5.
and if the output iterator is of size 0 then the key should indeed disappear from the output.
it does, but iterators for subsequent keys in the same partition are not then passed to the map function.
ok, that sounds like a serious bug. i will investigate asap.
On Tue, Aug 4, 2015 at 12:18 PM, jhalliday notifications@github.com wrote:
and if the output iterator is of size 0 then the key should indeed disappear from the output.
it does, but iterators for subsequent keys in the same partition are not then passed to the map function.
— Reply to this email directly or view it on GitHub https://github.com/tresata/spark-sorted/issues/5#issuecomment-127663281.
:+1: for renaming.
i can confirm the bug. i hope to get a fix out today. i think i understand what is causing it. thanks for reporting!
i am also considering adding more robust unit tests, maybe use ScalaCheck to generate many random tests, because this iterator stuff is clearly getting the better of me.
i think the logic for the name mapStream was that you map on the iterator (which represents the stream). so for every iterator in its always exactly one iterator out, hence map, not flatMap.
still think it should be renamed?
i believe this is fixed in commit 7a5173008f45a7829e3f6ef614add36fd5da0f33 and release 0.3.1 (c721f791f3bb5117bbaa076ed8dcc14e470eaaac)
thanks again for catching and reporting this
Using the Java API in 0.3.0, I inadvertently implemented a mapStreamByKey function that returned an Iterator over a different (lower, sometimes zero) number of elements to that in the input iterator i.e. I wrote a flatMap where I needed a regular Map. This did not cause an error, but produced an incorrect result, seemingly causing early termination such that many keys were never presented to the iterator. So this issue is in two parts: a) using mapStreamByKey should fail with an appropriate exception if the input and output sizes do not match and b) please can we have an equivalent flatMapStreamByKey function in the API. thx.