Add support for Seq.zipAll()

jOOQ / jOOL

jOOλ - The Missing Parts in Java 8 jOOλ improves the JDK libraries in areas where the Expert Group's focus was elsewhere. It adds tuple support, function support, and a lot of additional functionality around sequential Streams. The JDK 8's main efforts (default methods, lambdas, and the Stream API) were focused around maintaining backwards compatibility and implementing a functional API for parallelism.

http://www.jooq.org/products

Apache License 2.0

2.08k stars 167 forks source link

Add support for Seq.zipAll() #187

Closed lukaseder closed 8 years ago

lukaseder commented 8 years ago

The current Seq.zip() method "inner zips" two streams, such that the size of the resulting stream is the minimum of the size of each zipped stream.

Sometimes, it is useful, however, to fully zip all streams, providing Optional values (or nulls) once the shorter stream is depleted. Example:

// ((1, "A"), (2, "B"))
Seq.of(1, 2, 3).zip(Seq.of("A", "B"));

// ((1, "A"), (2, "B"), (3, null))
Seq.of(1, 2, 3).leftOuterZip(Seq.of("A", "B"));

// ((1, "A"), (2, "B"), (null, "C"))
Seq.of(1, 2).rightOuterZip(Seq.of("A", "B", "C"));

danieldietrich commented 8 years ago

It is a special case of what Scala calls zipAll:

interface Seq<T> {
    // thisElem and thatElem used to fill up the result if this or that runs out of elements
    <U> Seq<Tuple2<T, U>> zipAll(Seq<U> that, T thisElem, U thatElem);
}

with thisElem and thatElem set to null.

lukaseder commented 8 years ago

Yep, true. Thanks for the hint.

danieldietrich commented 8 years ago

I thought again about outer zips. In contrast to DB joins, zip is very limited. It composes records sequentially, one-by-one. The only match-condition the records meet is their position.

Outer joins handle the case where for a given record of one relation no record of the other relation matches the condition. zipAll handles both left-and right outer join simulaneously by providing default values. Therefore zipAll is not a special case of outer zips.

lukaseder commented 8 years ago

Well... In DB Speak, the following can be said:

-- Convenience form
A [ { LEFT | RIGHT | FULL } [ OUTER ] ] ZIP B

-- Full form
(SELECT A.*, ROW_NUMBER() OVER() AS RN FROM A) AS A
[ { LEFT | RIGHT | FULL } [ OUTER ] ] JOIN
(SELECT B.*, ROW_NUMBER() OVER() AS RN FROM B) AS B
ON A.RN = B.RN

zipAll() is an alias for fullOuterZip()

danieldietrich commented 8 years ago

Thank you. If we had functions

Seq<T> continue(T next);
Seq<T> continue(Supplier<? extends T> nextSupplier);
Seq<T> continue(Function<? super T, ? extends T> previousToNext);

then

// = leftOuter
Seq.of(1, 2, 3).zip(Seq.of("A", "B").continue(null));

// = rightOuter
Seq.of("A", "B", "C").zip(Seq.of(1, 2).continue(null)).map(t -> t.swap());

lukaseder commented 8 years ago

Hmm, good point (although, your rightOuter suggestion appears incorrect to me, the resulting stream should contain only ((A, 1), (B, 2))). And continue(T) is nothing but concat(Seq.generate(T)).

danieldietrich commented 8 years ago

your rightOuter suggestion appears incorrect to me, the resulting stream should contain only ((A, 1), (B, 2)))

should be right, the Tuple2 is swapped (couldn't come up with a better solution yet)

And continue(T) is nothing but concat(Seq.generate(T))

generate(T) is a special case of continue(T): empty().continue(T)

lukaseder commented 8 years ago

That would be my understanding of right outer zip:

// = rightOuter (no effect)
Seq.of("A", "B", "C").continue(null).zip(Seq.of(1, 2));

// = rightOuter (with effect)
Seq.of("A", "B", "C").continue(null).zip(Seq.of(1, 2, 3, 4));

generate(T) is a special case of continue(T): empty().continue(T)

Eh, are we talking about the same thing? :) Here's my naive continue(T) implementation:

interface Seq<T> {
    default Seq<T> continue(T t) {
        return concat(Seq.generate(t));
    }
}

Or, what am I missing?

danieldietrich commented 8 years ago

I took your example from above:

// ((1, "A"), (2, "B"), (null, "C"))
Seq.of(1, 2).rightOuterZip(Seq.of("A", "B", "C"));

The continue(null) should happen in the zip method argument, e.g. finite.zip(that.continue(null)). Otherwise we do not terminate, e.g. infinite.zip(that).

Therefore I needed to change your example like this:

Seq.of("A", "B", "C")
    // = ("A", 1), ("B", 2), ("C", null)
    .zip(Seq.of(1, 2).continue(null))
    // = (1, "A"), (2, "B"), (null, "C")
    .map(t -> t.swap());

Eh, are we talking about the same thing? :) Here's my naive continue(T) implementation:

yes, you're right - I've overseen that.

lukaseder commented 8 years ago

Why not simply

default <U> Seq<Tuple2<T, U>> rightOuterZip(Seq<? extends U> right) {
    return continue(null).zip(right);
}

No swapping needed...

danieldietrich commented 8 years ago

Oh, right - my fault. Was a little bit undercoffeeinated... :-)

danieldietrich commented 8 years ago

Oh noes - continue is a keyword...

Alternatives (see https://www.powerthesaurus.org/continue/synonyms):

extend
expand
advance
...

lukaseder commented 8 years ago

Well, I personally don't think that continue() is a first-level concept. It can be really trivially be achieved via concat(generate(...)) :)

danieldietrich commented 8 years ago

Yes, you are right I think. API shouldn't be redundant.

patrox commented 8 years ago

@lukaseder, @danieldietrich, I thought a bit about how zipAll could look like and inspired heavily by scala's zipAll I ended up with sth like this:

 static <T1, T2, R> Seq<R> zipAll(Seq<T1> s1, Seq<T2> s2, BiFunction<? super T1, ? super T2, ? extends R> zipper, T1 thisElem, T2 thatElem) {
        final Iterator<T1> it1 = s1.iterator();
        final Iterator<T2> it2 = s2.iterator();

        class ZipAll implements Iterator<R> {
            @Override
            public boolean hasNext() {
                return it1.hasNext() || it2.hasNext();
            }

            @Override
            public R next() {

                if (it1.hasNext()) {
                    if (it2.hasNext()) {
                        return zipper.apply(it1.next(), it2.next());
                    } else {
                        return zipper.apply(it1.next(), thatElem);
                    }
                } else {
                    if (it2.hasNext()) {
                        return zipper.apply(thisElem, it2.next());
                    } else {
                        throw new NoSuchElementException("next on empty iterator");
                    }
                }
            }
        }

        return seq(new ZipAll());
    }

What do you think ?

danieldietrich commented 8 years ago

Yes, it is a special method for this, how I would express it:

s1.zipAll(s2, thisElem, thatElem).map(tuple -> tuple.transform(zipper))

but without creating intermediate Tuples which relaxes GC.

lukaseder commented 8 years ago

Implementation-wise that's certainly one valid option. Do you also see a way to do this via Spliterator?

Also, given that the zipper is an optional convenience argument, I'd put it at the end and overload:

static <T1, T2> Seq<Tuple2<T1, T2>> zipAll(Seq<T1> s1, Seq<T2> s2, T1 default1, T2 default2) {
    return zipAll(s1, s2, default1, default2, (t1, t2) -> tuple(t1, t2));
}
static <T1, T2, R> Seq<R> zipAll(Seq<T1> s1, Seq<T2> s2, T1 default1, T2 default2, BiFunction<? super T1, ? super T2, ? extends R> zipper) {}

patrox commented 8 years ago

@lukaseder agree regarding the zipper as the version which I posted was a draft. I'll refine it in the evening, once I get to the keyboard :)

lukaseder commented 8 years ago

Looking forward! :)

In case you do want to look into using a Spliterator, you might want to consider delegating to SeqUtils.transform(), which makes it a bit easier.

patrox commented 8 years ago

@lukaseder I checked SeqUtils.transform() and how it's used in other methods in Seq, but since the the method is processing two Streams (Seqs) I decided to stay with using Iterator - but I'm quite curious how a corresponding Spliterator implementation would look like.

BTW, I have found a very interesting info, that at some point, Stream.zip was part of (very early) jdk8:

http://download.java.net/lambda/b93/docs/api/java/util/stream/Streams.html#zip(java.util.stream.Stream, java.util.stream.Stream, java.util.function.BiFunction)

Here is the PR (with a nice, round number): https://github.com/jOOQ/jOOL/pull/200

lukaseder commented 8 years ago

You're right. SeqUtils.transform() won't work here. Interesting info indeed. I wonder why it was removed! Probably some caveat with respect to parallelisation. Also, this class was called Streams, which would be the expected name. Now, we have the weird StreamSupport. Who knows what happend

amaembo commented 8 years ago

Seems that the main reason for removal was the API blow up for primitive specializations. You would need

zip(IntStream, IntStream, IntBinaryOperator);
zip(LongStream, LongStream, LongBinaryOperator);
zip(DoubleStream, DoubleStream, DoubleBinaryOperator);

But that's only beginning. What about

<T, R> zip(IntStream, Stream<T>, IntObjToObjFunction<T, R>);
zip(IntStream, LongStream, IntLongToWhateverFunction);

etc, etc. The performance is one of the main goals of the Stream API (though not very important for jOOL =)). So it was decided to remove zip at all than to ignore primitive specializations. I can guess, that we will see zip in JDK only when project valhalla will be ready.

The Streams class was created when interface static methods were not implemented yet. Most of it migrated into Stream/IntStream/LongStream/DoubleStream later.

lukaseder commented 8 years ago

That's one possible reason. Although, one might argue that this whole specialisation was just a bit premature, because hopefully, Valhalla will get rid of all these types again :)

patrox commented 8 years ago

@lukaseder, @amaembo openjdk mail archive explains quite a lot:

http://mail.openjdk.java.net/pipermail/lambda-libs-spec-observers/2013-June/002028.html

lukaseder commented 8 years ago

OK, done. Thanks again for the contribution!