google / guava

Google core libraries for Java
Apache License 2.0
50.03k stars 10.86k forks source link

Better support for streaming over map entries or other logical pairs of elements #2632

Open jbduncan opened 7 years ago

jbduncan commented 7 years ago

As discussed in https://github.com/google/guava/issues/2591#issuecomment-258624388, I wonder if it'd be worth adding a map stream class to Guava 21.0+, in the spirit of OpenGamma Strata's MapStream (javadoc) or StreamEx's EntryStream (javadoc).

jbduncan commented 7 years ago

@cpovirk I believe this issue should have "platform: java 8" and "package: collect' labels? :)

jbduncan commented 7 years ago

Thanks for adding them @cpovirk! :)

PhilippWendler commented 7 years ago

I would also like such a class, though I would suggest to make it more general and not only focused on maps.

It could look like this:

public interface PairStream<A, B> {

  static <A, B> PairStream<A, B> empty();

  // Zip methods, with different handling of iterables of different length:
  // throwing exception, truncating, or padding with null
  static <A, B> PairStream<A, B> zip(Iterable<A> a, Iterable<B> b) { }
  static <A, B> PairStream<A, B> zipShorter(Iterable<A> a, Iterable<B> b) { }
  static <A, B> PairStream<A, B> zipPadded(Iterable<A> a, Iterable<B> b) { }

  static <T, A, B> PairStream<A, B> of(Iterable<T> it, Function<T, A> aFunc, Function<T, B> bFunc) { }

  // Additional Stream variants of factory methods where possible.

  // Maybe additional convenience method for maps?
  static <A, B> PairStream<A, B> of(Iterable<Map.Entry<A, B>> entryStream) { }
  static <A, B> PairStream<A, B> ofEntries(Map<A, B> map) { }

  <R> PairStream<R, B> mapLeft(BiFunction<A, B, R> f);
  <R> PairStream<A, R> mapRight(BiFunction<A, B, R> f);
  <R> Stream<R> map(BiFunction<A, B, R> f);

  // Maybe convenience methods for simple projections?
  Stream<A> toLeft();
  Stream<B> toRight();

  // flatMap(Left|Right) would pair each of the entries of the resulting stream
  // with the current element of the other side
  <R> PairStream<R, B> flatMapLeft(BiFunction<A, B, Iterable<R>> f);
  <R> PairStream<A, R> flatMapRight(BiFunction<A, B, Iterable<R>> f);

  <R, S> PairStream<R, S> flatMap(BiFunction<A, B, PairStream<R, S>> f);

  // Sorted keeps pairs together, also overrides with Comparators
  PairStream<A, B> sortedLeft();
  PairStream<A, B> sortedRight();

  // Instead of collectors:
  ImmutableMap<A, B> toMap();
  ImmutableMap<A, B> toMap(BinaryOperator<B> reducer);
  ImmutableMultimap<A, B> toMultimap();

  // Remaining Stream methods that can be adapted directly:

  PairStream<A, B> filter(BiPredicate<A, B> p);

  void forEach(BiConsumer<A, B> c);
  PairStream<A, B> peek(BiConsumer<A, B> c);

  PairStream<A, B> distinct();

  PairStream<A, B> limit(long maxSize);
  PairStream<A, B> skip(long n);
  long count();

  boolean anyMatch(BiPredicate<A, B> p);
  boolean allMatch(BiPredicate<A, B> p);
  boolean noneMatch(BiPredicate<A, B> p);
}

StreamEx's EntryStream seems to have a few nice additional methods that could be added here, too.

Besides the use case for maps, this PairStream would also be an elegant solution for the proposed Iterators.forEachPair (https://github.com/google/guava/issues/677#issuecomment-288425903).

It does not require any Pair class nor any new functional interfaces added to the API. Of course, the common argument against any Pair-related features is that pairs are bad and one should use "proper" classes instead, but I think specifically in this case the counterargument is that having a PairStream with methods that take BiFunctions and BiPredicates is nicer than having a Stream<MyDataClass>, because one can directly access both values in the lambdas.

jbduncan commented 7 years ago

For everyone's info, it seems that JOOQ JOOL is investigating something similar to @PhilippWendler's idea as a class called Seq2. Perhaps that might be a source of inspiration or things to avoid for this particular issue?

https://github.com/jOOQ/jOOL/pull/172

jbduncan commented 7 years ago

Looks like a Streams.forEachPair() has been added as of https://github.com/google/guava/commit/8f1a088475b51f03b9f28617893510cc74b82b49.

jbduncan commented 7 years ago

I think @PhilippWendler's idea for a PairStream has merit, but if the name puts the Guava team off because of the association with Pair, then we could potentially name it DuoStream instead. :)

jbduncan commented 7 years ago

@perceptron8 I noticed that you down-voted the idea of a MapStream. I'd be interested to hear why you think it's not such a good idea. :)

perceptron8 commented 7 years ago

@jbduncan It seems that I checked StreamEx's EntryStream only and I found it to be excessively complex, even bloated. I should've pay more attention to OpenGamma Strata's MapStream which is nice, simple and - I agree - could be useful.

It's great that you linked Seq2 PR. Flattened Pair is still a Pair, just implicit. PairStream, DuoStream, BiStream... different names, same mess.

I withdraw from down-voting the MapStream, as it is definitely worth considering. However, I'm still against EntryStream and any Pair/Tuple-involving ideas.

Thanks for asking :)

jbduncan commented 7 years ago

Looks like Google MµG has just released an implementation of a BiStream class.

It's not clear to me what the relationship between Guava and MµG is, so is it worth keeping this issue open now?

kevinb9n commented 7 years ago

There's no relationship. (Except that we like Ben and he's contributed more than a few cool things to Guava in the past.) MµG is basically a "personal playground" project, which I don't mean to sound derogatory (really, we should all have one of those).

Guava APIs have to jump through an infamous number of hoops before being approved; projects like this have essentially no hoops at all... each is good in its own way.

fluentfuture commented 7 years ago

+1 for adding this to Guava.

I intended Mug to be a niche library for things that seem to have potential but don't quite make Guava's "utility * ubiquity" bar.

If Guava gets a BiStream or something of that kind, Mug BiStream will be deprecated in favor of Guava.

I'd also volunteer to work on adding BiStream in Guava if the librarians aren't already working on it or plan to (and if the idea itself is approved).

kevinb9n commented 7 years ago

I see mostly just two main possibilities: BiStream, or we embrace Stream<Map.Entry> as the common idiom, despite its flaws, and add bits of help here and there to make that less painful. I can't see introducing a Pair or Pair-alike at this stage, and I can't really see just doing nothing about this either.

I do want to dig through the openjdk archives to discover why they dropped BiStream in the first place; that may yield useful intel.

One nice thing about BiStream is we could add stream() methods to Multimap and ImmutableMap that return it directly. That would be quite pleasant.

fluentfuture commented 7 years ago

A pitch for the idea.

I too think its utility is beyond creating Maps. I've seen people creating List<Pair<A, B>> or even returning it. Such code is all over the places. Some I guess could have used Map or Multimap but a common argument could be that the user didn't logically need a map, but rather just trying to process some arbitrary pairs that are only meaningful in a method body (like, if you are trying to look at Doctor-Patient pairs to do some analysis).

An alternative is to create value objects. Thanks to autovalue, it's a lot easier than it used to be. But there can be cases where a class name is hard to come by (is it DoctorAndPatient?)

What I like about BiStream is that it can provide another alternative to these otherwise List<Pair<A, B>> use cases.

In addition to List<Pair<>>, I've run into the following use cases a few times:

  1. Processing stream elements with indices (stackoverflow).
BiStream.indexed(inputs)  // Returns a BiStream<Integer, V>
    .forEach((i, v) -> println(i + ": " + v));
  1. Guava's Streams.zip() and Streams.forEachPair(). I think BiStream would be a better home for them, and they will be more flexible:
BiStream.zip(doctors, patients)
    .filter((doctor, patient) -> doctor.likes(patient))  // No equivalence today
    .map((doctor, patient) -> ...)  // Equivalent to Streams.zip()
    .forEach((doctor, patient) -> ...);  // Equivalent to Streams.forEachPair()

Overall, I'm hoping a BiStream can eliminate another 50% of use cases where Pair may still have some appeals.

kevinb9n commented 7 years ago

Eliminating usages of Pair is a compelling argument. :-)

fluentfuture commented 7 years ago

I got curious and went digging openjdk archive.

Found two points explained by openjdk devs.

No killer use case:

For what its worth, we gathered a lot of uses cases for MapStream and found that the vast majority were better served by what has now become Collector.

For example, the most commonly suggested use case was tabulations like:

Map<Author, Integer> totalPagesByAuthor = docs.stream().groupBy(Doc::getAuthor) // MapStream<Author, List> .mapValues(v -> v.stream().map(Doc::pageCount).sum());

Doing this as a reduce is just as simple and actually far more efficient (and more flexible too).

Map<Author, Integer> totalPagesByAuthor = docs.stream().collect(groupingBy(Doc::getAuthor), sumBy(Doc::getPageCount)));

When we saw that what seemed like 90% of the use cases were better served by better combinators for reduction, the return-on-complexity took a nose dive.

Some of the remaining use cases are acceptably served by streams of Map.Entry (though this is clearly a slippery slope since once you stray off the happy path it gets hostile fast.)

So given the substantial incremental cost in complexity and code size, we're still looking for killer use cases that would justify this.

No tuple

However, it would likely stick out like a sore thumb when Value types and then Tuples arrive (when? “when they are ready!”), and then we can do much better.

So, groupingBy() was once considered a killer use case for MapStream but then it was realized that it can be done better as Collector.

And mapping from (k1, v1) to (k2, v2) in MapStream/BiStream is clunky without language tuple support -- mapping to a Map.Entry is weird at best. In Mug, I'm hoping users can use one of the following workarounds:

  1. call map() to go from (k, v) to AProperValueObject.
  2. call forEach((k, v) -> ...).
  3. call flatMap2((k, v) -> anotherBiStream) or, heck, flatMap2((k, v) -> BiStream.of(k2, v2)).
  4. or maybe toMap() and toMultimap() can cover some use cases.

This is also why Mug didn't add BiStream#min()/max() because having to return and use Map.Entry feels kinda unfortunate.

From openjdk perspective though, Pair will naturally disappear when tuple is added to the language so I guess in the long run it may be a non-goal or at least premature to solve yet.

kevinb9n commented 7 years ago

I would not assume that tuples will ever be added to the language. As for value types, they are trying, but the challenges are great.

kevinb9n commented 7 years ago

Anyway, good to know that their arguments against aren't particularly dissuasive for us.

PhilippWendler commented 7 years ago

Streams.zip and Streams.forEachPair already handle a good part of my use cases for PairStream. However, one relatively large class of use cases is still missing: zipping over the entries of a map. With having zip and forEachPair for streams in Guava, it really feels like an omission that the JDK provides only forEach for maps and not zip.

Could you please add Stream<R> Streams.zipMapEntries(Map<K, V> map, BiFunction<K, V, R> func) to Guava? I think this would be a worthy addition that does not suffer from any of the open questions for PairStream. An alternative would be to add this method to Maps, but I think it would be less discoverable there (in Streams, people will notice it when they look for Streams.zip). Of course, adding an instance method to ImmutableMap in addition would also be great.

zipMapEntries would only be a shortcut for map.entrySet().stream().map(entry -> func.apply(entry.getKey(), entry.getValue())), but I think it is still valuable because it often allows using method references instead of lambdas that unpack the entries, and because being able to assign names to the key and value parameters in a lambda transports more semantics than just calling entry.getKey() and entry.getValue().

jodastephen commented 6 years ago

As mentioned above, at OpenGamma we have the MapStream class (Javadoc, code) which provides a stream over Map.Entry.

I think it is safe to say it is one of our most popular utility classes internally. We use it all over our codebase (hundreds of uses), and it makes the relevant pieces of logic much easier to understand. I think it would make a great addition to Guava. In fact the surprise is that it hasn't been added yet!

Our implementation works on Map.Entry which seems like the best available pair tuple in the JDK. The design hides actual use of Map.Entry as much as it can though, favouring BiFunction etc. Our implementation implements Stream<Map.Entry>, but I'm not sure that was actually necessary or wise, as it doesn't tend to get used in that way (and compromises a couple of method names). If there are any more design/experience questions I'm happy to answer.

Is there anything else I can do to help this happen?