mengdiwang / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0
0 stars 0 forks source link

Multisets method request - highest/lowest frequency / element collection sorted by frequency #356

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I've been using Multiset recently for frequency counting of objects, then
various tasks sorting by frequency.  Perhaps there's a better way to do
that in general (I'm no expert when it comes to algorithms), but the
Multiset seems a convenient basic data structure to use.

That in mind, it would be useful for the static Multisets had methods for
getting highest/lowest frequency elements, or returning an
ascending/descending collection of the elements (or elements + counts).

An alternative would be some specialized version of TreeMultiset that
sorted by element count first, then individual element comparison.  I tried
that briefly, but couldn't seem to get the comparator right.

Original issue reported on code.google.com by blank...@gmail.com on 29 Apr 2010 at 3:28

GoogleCodeExporter commented 9 years ago
Here are some static methods we can work on releasing:

public static <E> List<Entry<E>> sortedEntries(
    Multiset<E> multiset, Order order) {}

public static <E> List<E> mostFrequent(
    Multiset<E> multiset, int n) {}

public static <E> List<E> leastFrequent(
    Multiset<E> multiset, int n) {}

Sounds like these would help you?

Next, I have often mused about the idea of a "leaderboard" type of Multiset 
which is 
just like any other except always iterates in (increasing/decreasing) count 
order. I 
think I can take this as a request for that?

These two should probably be split.

Thanks for the requests!

Original comment by kevinb@google.com on 29 Apr 2010 at 11:32

GoogleCodeExporter commented 9 years ago
Yes, I would definitely like to see both those static methods added, and the 
leaderboard style multiset.  I'm not 
sure which I'd use for the particular case in front of me, but I could see both 
cleaning up a lot of the boilerplate I 
have.

I suppose I just get excitable as to choice-like to have my options open.

I agree, this request should probably be split into the two different options.

Original comment by blank...@gmail.com on 29 Apr 2010 at 11:58

GoogleCodeExporter commented 9 years ago
Instead of returning List<E> for the most/leastFrequent(n), what about 
returning List<Multiset.Entry<E>>, as is done for sortedEntries?  This would 
provide both the elements and counts directly, without requiring a subsequent 
call to count() to retrieve the count.   Or is the thinking that most callers 
would want just the elements, and the counts can easily be retrieved with the 
additional call to count() ?

Original comment by theycall...@gmail.com on 7 Aug 2010 at 5:57

GoogleCodeExporter commented 9 years ago
I don't understand why the methods are static.

Original comment by dfran...@gmail.com on 8 Aug 2010 at 3:45

GoogleCodeExporter commented 9 years ago
That is, shouldn't they be methods of Multiset?

public <E> List<Entry<E>> sortedEntries(Order order)
public static <E> List<E> mostFrequent(int n)
public static <E> List<E> leastFrequent(int n)

Original comment by dfran...@gmail.com on 8 Aug 2010 at 3:46

GoogleCodeExporter commented 9 years ago
The methods as part of the general Multiset utility class would be static 
obviously.  In general, Multisets *aren't* sorted, so it seems inappropriate to 
require interface implementers to make them so.  And I don't see it as API 
feature *everyone* using the class would want (though obviously, I think enough 
people would want it to warrant inclusion in the utility class).

For a class like what Kevin mentions re a Multiset sorted by element frequency, 
specific instance methods might be appropriate - but I'd rather they weren't 
and such a class instead simply followed a NavigableSet style API, where the 
"comparator" compares by element count.

Original comment by blank...@gmail.com on 8 Aug 2010 at 6:17

GoogleCodeExporter commented 9 years ago
I think this would be _very_ useful.  I use the mostcommon() function from this 
bag class for Python very often:  http://code.activestate.com/recipes/259174/

Original comment by rottenwi...@gmail.com on 24 Sep 2010 at 7:52

GoogleCodeExporter commented 9 years ago

Original comment by fry@google.com on 28 Jan 2011 at 4:06

GoogleCodeExporter commented 9 years ago
I'm very very interested in such a LeaderboardMultiset. If such exists in some 
svn or elsewhere, I'd be very keen to experiment and play with it and provide 
suggestions... 

I've got a custom impl. of something similar to the static methods; it is using 
a TreeSet to copy/store the entries to provide sorted order... And now I'm 
struggling with out of memory errors due to the large volume entries I have in 
my multiset. 

Would the leaderboard implementation provide sorted order of counts without 
keeping two copies around??  I've tried to think how I would implement this, 
but so far I've only come up with solutions that would involve keeping two 
separate collections, one mapping to the count, and another keeping the count 
in sorted order (along with a reference back to the element).

Original comment by hein.mel...@gmail.com on 23 Feb 2011 at 10:53

GoogleCodeExporter commented 9 years ago
Here's an alternative.  If Multiset had a method

    Map<E, Integer> asMap();

Then you could use Multimaps.invertFrom() to construct a new map sorted by 
frequencies in the multiset.

Original comment by fin...@gmail.com on 24 Feb 2011 at 8:51

GoogleCodeExporter commented 9 years ago
Interesting idea. But I guess, my main issue is that keeping two copies of the 
same collection data is difficult due to the large number of entries in my 
multiset, and this approach does not help solve that root problem. 
(invertFrom() makes a copy of every element in the provided Map).

Original comment by hein.mel...@gmail.com on 24 Feb 2011 at 10:28

GoogleCodeExporter commented 9 years ago
You effectively want to maintain the elements sorted in two different orders. 
You *will* need two containers. (But hey, these are merely just pointer 
indexes, not that big to copy).

Original comment by jim.andreou on 25 Feb 2011 at 2:56

GoogleCodeExporter commented 9 years ago
If you only want the top N values you could copy the entries to a 
limited-capacity MinMaxPriorityQueue instead of a Multimap.

Original comment by fin...@gmail.com on 25 Feb 2011 at 1:14

GoogleCodeExporter commented 9 years ago
The Multisets.countFunction() from issue #419 would go a long way toward 
solving this, though the methods seem useful enough to be worth having anyway.

  List<E> topElements = Ordering.natural()
      .onResultOf(Multisets.countFunction(multiset))
      .greatestOf(multiset.elementSet(), k);

Original comment by cgdec...@gmail.com on 14 Apr 2011 at 1:45

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 13 Jul 2011 at 6:18

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 16 Jul 2011 at 8:37

GoogleCodeExporter commented 9 years ago
Done at head; see 
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect
/Multisets.html#copyHighestCountFirst%28com.google.common.collect.Multiset%29

Original comment by wasserman.louis on 2 Oct 2011 at 10:04

GoogleCodeExporter commented 9 years ago
I'm interested in naming suggestions for this method; I find it extraordinarily 
hard to name well.

Original comment by kevinb@google.com on 8 Oct 2011 at 3:01

GoogleCodeExporter commented 9 years ago
Multisets.copyOrderedByFrequency() (shorter, but users might think it's 
actually starting with least frequent elements...)
Multisets.copyOrderedByDecreasingFrequency() (longer, but clearer)
Multisets.immutableCopyOrderedByDecreasingFrequency() (too long IMO...)

I also like finn's asMap() idea, which lets you use the Map<E, Integer> to 
build a Multimap<Integer, E> with the appropriate order.

Another idea would be to expose an "asFrequencyMultimap()" that could be 
ordered from least frequent to most frequent.

Original comment by nev...@gmail.com on 8 Oct 2011 at 4:03

GoogleCodeExporter commented 9 years ago
"count" seems better than frequency - shorter and already used elsewhere in the 
API.  "rank" might also make sense, but it's perhaps muddled relative to the 
formal mathematical definition of rank.  So, using count:

Multisets.countSortedCopy()

I don't think there's a need for a ...Desc() version, since whatever the result 
is should be traversable in reverse.  That is - seems "straightforward" to 
return something like a ImmutableTreeMultiset that offers a descending iterator.

That also provides a nice baseline for, say, .countSortedView(), if modifiable 
view is someday desirable.

Original comment by blank...@gmail.com on 8 Oct 2011 at 11:36

GoogleCodeExporter commented 9 years ago
Good point about "count" being shorter than "frequency" and already used 
elsewhere in the API.

I'd like to change my suggestion to:
Multisets.copyOrderedByCount()
Multisets.copyOrderedByDecreasingCount()

The "decreasing" part is there because, according to the method's javadoc, it 
returns "an ImmutableMultiset whose iteration order is highest count first". I 
thought it would be wise to make this explicit in the method name.

Of course, we could get rid of it and provide a method returning a Multiset 
ordered by *ascending* count instead. This set could potentially be traversed 
in reverse... But I think ascending count is the least useful order, and I'm 
not sure privileging it is the right approach...

Original comment by nev...@gmail.com on 9 Oct 2011 at 5:13

GoogleCodeExporter commented 9 years ago
I'm still pretty attached to the current name, copyHighestCountFirst.  Let me 
restate the arguments for it here:

> Multisets.copyOrderedByFrequency() (shorter, but users might think it's 
actually starting with least frequent elements...)

copyHighestCountFirst is completely unambiguous, while being actually slightly 
shorter.

> Multisets.copyOrderedByCount()
I think this still has the problem that it's not clear which direction counts 
are being ordered in.

Original comment by wasserman.louis on 16 Oct 2011 at 11:35

GoogleCodeExporter commented 9 years ago
> copyHighestCountFirst is completely unambiguous, while being actually 
slightly shorter.
Well, the name only refers to the first element(s). I personally find 
`copyHigh[est]ToLow[est]Counts` to be even more clear.

But then again, I'm unsure about the "copy" prefix and `highestToLowestCounts` 
might do it as well.

Original comment by j...@nwsnet.de on 17 Oct 2011 at 7:40

GoogleCodeExporter commented 9 years ago
Hrrrrm.  That's not how I read it -- I read it as "highest counts come first, 
lower counts come later."

I'm not sure I like "copyHighestToLowestCounts" though, just because it doesn't 
say what "highest to lowest" refers to.

How do you feel about "copyHigherCountsFirst"?

Original comment by wasserman.louis on 17 Oct 2011 at 5:42

GoogleCodeExporter commented 9 years ago
To take a completely different tack:

Why not have a "ImmutableMultiset<T> Multisets.sortedCopy(Multiset<T> src, 
Comparator<T> order)" method, and a "Ordering<T> 
Multisets.higherCountFirstOrdering(Class<T> cls)" method?

I think it's also reasonable to define that the "natural" ordering of a 
multiset is by count (not the natural ordering of it's contents), such that a 
method

ImmutableMultiset<T> Multisets.sortedCopy(Multiset<T> src)

would behave as

ImmutableMultiset<T> Multisets.sortedCopy(Multiset<T> src, 
Multisets.higherCountFirstOrdering(T.class)).

Provides a lot more flex for only a little more code.

Original comment by blank...@gmail.com on 18 Oct 2011 at 4:09

GoogleCodeExporter commented 9 years ago
This is, I'm afraid, not a well-defined ordering at all.  There is no 
implementation for

Ordering<T> Multisets.higherCountFirstOrdering(Class<T> cls)

Original comment by wasserman.louis on 18 Oct 2011 at 4:35

GoogleCodeExporter commented 9 years ago
Ah, yes - that ordering must be in reference to some particular multiset - I 
was thinking too hopefully that there might be someway to get away from that.  
No problem: Ordering<T> Multisets.higherCountFirstOrdering(final Multiset<T> 
ref), implemented as, essentially:

int compare(T left, T right) {
 return ref.count(one) - ref.count(theOther);
}

Original comment by blank...@gmail.com on 18 Oct 2011 at 5:31

GoogleCodeExporter commented 9 years ago
Except, of course, this ordering isn't compatible with equals().

I've thought about this direction, I don't believe it can be made to work, I'm 
afraid.

Original comment by wasserman.louis on 18 Oct 2011 at 5:45

GoogleCodeExporter commented 9 years ago
Can you elaborate on incompatibility with equals()?  I also spaced on that 
earlier, but:

left == right seems to obviously compare to 0, given the way equality works 
with Multiset.

compare = 0 -> left == right seems to be resolvable by either the T being 
already comparable, or by supplying a second comparator (perhaps a bit clunky).

i.e., 

ImmutableMultiset<T> Multisets.countSortedCopy(Multiset<T extends 
Comparable<T>> src)
ImmutableMultiset<T> Multisets.countSortedCopy(Multiset<T> src, Comparator<T> 
nonCountOrder)

and then for the count ordering

private final Comparator<T> ifZero = //...either natural or supplied 
comparator/ordering
int compare(T left, T right) {
  int comp = ref.count(left) - ref.count(right);
  return comp !=0 ? comp : ifZero.compare(left, right);
}

Original comment by blank...@gmail.com on 18 Oct 2011 at 6:21

GoogleCodeExporter commented 9 years ago
or maybe those signatures need to be applied to the method that returns the 
Ordering, but the same idea applies: require either a natural or supplied 
comparator to handle the count(A) == count(B) && A != B case to maintain 
compatibility with equals().

Original comment by blank...@gmail.com on 18 Oct 2011 at 6:25

GoogleCodeExporter commented 9 years ago
The point is, it's so awkward to make this approach work that it's no longer 
worthwhile.

Original comment by wasserman.louis on 18 Oct 2011 at 6:33

GoogleCodeExporter commented 9 years ago
Has anything like this been considered?

  Multisets.countOrderedCopy(multiset).highestToLowest()
  Multisets.countOrderedCopy(multiset).lowestToHighest()

Good:
- separates "ordered by count" from "in which order?"
- reads fairly nicely

Bad:
- requires adding an interface to support it
- possibly confusing since "countOrderedCopy(multiset)" doesn't clearly 
indicate that another call is required to get the final result

Original comment by cgdec...@gmail.com on 18 Oct 2011 at 6:35

GoogleCodeExporter commented 9 years ago
Hmmmm.

We considered extending the ImmutableMultiset builder to support an option like 
this.  I think we concluded, though, that it wasn't really ImmutableMultiset's 
job to provide count-ordering.

That said, I like this...and I can't 100% recall why we rejected something 
along these lines.

Original comment by wasserman.louis on 18 Oct 2011 at 6:40

GoogleCodeExporter commented 9 years ago
caveat: I haven't hooked these up to tests, this is basically just porting what 
I did to solve the problem back when I proposed it into something like the 
solution that was linked above.  No compilation problems at least, right?

The attached is an implementation that I think is not so awkward as to be 
untenable.

It provides:

descCountOrdering() - doc explicitly states it is inconsistent with equals and 
indicates easy ways to make consistent with equals.  I left in comments the 
options to provide methods to do that (countOrderingWithEquals()) as an 
alternative to exposing some autocast singletons.

I've got some commented out rough implementations for the API for the typical 
use case coverage:

countOrderedCopy(Multiset) - just uses descCountOrdering
countOrderedCopy(Multiset, Comparator) - adds secondary ordering on T, easy to 
use with Ordering.<T>natural() or any custom comparator

Original comment by blank...@gmail.com on 20 Oct 2011 at 6:02

Attachments:

GoogleCodeExporter commented 9 years ago
This is some pretty good code -- you made excellent use of the Ordering 
features -- but I don't think this is the right approach.

First off, we don't want to expose to the user any comparators on multiset 
entries.  Specifically, we don't want to expose countOrdering(), 
descCountOrdering(), or anything of the sort.  These are just a means to an end 
-- sorting multisets by element frequency -- and if we're already exposing 
something like countOrderedCopy(Multiset), there's no reason to additionally 
expose countOrdering() or descCountOrdering().

Secondly, we don't want to expose any new tools that take extra orderings on 
the multiset elements.  The only feature we're planning on offering is 
something that lets you take a multiset and get back a copy with the highest 
counts coming first, with ties broken by whichever elements came first in the 
original multiset.  (If users really want ties to be broken according to some 
Comparator<E>, they can just do an intermediate call to 
ImmutableSortedMultiset.copyOf(multiset, myComparator).)

For reference, please remember that sending us code doesn't make it more likely 
that your feature will be accepted to Guava.  Writing code is maybe 10-20% of 
the work we do for each feature we accept: the real work is discussing, 
arguing, and naming things, and even so we reject most feature requests.  In 
the future, please work out an API with us, and make sure everybody's 
satisfied, before you spend time writing patches.

Original comment by wasserman.louis on 20 Oct 2011 at 6:48

GoogleCodeExporter commented 9 years ago
Ah, I suppose I'm still pining for the TreeMultiset that sorts on count per my 
original request, (or LeaderboardMultiset like in comment #9), hence the 
continuous pestering.

As for the rest, my own solution to the problem I faced a year ago did expose 
those things (and they've found some use) but not exposing them has pros as 
well (e.g., plenty of ways to misuse), and I find actually writing code a 
better way to think and communicate about any API - not from a misunderstanding 
about what gets features into Guava.

Anyway, thanks for the detailed rejection, and I look forward to the removal of 
@Beta, however the method is finally named/signatured!

Original comment by blank...@gmail.com on 20 Oct 2011 at 7:27

GoogleCodeExporter commented 9 years ago

Original comment by fry@google.com on 10 Dec 2011 at 3:45

GoogleCodeExporter commented 9 years ago
Done as of release 11 with copyHighestCountFirst.

Original comment by wasserman.louis on 28 Dec 2011 at 11:54

GoogleCodeExporter commented 9 years ago
This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:15

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:10