kucci / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0
0 stars 0 forks source link

HashMultimap should choose stingier sizing defaults #447

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
HashMultimap creates value collections quite pessimistically by default - 
expecting 8 elements per key. This may increase too much the cost for sparse 
multimaps. While this setting can be defined by the user, we all know that the 
vast majority of people use the defaults without thinking. :) I'd suggest 
starting from a lower point (few doublings of small arrays to reach the current 
default size wherever needed should be cheap enough), even just 4 instead of 8 
(personally I would start even lower). Or one could be more aggressive and 
create special representations for tiny sets (as for example ImmutableSet), but 
that would be a lot of work indeed.

For illustration, I copy from: 
http://code.google.com/p/memory-measurer/wiki/ElementCostInDataStructures

HashMultimap (Worst) -- Objects =  5.00 Refs = 29.00 Primitives = {float=1.0, 
int=5.0}
HashMultimap (Best) -- Objects =  1.00 Refs =  5.00 Primitives = {int=1.0}
HashSet -- Objects =  1.00 Refs =  5.00 Primitives = {int=1.0}

This is the (average) cost per user-entry in a default HashMultimap. In the 
worst case, each entry has a unique key, while in the best case, all entries 
have the same key (the latter has obviously the exact same cost-per-entry as 
adding something to a HashSet). In a realistic scenario, the actual cost falls 
somewhere in between, depending on the sparsity of the multimap.

In the worst case, each entry adds 5 objects (*not* counting the key/value 
objects of the user) in the multimap, 29 references (most of them are due to 
the big HashSet allocated), and several primitives, that should be roughly 5*2 
+ 29 + 6 = 45 words, which seems quite a lot.

Original issue reported on code.google.com by jim.andreou on 11 Oct 2010 at 12:06

GoogleCodeExporter commented 9 years ago
Perhaps with the first value for a key k it should merely put an ImmutableSet 
in its map? Only at the second go up to a HashSet with table size 4 or 8?

Any new thoughts on this, "Jim"? :-)

Original comment by kevinb@google.com on 18 Jul 2011 at 3:44

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 18 Jul 2011 at 4:15

GoogleCodeExporter commented 9 years ago
No new thoughts, and more importantly, no data. Wouldn't it be cool if somehow 
we got such data? E.g. "median size of values", "what % of them are just 
size==1". I don't know how :(. I think someone recently mentioned internally 
that initial values were excessively big. 

What would it take to decide for a lower default size?

Original comment by jim.andreou on 18 Jul 2011 at 6:30

GoogleCodeExporter commented 9 years ago

Original comment by fry@google.com on 28 Jul 2011 at 5:50

GoogleCodeExporter commented 9 years ago

Original comment by fry@google.com on 10 Dec 2011 at 3:51

GoogleCodeExporter commented 9 years ago
We're in a position to get more data on this, no?  Now that we have Dimitris' 
memory-measurement tool?

Original comment by wasserman.louis on 12 Feb 2012 at 5:42

GoogleCodeExporter commented 9 years ago
Not really, nobody would want this tool to run in production code :)

But we could make a reasonable decision without additional data. E.g., see 
what's the relationship between [Linked]HashMap and TreeMap:
HashMap :: Bytes =  32.24
LinkedHashMap :: Bytes =  40.24
TreeMap :: Bytes =  32.00

Wouldn't it be reasonable to expect a similar relationship between 
[Linked]HashMultimap and TreeMultimap? 

HashMultimap_Worst :: Bytes = 192.24
HashMultimap_Best  :: Bytes =  32.24

LinkedHashMultimap_Worst :: Bytes = 328.48
LinkedHashMultimap_Best  :: Bytes =  96.48

TreeMultimap_Worst :: Bytes = 128.00
TreeMultimap_Best  :: Bytes =  32.00

We could just say that "the average case" (most probably untrue) would be this:

HashMultimap :: Bytes = 112
LinkedHashMultimap :: Bytes = 212
TreeMultimap :: Bytes = 80

And that these numbers should follow about the same relation as in the first 
case, e.g. HashMultimap should be 80 bytes, and LinkedHashMultimap shoulb be 
106 bytes (+20% from the tree case).

By "most probably untrue" I mean that this underestimates the true cost if it 
happens that most of multimap instances are sparse (thus the cost would 
approach the worst case, instead of the best). Which is very likely, given that 
multimaps is the type to do basic, casual graph operations, and that most real 
world graphs are sparse.

Original comment by jim.andreou on 13 Feb 2012 at 9:35

GoogleCodeExporter commented 9 years ago
What I meant to suggest was "pick some real-world usage of HashMultimap, 
attempt to reproduce it in a testing environment, measure the used memory with 
the current defaults and with changed defaults, see the extent of the 
improvement."

Original comment by wasserman.louis on 13 Feb 2012 at 4:08

GoogleCodeExporter commented 9 years ago
Too much work just to get a single data point...

Original comment by jim.andreou on 13 Feb 2012 at 5:35

GoogleCodeExporter commented 9 years ago
Did another test, just using the minimum defaults for HashMultimap and 
LinkedHashMultimap (i.e. constructing them with create(1,1)). The worst case 
for HashMultimap is 136 bytes (compared to 192), whereas the worst case for 
LinkedHashMu

HashMultimap_Worst :: Bytes = 136 (instead of 192)
LinkedHashMultimap_Worst :: Bytes = 280 (instead of 328)

Which would bring the """average""" case down to:
HashMultimap_Avg :: Bytes = 84 (instead of 112), almost as good as 
TreeMultimap's 80
LinkedHashMultimap_Avg :: Bytes = 188 (instead of 212)

So, HashMultimap could definitely see some adjusting, though LinkedHashMultimap 
seems like a lost cause. 

Original comment by jim.andreou on 13 Feb 2012 at 11:42

GoogleCodeExporter commented 9 years ago
For 12.0 why don't we do nothing but change the default expectedValuesPerKey 
from 8 to 2?

Original comment by kevinb@google.com on 16 Mar 2012 at 9:40

GoogleCodeExporter commented 9 years ago
We're changing HashMultimap from 8 to 2, and ArrayListMultimap from 10 to 3.

Original comment by kevinb@google.com on 22 Mar 2012 at 5:47

GoogleCodeExporter commented 9 years ago

Original comment by cpov...@google.com on 23 Mar 2012 at 7:26

GoogleCodeExporter commented 9 years ago

Original comment by cpov...@google.com on 23 Mar 2012 at 7:26

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 26 Mar 2012 at 9:40

GoogleCodeExporter commented 9 years ago
This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:15

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:09