vigna / fastutil

fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues.
Apache License 2.0
1.8k stars 197 forks source link

Add SortedArrayMaps? #73

Open pburka opened 7 years ago

pburka commented 7 years ago

I have an application where we need many sorted maps of longs to Objects, and we wish to use a very space efficient implementation. We use a custom map backed by a sorted long[] and a corresponding Object[]. This is very similar to fastutil's Long2ObjectArrayMap, except that it provides a NavigableMap interface. Entries are found using a simple binary search. In our expected use case, keys are added to the map in order, so we're not worried about the performance of adding or removing arbitrary keys.

Would it make sense to add this functionality to fastutil? e.g. Long2ObjectSortedArrayMap

vigna commented 7 years ago

That'd be nice. Wanna try a pull request? First we should fastutilize NavigableMap/Set tho (sigh).

pburka commented 7 years ago

Agreed. Without the NavigableMap interface it's not of much use to me. I'll see if I can make some progress on that.

mdekstrand commented 7 years ago

We've got an implementation of this over in LensKit too, with few nice features:

We plan to soon relicense LensKit under the MIT license (just waiting for the last paperwork to clear), at which point our source code will be reusable in fastutil if desired.

pburka commented 7 years ago

I started working on this today by prototyping a NavigableSet template for fastutil. I ran into one complication: a number of NavigableSet methods can return null to indicate absence (e.g. ceiling(e) returns null if there is no element greater than or equal to e).

I believe the full set of affected methods is:

In Maps, you deal with this by adding a defaultReturnValue(). We could do the same thing in NavigableSet, but (a) it means adding additional state to each implementor, and (b) I'm not convinced that it's the best solution.

Since functions like floor() and ceiling() deal with inequalities, a user may, perhaps, prefer to receive MAX_VALUE as the 'absent' signal for one, and MIN_VALUE for the other.

How would you feel about a fastutil NavigableSet which supplemented Long ceiling(Long e) with long ceiling(long e, long ifNone), etc?

/peter

On Wed, Jun 21, 2017 at 10:55 AM Sebastiano Vigna notifications@github.com wrote:

That'd be nice. Wanna try a pull request? First we should fastutilize NavigableMap/Set tho (sigh).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vigna/fastutil/issues/73#issuecomment-310104899, or mute the thread https://github.com/notifications/unsubscribe-auth/ADxoy7_o03NnZdkaQwQfv8KYkmVTPh4jks5sGS7egaJpZM4OBCOq .

vigna commented 7 years ago

I fully understand your problem. A default return value is sensible for a non-sorted operation such as contains(), but it's utterly counterintuitive when order is involved. I think your proposal is reasonable, but let's season it for a couple of days. I'll think about it.

pburka commented 7 years ago

I've prototyped NavigableSet, NavigableMap, and some helpers in NavigableSets in a fork of the repository at https://github.com/pburka/fastutil/tree/master/ https://github.com/pburka/fastutil/tree/master/drv

I chose to append 'OrDefault' to each of the APIs which could return null, as this seemed consistent with the new Java 8 API, Map.getOrDefault().

Take a look and tell me what you think.

/peter

On Sun, Jun 25, 2017 at 2:36 PM Sebastiano Vigna notifications@github.com wrote:

I fully understand your problem. A default return value is sensible for a non-sorted operation such as contains(), but it's utterly counterintuitive when order is involved. I think your proposal is reasonable, but let's season it for a couple of days. I'll think about it.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vigna/fastutil/issues/73#issuecomment-310919802, or mute the thread https://github.com/notifications/unsubscribe-auth/ADxoy1xqxLBT-ezkAKryfMvE5fcBXYSDks5sHqi5gaJpZM4OBCOq .

vigna commented 7 years ago

On 29 Jun 2017, at 02:24, Peter Burka notifications@github.com wrote:

Take a look and tell me what you think.

I'm a little behind schedule with this, but I'm working on it :).

Ciao,

                seba
incaseoftrouble commented 7 years ago

What about NoSuchElementException? This would nicely apply to all the relevant methods. For pollFirst and pollLast one can easily check whether the exception would be thrown, for the other methods (ceiling, etc.) something like hasLower and hasHigher would be necessary. Those can be defaulted to .tailMap(e).isEmpty().

This approach somehow conflicts with the "default return value" idea, but I am unsure which of the two is more desirable for the typical use cases. To be honest I'd even like to have this "no such element"-behavior for normal maps since it makes debugging a lot easier in some cases. With the advent of getOrDefault I'd even go as far as deprecating default return values completely, since the fastutil-get essentially only is a m.getOrDefault(k, m.defaultReturnValue()), but that is a separate issue.

How would you feel about a fastutil NavigableSet which supplemented Long ceiling(Long e) with long ceiling(long e, long ifNone), etc?

I don't think that this would be used too frequently, but I am not entirely sure about that. If you want to check if the returned value is contained in the set you still have to call something like contains(long e) (except there is some domain knowledge).

One more idea would be to borrow from Optional and have something like .applyCeiling(long e, LongConsumer consumer) but I fear this might be more heavy-weight than just checking with contains(long e)

pburka commented 7 years ago

I'm not sold on throwing NoSuchElementException. First, it's inconsistent with the object versions, which return null. Secondly, it's inefficient. Either you need to probe the map twice (once to check for presence, and again to retrieve the element), or you need to handle an exception, which is usually inefficient.

I think it's particularly awkward to throw exceptions for pollFirst() and pollLast(), since those methods exist specifically to avoid the exceptions of first() and last().

I suspect that users do have domain knowledge in most cases, allowing them to choose values which are either known to be out-of-set, or which are appropriate for the none case. For example, it may make sense to call floorOrDefault(key, Long.MIN_VALUE).

I do agree that deprecating the universal default return values might be a good idea. I think that getOrDefault is easier to reason about than a data structure which carries around its own default value.

incaseoftrouble commented 7 years ago

First, it's inconsistent with the object versions, which return null.

Well, depending on how you look at it. ceiling(Long).longValue() also throws an exception when there is no such value. Similarly, the default return value mechanism is inconsistent with the object-versions. It's more about coming up with the most "intuitive" inconsistency.

Secondly, it's inefficient. Either you need to probe the map twice (once to check for presence, and again to retrieve the element), or you need to handle an exception, which is usually inefficient.

I agree on this, exceptions definitely can't be used as control flow (there was some nice article on that somewhere providing some numbers). I assumed a different kind of "domain knowledge" - one may know that, e.g., the ceiling of some value always is present. At least in my case (dealing with normal maps) I always know which keys are present, so a fail-fast get would be nicer.

I propose the following idea: We could just adapt both approaches, as in we have a ceilingChecked(K e) (or a similar name) which throws a NoSuchElementException if there is no ceiling of e and a ceilingOrDefault(K e, K default) which returns default in that case. One could even add a default ceiling(K e) which forwards to ceiling(e, MAX), where MAX is the particular domain's maximal value (which luckily exists for all primitives ...). For floats and doubles using NaN might be an option, too, but that would get awkward I think.

I think it's particularly awkward to throw exceptions for pollFirst() and pollLast(), since those methods exist specifically to avoid the exceptions of first() and last().

Agree - I'm not familiar at all with NavigableSets so I don't have a "feeling" for the use cases. In this case, it might be sensible to just not add a primitive version of this method, since it always will be awkward: Either you know that there is a first element, then you can call the primitive first() or you don't know and calling the pollFirst() simply is the fastest way to get both the information whether the first exists and, if so, which value it is (in a sense, one can think of the returned Long as OptionalLong)

leventov commented 5 years ago

Design and implementation could probably draw a lot from Guava's ImmutableSortedMap (see RegularImmutableSortedSet), which is implemented using a simple sorted array.

vigna commented 5 years ago

But the problem here is mutability. The original post says, indeed, that they are loading the map in order. All strategies I can think for a sorted map based on arrays will have either linear search or linear update time (assuming you won't have space for additional data on top of the array).

leventov commented 5 years ago

I stumbled upon this issue because I have a case where I want to sort a "regular" fastutil's ArrayMap. It would be nice if fastutil's ArrayMaps had methods like sortByKeys() and sortByValues(). Then, there can be an adapter class or method "viewSortedArrayMapAsNavigableMap()" that returns an unmodifiable NavigableMap. If the underlying ArrayMap is updated (e. g. a new entry is added), the view is no longer correct, but that's a responsibility of the user.