RFC: Procs for searching in sets and tables by "compatible" key

zah commented 6 years ago

Consider the existence of types like the following:

type ComplexObject = ref object
  id: UniqueId
  # many other fields that are expensive to setup

Instances of such types are often added to sets (or they can be used as keys in tables). To achieve this, you just need to implement hashing and equality comparison in the following way:

proc hash*(o: ComplexObject): int = hash(o.id)
proc `==`(lhs, rhs: ComplexObject): bool = lhs.id == rhs.id

But this leaves one problem. When you search the set or the table for an existing key, you must allocate and construct a full instance of ComplexObject which may be expensive or non-trivial. It would be nice if one is able to search the set directly for an object with a specific id without allocating any memory.

A possible solution:

1. Add a new mixed-in proc called keyOf that would be implemented as identity by default, but would allow the user to provide the following override:

template keyOf*(o: ComplexObject): UniqueId = o.id

The implementation in tables.nim and sets.nim will then use this call prior to hashing and checking the equality of the elements in the hash table.

2. Change the signature of procs such as get, containts, etc, to the following:

proc contains[T](s: HashSet[T], key: T or type(keyOf(default(T)))): bool =

.. or just provide alternative names for the lookups using the alternative key type.

dom96 commented 6 years ago

Is it not possible for you to instead use the UniqueId directly as the key for a Table, or to store it in a set?

zah commented 6 years ago

In my particular case, the UniqueId value is rather large and I don't want to have multiple copies of it.

lightness1024 commented 6 years ago

I think C++ solved this by using perfect forwarding. Which allows duck typing to operate at the level of the minimum necessary set of procedures. (hash and ==) if you have ==(alternative, key) for example, perfect forwarding is enough. So that's: proc contains[T, U](s: HashSet[T], key: U): bool

zah commented 6 years ago

I have more supporting cases for this proposal:

In our codebase, it's quite common to create a table using a seq or a string as key and then to create large number of intermediate procs that work with openarrays until they get to the point of querying the table.

At the moment, you have to pay the high price of converting the openarray back to a sequence just to query the table, but with the suggested functionality here, it would be possible to say that the keyOf for sequences and strings is defined as the openarray obtained from them.

andreaferretti commented 6 years ago

Seems a rather contrived use case

zah commented 6 years ago

We are heavy users of openarray and for good reasons. Keys are sometimes embedded in network packets (which we can slice), in serialized on-disk formats (again, supporting slicing) and sometimes by just treating the bytes of a cryptographic hash result as a key. The example is not that contrived at all.

The "perfect forwarding" alternative suggested above could also work, but it has some downsides, such as the fact that you have to define operators such as ComplexObject == UniqueId, which may not be always appropriate.

Araq commented 6 years ago

Please explain why you can't use Table[UniqueId, X] instead.

zah commented 6 years ago

I've already explained that.

My original problem was that UniqueId was a rather large value and I didn't want to have multiple copies of it laying around in memory (one stored inside the table key slot and one stored inside the object).

But even if I accept the copies, the second problem I brought today is still there. openarrays and seq and string keys are not compatible and you must allocate a copy of the key just so you can make a query.

My proposal solves both problems in a simple way.

Araq commented 6 years ago

No you didn't explain it well enough. If you have a mapping from UniqueId to ComplexObject there is no need to also have UniqueId part of ComplexObject and that's why this problem comes up rarely. And just fyi, I actually like your proposal quite a bit, but it needs to be justified well.

zah commented 6 years ago

The pattern of storing an object identifier as a field inside a larger object is very frequent. In many problem domains, entities carrying IDs are passed around to functions that need to access the ID (in order to sent it over the network, or store it in a file format, etc). There are separate less frequently used lookup tables that are used to locate objects.

Think about it, isn't the compiler itself full of such entities? We are just lucky to use cheap numeric IDs that reduce the cost of the problem. In my particular original case, the size of the ID was exceeding the size of the additional payload and there was a very large number of entities turning this into a real practical problem.

Araq commented 6 years ago

Alright, assuming a keyOf template, why can't keyOf be used to eliminate these:


proc hash*(o: ComplexObject): int = hash(o.id)
proc `==`(lhs, rhs: ComplexObject): bool = lhs.id == rhs.id

After all, these are directly derived from keyOf.

zah commented 6 years ago

Well, that's exactly the point. I've mentioned the definitions above only in the description of the current situation and its problems. keyOf will eliminate them.

Araq commented 6 years ago

Aha, ok, please make it happen.

zah commented 6 years ago

OK, How should we mark accepted RFCs btw? Can we have a tag for them like ready-for-implementation?

I think there are a lot of people in the community who would enjoy making a contribution to Nim, but may be anxious to submit a pull request that might be rejected. And there are quite a lot of RFCs that are good, but appear stalled, so having this tag and using it may drive more contributions.

data-man commented 6 years ago

I propose to add the labels Rejected and Accepted.

dom96 commented 6 years ago

I still don't understand this proposal :/

But in any case, I'll create an RFC: Accepted label. Rejections can be signalled by closing the issue.

data-man commented 6 years ago

@dom96

Rejections can be signalled by closing the issue.

I think RFC: Rejected is useful to filter existing RFCs (will need to review Issues and assign this label).

Araq commented 6 years ago

You can easily filter on closed issues, @data-man

arnetheduck commented 3 years ago

A common scenario this proposal doesn't take into account is that one might want to have different lookup strategies per table instance - in C++ this is dealt with by passing the comparison / hash operators to the table at construction time - this is very common when working with complex data types that might be indexed in multiple ways - in one module I may want to index by guid and in another by something else. This is generally not a decision tied to the type, but rather to a specific table that indexes a specific type.

boost::multi_index expands on this idea further.

nim-lang / RFCs

RFC: Procs for searching in sets and tables by "compatible" key #33

A possible solution: