nim-lang / RFCs

A repository for your Nim proposals.

MultiTable: thin wrapper around Table with sane API for duplicate keys; deprecate Table.add #200

Closed timotheecour closed 3 years ago

timotheecour commented 4 years ago

Table currently allows duplicate keys, which causes issues: some are fixable, others are harder to fix without affecting code that doesn't care about duplicate keys (the common case is not to have duplicate keys, and that case should be as fast as possible).

proposal

  import tables

  type TableDup[A, B] = object
    timpl: Table[(A,int), B]

  template `[]=`[A,B](t: TableDup[A,B], a: A, b: B) =
    t.timpl[(a,-1)]=b

  template `[]`[A,B](t: TableDup[A,B], a: A): B =
    t.timpl[(a,-1)]

  template len[A,B](t: TableDup[A,B]): int = t.timpl.len

  template delIndex[A,B](t: var TableDup[A,B], a: A, index: int) =
    t.timpl.del((a, index))

  template add[A,B](t: var TableDup[A,B], a: A, b: B): int =
    var a2 = (a, 0)
    while true:
      if a2 in t.timpl:
        a2[1].inc
      else:
        t.timpl[a2] = b
        break
    a2[1]

  iterator pairs[A,B](t: TableDup[A,B]): tuple[key: A, val: B] =
    for k,v in t.timpl:
      yield (k[0], v)

  template dollarImpl(): untyped {.dirty.} = # FACTOR: same as in tableimpl
    if t.len == 0:
      result = "{:}"
    else:
      result = "{"
      for key, val in pairs(t):
        if result.len > 1: result.add(", ")
        result.addQuoted(key)
        result.add(": ")
        result.addQuoted(val)
      result.add("}")

  proc `$`*[A, B](t: TableDup[A, B]): string =
    ## The ``$`` operator for hash tables. Used internally when calling `echo`
    ## on a table.
    dollarImpl()

  proc main()=
    var t: TableDup[float, string]
    # echo t
    t[1.0] = "baz"
    doAssert t[1.0] == "baz"
    doAssert t.add(2.0, "goo0") == 0
    doAssert t.add(2.0, "goo1") == 1
    doAssert t.add(2.0, "goo2") == 2
    t.delIndex(2.0, 1)
    doAssert $t == """{2.0: "goo0", 1.0: "baz", 2.0: "goo2"}"""

  main()

Note

It's a thin wrapper, not a re-implementation, so all the complex logic is reused rather than re-implemented. I know we already have too many Table types, but my chief concern is that those are not mere wrappers around a few basic table types; they are full-blown implementations (with some amount of code reuse), whereas this is just a wrapper.

alternatives considered

IMO this is worse because you pay a price (the seq indirection) for every key, not just for duplicated ones; and there are other problems as well:

  type TableDup2[A,B] =object
    timpl: Table[A, seq[B]]
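To make the trade-off concrete, here is a sketch (mine, not from the RFC) of how the seq-value alternative is typically used with the stdlib's `mgetOrPut`: every key pays for a seq, even keys that are never duplicated.

```nim
import std/tables

var t: Table[float, seq[string]]
t.mgetOrPut(1.0, @[]).add "baz"   # a never-duplicated key still allocates a seq
t.mgetOrPut(2.0, @[]).add "goo0"
t.mgetOrPut(2.0, @[]).add "goo1"
doAssert t[1.0] == @["baz"]
doAssert t[2.0] == @["goo0", "goo1"]
t[2.0].delete(0)                  # deleting one duplicate mutates the seq in place
doAssert t[2.0] == @["goo1"]
```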

refinements

I didn't want to obfuscate the main idea but it's possible to extend to OrderedTable or other types eg:

```nim
var t1: TableDupGeneric[OrderedTable[float, string]]
var t2: TableDupGeneric[Table[float, string]]
var t3: TableDupGeneric[TableRef[float, string]]
```


Or via a flag (I haven't tested either):

```nim
type TableDup3[A, B, ordered: static bool] = object
  when ordered:
    timpl: OrderedTable[A, B]
  else:
    timpl: Table[A, B]
```

links

D20200304T004500

metagn commented 4 years ago

MultiTable might be a better name since that's what I think Apache Commons Collections calls it.

timotheecour commented 4 years ago

> MultiTable might be a better name

sure.

c-blake commented 4 years ago

Performance is linear in the number of duplicates per operation; regular access is constant-time, not linear. "Quadratic" is typically a bit abused here, as it assumes an outer loop, but it could be that exactly one key has any (or many) duplicates. No need to perpetuate the abuse.

STL uses "multiset" for this concept (which I believe has broader mathematical community acceptance). So, "MultiTable" makes sense as a type name, but I don't think we need this.

While I agree it's unusual, I do not have a problem with both sets and tables being their "multi" versions. Instances' history just determines things. Users just have to know not to abuse them with many duplicates. If they don't abuse, there is no problem. You can't really get around people needing to know how to use some things. Hashes are never guaranteed O(1), anyway. Things being "per instance" rather than in the type system definitely helps keep the number of table types down, as you note. There's nothing forcing people to put dups in, most new people don't know they can, and those that do know they can are probably aware of the performance gotcha. (This is actually one of the few ways to abuse B-trees as well, incidentally.) Anyway, I think the "very flexible" philosophy of Nim fits with less restrained data structures.

I think the divergence between the sets & tables APIs in terms of duplicates is unnecessarily limited, though. Adding dup ability to sets would be another way to go there. That's what I'm doing in my own library. I just have an add proc for the set types. The warning I use for overly deep probing also just says "weak hash/too many dups".
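This is not c-blake's library API, but as an illustrative stand-in from the stdlib: `CountTable` already behaves like a multiset over hashable items, tracking a multiplicity per key.

```nim
import std/tables

var ms: CountTable[string]
ms.inc "dup"        # first occurrence
ms.inc "dup"        # a duplicate
ms.inc "unique"
doAssert ms["dup"] == 2        # multiplicity query
doAssert ms["unique"] == 1
doAssert ms["missing"] == 0    # absent keys report count 0
```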

We don't really have a warning convention for the run-time library that I can tell. This cannot be the only example in the stdlib where it is easy to detect misuse/a probable but not necessarily fatal/exception oriented run-time issue. I think the better RFC here would be what this convention should be and to add warnings when abuse happens.

timotheecour commented 4 years ago

on warnings

on continuing to allow duplicate keys in Table

As I mentioned in the top post, supporting duplicate keys in Table results in a half-assed design that impacts performance both for users who don't care about duplicate keys and for users who do. There is no solution I'm aware of that can satisfy both in the same design; if you disagree, file an RFC or show some POC code. For example, you can't get/add/modify a specific duplicate. With MultiTable as introduced in this RFC, all those problems go away.
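A small sketch of the addressing problem with plain Table duplicates (this uses `tables.add`, which has since been deprecated):

```nim
import std/tables

var t: Table[int, string]
t.add(1, "first")
t.add(1, "second")
doAssert t.len == 2
# `[]` returns whichever duplicate the probe finds first; there is no
# way to ask for "the second duplicate" specifically:
doAssert t[1] in ["first", "second"]
t.del(1)            # removes one of the duplicates, not a chosen one
doAssert t.len == 1
```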

wrapper is not re-implementation

My main problem with Nim's Table/HashSet/IntSet/etc. APIs is not the number of distinct table types, but the fact that they are more or less full re-implementations (with some partial code reuse). MultiTable as proposed here is just a thin wrapper, and in fact has a Table as a field, so all the complex aspects (optimizing performance for collisions, handling tombstones, etc.) are already taken care of. IMO other table types should be thin wrappers too; HashSet is a prime candidate, and I believe performance would be at least equal to the current HashSet's. There are other cases as well.
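As a hypothetical sketch of that wrapper idea (`SetWrap` is not a stdlib type, and the stdlib HashSet is not implemented this way): a HashSet-like type can sit on top of Table, reusing its collision handling, with the bool value as a placeholder.

```nim
import std/tables

type SetWrap[A] = object
  timpl: Table[A, bool]   # value is a placeholder; only keys matter

proc incl[A](s: var SetWrap[A], a: A) = s.timpl[a] = true
proc excl[A](s: var SetWrap[A], a: A) = s.timpl.del a
proc contains[A](s: SetWrap[A], a: A): bool = a in s.timpl
proc len[A](s: SetWrap[A]): int = s.timpl.len

var s: SetWrap[string]
s.incl "x"
s.incl "x"              # idempotent, like HashSet.incl
doAssert "x" in s and s.len == 1
s.excl "x"
doAssert "x" notin s
```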

The only justification for a re-implementation is a benchmark showing that the performance of a wrapper is not as good as that of a re-implementation in at least some meaningful cases; I haven't seen such a benchmark.

c-blake commented 4 years ago

I don't really disagree about the warning points (at a high level; some examples I would challenge). The fact remains there will be times when unanticipated inputs create a weird situation, and no one may ever know things were pushed into a "bad regime" without a warning to the end user, who I realize may be different from the code author. Some code author could define a constant hash(x) = 123 (via more complex code that just happens to reduce to that), nullifying any help from anything on the same planet as https://github.com/nim-lang/Nim/pull/13440. Or, as I mentioned, load up a B-tree with a thousand duplicate keys or something. Warnings about non-catastrophic "misuse" just seem like a question that will come up again and again (and were there already an answer, the response to this question might vary).

While I do see the current API as of limited usefulness (allItems, mostly), I really do not see how it needs to impact performance for people who never put duplicate keys in their tables. (Maybe with your particular way to connect the probe sequence to all hash code bits, but certainly not in general, and I'm not even sure for your way.)

I agree the code factoring is a bit messy. For what it's worth, in my "sets are primary, tables in terms of sets" approach there is a similar "down projection" control for keys which might be usable in this way (necessary to implement tables in terms of sets which may constitute another argument in that space of concerns vs. tables with void values).

Once you have a multi-component key like your (A, int), though, one tends to want a pair of "parallel APIs" for full (A, int) and partial (just A) matching (getIndex, etc.). If there are (or might be) many dups, I think it's probably better to take the indirection hit and use seq values.
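As a hypothetical sketch of the "partial match" half of such a parallel API, built on the RFC's (A, int) key encoding: probe successive indexes until the first gap. This assumes indexes are dense from 0 (which the RFC's delIndex can break, so a real API would need more bookkeeping).

```nim
import std/tables
import std/hashes   # tuple keys need hash support at the instantiation site

iterator valuesFor[A, B](t: Table[(A, int), B], a: A): B =
  if (a, -1) in t:        # the slot plain `[]=` uses in the RFC sketch
    yield t[(a, -1)]
  var i = 0
  while (a, i) in t:
    yield t[(a, i)]
    inc i

var t: Table[(float, int), string]
t[(2.0, 0)] = "goo0"
t[(2.0, 1)] = "goo1"
var got: seq[string]
for v in t.valuesFor(2.0): got.add v
doAssert got == @["goo0", "goo1"]
```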

c-blake commented 4 years ago

I tried to make some of my ideas (not entirely about this dups issue, but not entirely not about it, either) much more concrete over at https://github.com/c-blake/adix. (I had actually written most of that code a couple of weeks before I made my comments here.)

Araq commented 3 years ago

We did deprecate Table.add; you can use a seq value if you need this feature, or use one of the external packages that provide it. Closing.