nim-lang / RFCs

A repository for your Nim proposals.

MultiTable: thin wrapper around Table with sane API for duplicate keys; deprecate Table.add #200

Closed timotheecour closed 3 years ago

timotheecour commented 4 years ago

Table currently allows duplicate keys, which causes issues: some are fixable, others are harder to fix without affecting code that doesn't care about duplicate keys (the common case is not to have duplicate keys, and that case should be as fast as possible).

proposal

  import tables

  type TableDup[A, B] = object
    timpl: Table[(A,int), B]

  template `[]=`[A,B](t: TableDup[A,B], a: A, b: B) =
    t.timpl[(a,-1)]=b

  template `[]`[A,B](t: TableDup[A,B], a: A): B =
    t.timpl[(a,-1)]

  template len[A,B](t: TableDup[A,B]): int = t.timpl.len

  template delIndex[A,B](t: var TableDup[A,B], a: A, index: int) =
    t.timpl.del((a, index))

  template add[A,B](t: var TableDup[A,B], a: A, b: B): int =
    var a2 = (a, 0)
    while true:
      if a2 in t.timpl:
        a2[1].inc
      else:
        t.timpl[a2] = b
        break
    a2[1]

  iterator pairs[A,B](t: TableDup[A,B]): tuple[key: A, val: B] =
    for k,v in t.timpl:
      yield (k[0], v)

  template dollarImpl(): untyped {.dirty.} = # FACTOR: same as in tableimpl
    if t.len == 0:
      result = "{:}"
    else:
      result = "{"
      for key, val in pairs(t):
        if result.len > 1: result.add(", ")
        result.addQuoted(key)
        result.add(": ")
        result.addQuoted(val)
      result.add("}")

  proc `$`*[A, B](t: TableDup[A, B]): string =
    ## The ``$`` operator for hash tables. Used internally when calling `echo`
    ## on a table.
    dollarImpl()

  proc main()=
    var t: TableDup[float, string]
    # echo t
    t[1.0] = "baz"
    doAssert t[1.0] == "baz"
    doAssert t.add(2.0, "goo0") == 0
    doAssert t.add(2.0, "goo1") == 1
    doAssert t.add(2.0, "goo2") == 2
    t.delIndex(2.0, 1)
    doAssert $t == """{2.0: "goo0", 1.0: "baz", 2.0: "goo2"}"""

  main()

Note

It's a thin wrapper, not a re-implementation, so all the complex logic is reused rather than re-implemented. I know we already have too many Table types, but my chief concern is that those are not mere wrappers around a few basic table types; they are full-blown implementations (with some amount of code reuse), whereas this is just a wrapper.

alternatives considered

IMO this is worse because you pay a price (the seq indirection) for every key, not just for duplicated ones; and there are other problems as well:

  type TableDup2[A,B] =object
    timpl: Table[A, seq[B]]
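To make the trade-off concrete, here is a sketch (mine, not from the RFC) of how the seq-value alternative is typically used with the stdlib's `mgetOrPut`: every key pays for a seq, even keys that are never duplicated.

```nim
import std/tables

var t: Table[float, seq[string]]
t.mgetOrPut(1.0, @[]).add "baz"   # a never-duplicated key still allocates a seq
t.mgetOrPut(2.0, @[]).add "goo0"
t.mgetOrPut(2.0, @[]).add "goo1"
doAssert t[1.0] == @["baz"]
doAssert t[2.0] == @["goo0", "goo1"]
t[2.0].delete(0)                  # deleting one duplicate mutates the seq in place
doAssert t[2.0] == @["goo1"]
```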

refinements

I didn't want to obfuscate the main idea but it's possible to extend to OrderedTable or other types eg:

```nim
var t1: TableDupGeneric[OrderedTable[float, string]]
var t2: TableDupGeneric[Table[float, string]]
var t3: TableDupGeneric[TableRef[float, string]]
```


Or via a flag (I haven't tested either):

```nim
type TableDup3[A, B, ordered: static bool] = object
  when ordered:
    timpl: OrderedTable[A, B]
  else:
    timpl: Table[A, B]
```

links

D20200304T004500

metagn commented 4 years ago

MultiTable might be a better name since that's what I think Apache Commons Collections calls it.

timotheecour commented 4 years ago

> MultiTable might be a better name

sure.

c-blake commented 4 years ago

Performance is linear in the number of duplicates per operation; regular access is constant-time, not linear. "Quadratic" is typically a bit abused here, as it assumes an outer loop, but it could be that exactly one key has any (or many) duplicates. No need to perpetuate the abuse.

STL uses "multiset" for this concept (which I believe has broader mathematical community acceptance). So, "MultiTable" makes sense as a type name, but I don't think we need this.

While I agree it's unusual, I do not have a problem with both sets and tables being their "multi" versions. Instances' history just determines things. Users just have to know not to abuse them with many duplicates. If they don't abuse, there is no problem. You can't really get around people needing to know how to use some things. Hashes are never guaranteed O(1), anyway. Things being "per instance" rather than in the type system definitely helps keep the number of table types down, as you note. There's nothing forcing people to put dups in, most new people don't know they can, and those that do know they can are probably aware of the performance gotcha. (This is actually one of the few ways to abuse B-trees as well, incidentally.) Anyway, I think the "very flexible" philosophy of Nim fits with less restrained data structures.

I think the divergence between the sets & tables APIs in terms of duplicates is unnecessarily limited, though. Adding dup ability to sets would be another way to go there. That's what I'm doing in my own library. I just have an add proc for the set types. The warning I use for overly deep probing also just says "weak hash/too many dups".
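This is not c-blake's library API, but as an illustrative stand-in from the stdlib: `CountTable` already behaves like a multiset over hashable items, tracking a multiplicity per key.

```nim
import std/tables

var ms: CountTable[string]
ms.inc "dup"        # first occurrence
ms.inc "dup"        # a duplicate
ms.inc "unique"
doAssert ms["dup"] == 2        # multiplicity query
doAssert ms["unique"] == 1
doAssert ms["missing"] == 0    # absent keys report count 0
```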

We don't really have a warning convention for the run-time library that I can tell. This cannot be the only example in the stdlib where it is easy to detect misuse/a probable but not necessarily fatal/exception oriented run-time issue. I think the better RFC here would be what this convention should be and to add warnings when abuse happens.

timotheecour commented 4 years ago

on warnings

on continuing to allow duplicate keys in Table

As I mentioned in the top post, supporting duplicate keys in Table results in a half-assed design that impacts performance both for users who don't care about duplicate keys and for users who do. There is no solution I'm aware of that can satisfy both in the same design; if you disagree, file an RFC or show some POC code. For example, you can't get/add/modify a specific duplicate. With MultiTable as introduced in this RFC, all those problems go away.
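A small sketch of the addressing problem with plain Table duplicates (this uses `tables.add`, which has since been deprecated):

```nim
import std/tables

var t: Table[int, string]
t.add(1, "first")
t.add(1, "second")
doAssert t.len == 2
# `[]` returns whichever duplicate the probe finds first; there is no
# way to ask for "the second duplicate" specifically:
doAssert t[1] in ["first", "second"]
t.del(1)            # removes one of the duplicates, not a chosen one
doAssert t.len == 1
```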

wrapper is not re-implementation

My main problem with Nim's Table/HashSet/IntSet/etc. APIs is not the number of distinct table types, but the fact that they are more or less full re-implementations (with some partial code reuse). MultiTable as proposed here is just a thin wrapper, and in fact has a Table as a field, so all the complex aspects (optimizing performance for collisions, handling tombstones, etc.) are already taken care of. IMO other table types should be thin wrappers too; HashSet is a prime candidate, and I believe performance would be at least equal to the current HashSet's. There are other cases as well.
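As a hypothetical sketch of that wrapper idea (`SetWrap` is not a stdlib type, and the stdlib HashSet is not implemented this way): a HashSet-like type can sit on top of Table, reusing its collision handling, with the bool value as a placeholder.

```nim
import std/tables

type SetWrap[A] = object
  timpl: Table[A, bool]   # value is a placeholder; only keys matter

proc incl[A](s: var SetWrap[A], a: A) = s.timpl[a] = true
proc excl[A](s: var SetWrap[A], a: A) = s.timpl.del a
proc contains[A](s: SetWrap[A], a: A): bool = a in s.timpl
proc len[A](s: SetWrap[A]): int = s.timpl.len

var s: SetWrap[string]
s.incl "x"
s.incl "x"              # idempotent, like HashSet.incl
doAssert "x" in s and s.len == 1
s.excl "x"
doAssert "x" notin s
```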

The only justification for a re-implementation is a benchmark showing that the performance of a wrapper is not as good as that of a re-implementation in at least some meaningful cases; I haven't seen such a benchmark.

c-blake commented 4 years ago

I don't really disagree about the warning points (at a high level; some examples I would challenge). The fact remains there will be times when unanticipated inputs create a weird situation, and no one may ever know things were pushed into a "bad regime" without a warning to the end user, who I realize may be different from the code author. Some code author could define a constant hash(x) = 123 (via more complex code that just happens to reduce to that), nullifying any help from anything on the same planet as https://github.com/nim-lang/Nim/pull/13440. Or, as I mentioned, load up a B-tree with a thousand duplicate keys or something. Warnings about non-catastrophic "misuse" just seem like a question that will come up again and again (and were there already an answer, the response to this question might vary).

While I do see the current API as of limited usefulness (allItems, mostly), I really do not see how it needs to impact performance for people who never put duplicate keys in their tables. (Maybe with your particular way to connect the probe sequence to all hash code bits, but certainly not in general, and I'm not even sure for your way.)

I agree the code factoring is a bit messy. For what it's worth, in my "sets are primary, tables in terms of sets" approach there is a similar "down projection" control for keys which might be usable in this way (necessary to implement tables in terms of sets which may constitute another argument in that space of concerns vs. tables with void values).

Once you have a multi-component key like your (A, int), though, one tends to want a pair of "parallel APIs" for full (A, int) and partial (just A) matching (getIndex, etc.). If there are (or might be) many dups, I think it's probably better to take the indirection hit and use seq values.
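As a hypothetical sketch of the "partial match" half of such a parallel API, built on the RFC's (A, int) key encoding: probe successive indexes until the first gap. This assumes indexes are dense from 0 (which the RFC's delIndex can break, so a real API would need more bookkeeping).

```nim
import std/tables
import std/hashes   # tuple keys need hash support at the instantiation site

iterator valuesFor[A, B](t: Table[(A, int), B], a: A): B =
  if (a, -1) in t:        # the slot plain `[]=` uses in the RFC sketch
    yield t[(a, -1)]
  var i = 0
  while (a, i) in t:
    yield t[(a, i)]
    inc i

var t: Table[(float, int), string]
t[(2.0, 0)] = "goo0"
t[(2.0, 1)] = "goo1"
var got: seq[string]
for v in t.valuesFor(2.0): got.add v
doAssert got == @["goo0", "goo1"]
```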

c-blake commented 4 years ago

I tried to make some of my ideas (not entirely about this dups issue, but not entirely not about it, either) much more concrete over at https://github.com/c-blake/adix. (I had actually written most of that code a couple of weeks before I made my comments here.)

Araq commented 3 years ago

We did deprecate Table.add; you can use a seq value if you need this feature, or use one of the external packages that provide it. Closing.