nim-lang / Nim

Nim is a statically typed compiled systems programming language. It combines successful concepts from mature languages like Python, Ada and Modula. Its design focuses on efficiency, expressiveness, and elegance (in that order of priority).
https://nim-lang.org

`==` incorrect for tables with repeated keys #13499

Open timotheecour opened 4 years ago

timotheecour commented 4 years ago

== incorrect for tables with repeated keys

Example

  import tables
  let t = newTable[int, int]()
  t.add(15, 1)
  t.add(15, 1)
  t.add(16, 3)
  echo t

  let t2 = newTable[int, int]()
  t2.add(15, 1)
  t2.add(16, 3)
  t2.add(16, 3)
  echo t2
  echo t == t2

Current Output

{15: 1, 15: 1, 16: 3}
{15: 1, 16: 3, 16: 3}
true

Expected Output

{15: 1, 15: 1, 16: 3}
{15: 1, 16: 3, 16: 3}
false

Possible Solution

Fix this (the current `equalsImpl`) to handle duplicate keys:

  template equalsImpl(s, t: typed) =
    if s.counter == t.counter:
      # different insertion orders mean different 'data' seqs, so we have
      # to use the slow route here:
      for key, val in s:
        if not t.hasKey(key): return false
        if t.getOrDefault(key) != val: return false
      return true

Additional Information

proposal

Handling duplicated keys correctly may make certain operations slower for the common case (no repeated keys). If fixing this issue makes == slower (ditto for other operations?), maybe we should consider adding a runtime flag enableDups = false (not compile time, which adds more complexity), and, for all operations where duplicate keys could be handled, check this flag and throw if needed (e.g. for add). This should not result in a slowdown, as it's a single bool flag that should be easy for branch predictors to handle.
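A minimal sketch of what that flag could look like (hypothetical: modeled here as a wrapper type with an invented FlaggedTable name, since the stdlib Table has no such field):

  import tables

  # Hypothetical wrapper illustrating the proposed runtime flag; the
  # actual proposal would put enableDups inside Table itself.
  type
    FlaggedTable[K, V] = object
      data: Table[K, V]
      enableDups: bool        # name taken from the proposal above

  proc add[K, V](t: var FlaggedTable[K, V]; key: K; val: V) =
    # a single predictable branch, as the proposal argues
    if not t.enableDups and t.data.hasKey(key):
      raise newException(KeyError, "duplicate key with enableDups = false")
    t.data.add(key, val)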

narimiran commented 4 years ago

I'm still of the opinion that tables.add is bad, shouldn't have made it into v1.0, and that 98% of users don't need it (and if they use it, they probably don't know what it really does - it might cause unexpected behaviour).

So in my book this should be a low priority, and if fixing it causes any slowdowns, we shouldn't do it.

andreaferretti commented 4 years ago

I realized just now that tables in Nim support duplicate keys. :-/ Now I wonder why, in the only use case I can think of (HTTP headers), we take the trouble of defining HttpHeaderValues = distinct seq[string] and then adding a converter HttpHeaderValues -> string. It seems that the stdlib implements a mechanism to support repeated keys twice (while somehow pretending that keys are unique for the most common use case).

Araq commented 4 years ago

It's not that bad; the only use case for table.== seems to be inside tests.

c-blake commented 3 years ago

First, I think a better exhibition of this defect is simply:

  import tables   # watch me fail!
  assert {1:2, 1:2, 3:4}.toTable != {1:2, 3:4, 3:4}.toTable

I would also qualify (different from the issue title) that (key,value) pairs need to be repeated, not only keys. I.e., {1:2, 1:0, 3:4}.toTable != {1:2, 3:4, 3:4}.toTable as expected.

Second, another way to go here that does not penalize any other operations is, during/after the equalsImpl membership comparison loop, to build up two CountTables or a local analogue and then compare those, only returning true if the two new CountTables are also the same.
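A minimal sketch of that counting approach (the name equalsWithDups is invented here; it assumes a hash is available for (K, V) pairs, which recent Nim derives from hash(K) and hash(V)):

  import tables, hashes

  # Sketch only: after the cheap membership loop, tally (key, val) pairs
  # on both sides and require the two tallies to match.
  proc equalsWithDups[K, V](s, t: Table[K, V]): bool =
    if s.len != t.len: return false
    for key in s.keys:
      if not t.hasKey(key): return false    # early exit on first mismatch
    var cs = initCountTable[(K, V)]()       # needs hash(K) and hash(V)
    var ct = initCountTable[(K, V)]()
    for key, val in s: cs.inc((key, val))
    for key, val in t: ct.inc((key, val))
    result = cs == ct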

This does A) use a bit more than twice the memory, B) require a hash(V) to be available so that (K,V) can be a key in the CountTables, and C) make == -> true slower for tables without duplicate key,val pairs. This doesn't change the big-O upper bound scaling, but it probably does take 2-3x the work of the current approach that assumes no dup kv-pairs (for true output).

The compiler will inform people when they need to define hash(V) { much as it would tell them they need ==(a,b: V) in the current impl, if that were ever necessary }. So, only A) and C) really matter. A) is kinda "you get what you asked for". I'm not sure if the limited, true-path-only C) is what @narimiran was thinking of when he said "any slowdowns".

I would think that if C) is acceptable then this counting approach is the easy fix here. Though I doubt many know about the dup key (and dup key,val!) powers of Table, it is also probably very rare to compare two tables at all, and rarer still to expect them to be equal, except in tests, as @Araq mentioned. Getting something like this right in tests is good "project optics" for a language advertising "safety". If, OTOH, C) is unacceptable, then I think you should close this issue as "won't fix".

For completeness, I should also mention that we could prefix the final count compare with a commutative hash like sets.hash, but this doesn't address C). It would also be possible to skip the entire fix pathway {eliminating C} without the proposal's enableDups=false, if we kept a counter in each Table of how many duplicate key,val pairs there are, incremented on every dup insert and decremented on every dup delete. Then we could elide the CountTable junk if there were no duplicates. The problem here is that, to avoid C), we'd make every edit a bit slower, which is probably undesirable, and the flag would probably be preferred.

{ Indeed, if we were willing to make every edit a bit slower, a Robin Hood re-org can actually go further and break key ties with <=(a,b: V); then the slots array is uniquely defined by table membership (talk about "ageless"!), dups and all. Then even memcmp should work (for packed slots, anyway) and you aren't going to beat memcmp. That extra per-edit work is not always repaid, though. And that's a whole big re-write. }

hashbackup commented 3 years ago

@c-blake your test trips over an unrelated bug: .toTable() always creates tables without duplicates. Shown with:

  import tables   # watch me fail!
  echo {1:2, 1:2, 3:4}.toTable
  echo {1:2, 3:4, 3:4}.toTable
  assert {1:2, 1:2, 3:4}.toTable != {1:2, 3:4, 3:4}.toTable

  ms:nx jim$ ./tabequal
  {3: 4, 1: 2}
  {3: 4, 1: 2}
  /Users/jim/nx/tabequal.nim(4) tabequal
  /Users/jim/nim/lib/system/assertions.nim(30) failedAssertImpl
  /Users/jim/nim/lib/system/assertions.nim(23) raiseAssert
  /Users/jim/nim/lib/system/fatal.nim(49) sysFatal
  Error: unhandled exception: /Users/jim/nx/tabequal.nim(4, 8) `{1: 2, 1: 2, 3: 4}.toTable != {1: 2, 3: 4, 3: 4}.toTable`  [AssertionDefect]

`.toTable` currently does `for key, val in items(pairs): result[key] = val` but could be changed to `for key, val in items(pairs): result.add(key, val)`.
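For illustration, a dup-preserving variant might look like this (hypothetical name toTableDups; not part of the stdlib):

  import tables

  # Hypothetical: like toTable, but keeps duplicate (key, val) pairs.
  proc toTableDups[K, V](pairs: openArray[(K, V)]): Table[K, V] =
    result = initTable[K, V]()
    for (key, val) in items(pairs):
      result.add(key, val)    # add keeps duplicates; []= would overwrite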

Making that change, it fails because of the original problem:

  ms:nx jim$ ./tabequal
  {3: 4, 1: 2, 1: 2}
  {3: 4, 3: 4, 1: 2}
  /Users/jim/nx/tabequal.nim(4) tabequal
  /Users/jim/nim/lib/system/assertions.nim(30) failedAssertImpl
  /Users/jim/nim/lib/system/assertions.nim(23) raiseAssert
  /Users/jim/nim/lib/system/fatal.nim(49) sysFatal
  Error: unhandled exception: /Users/jim/nx/tabequal.nim(4, 8) `{1: 2, 1: 2, 3: 4}.toTable != {1: 2, 3: 4, 3: 4}.toTable`  [AssertionDefect]

Seems like it would be a good idea to have a setting like t.dups = true/false, with the default being false, which is what most people (in my opinion) expect from hash tables. Or maybe the first time someone calls add, it sets the flag to true, for compatibility.

c-blake commented 3 years ago

@hashbackup - Oops! Good catch. Thanks! I think if we continue to support dup keys then we should change toTable as you mentioned, but that's a super easy patch.

If we wanted any kind of table constructor flags, then we should also be able to pass them to toTable as additional arguments, e.g. {1:2,1:2,3:4}.toTable(dups=false), but right now the only relevant one is size and that's already covered.

Another inconsistency is that while Table is a multitable, HashSet is not a multiset. Could be solved with just a HashSet.add & OrderedSet.add. Seems relevant to the general discussion.

hashbackup commented 3 years ago

One common use of hash tables / sets is to eliminate duplicates from a list. Making the default no dups, like most other languages / hash tables, makes more sense to me.

hashbackup commented 3 years ago

@c-blake Can this be done with one count table each? It seems like there are two different cases:

  1:1 1:1 1:2  vs  1:1 1:2 1:2   -> count values, not keys

  1:1 2:1 2:1  vs  1:1 1:1 2:1   -> count keys, not values

It seems like you'd need one count table for keys and another for values, for both tables, so 4.

c-blake commented 3 years ago

I may not have described it clearly but I was imagining two (one for each table being compared) CountTable[tuple[key: K, val: V]] where we defined an appropriate hash(the tuple) in terms of the already necessary hash(K) and the newly necessary hash(V).
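For example, that pair hash might be defined as follows (a sketch using the !& / !$ mixing operators from hashes; recent Nim already ships a generic tuple hash along these lines):

  import hashes

  # Sketch: combine hash(K) and hash(V) into a hash for the pair, so the
  # pair can serve as a CountTable key.
  proc hash[K, V](p: tuple[key: K, val: V]): Hash =
    var h: Hash = 0
    h = h !& hash(p.key)
    h = h !& hash(p.val)
    result = !$h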

hashbackup commented 3 years ago

Ah, ok. My way is dumb and won't work. :-)

For regular Tables (not OrderedTables) the hash tables are already a seq[tuple[key:K, val: V]]. We could make a copy of each seq, sort it, compare it. OrderedTables could be copied, ot.sort() both, compare the two seqs. Or, build two new seqs that don't have the hcode/linked list junk, sort and compare them. Not sure this would be better than your way, but it's another option.

Edit: they'd have to be rebuilt to remove the hash code (and links).

If you know there are duplicates, the initial membership test can be skipped with either method (I think).
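A sketch of that sort-and-compare idea (the name equalsSorted is invented here; it assumes `<` exists for K and V so tuples order lexicographically):

  import tables, algorithm

  # Sketch: dump both tables to (key, val) seqs without the hcode/link
  # junk, sort, and compare.
  proc equalsSorted[K, V](s, t: Table[K, V]): bool =
    if s.len != t.len: return false
    var a, b: seq[(K, V)]
    for key, val in s: a.add((key, val))
    for key, val in t: b.add((key, val))
    a.sort()    # default sort uses `<`; tuples compare element-wise
    b.sort()
    result = a == b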

c-blake commented 3 years ago

Sorted seqs are another option. My approach uses less memory (e.g. avg num dups = 3, uses 3x less memory) and less "operation" time at scale (linear), but the stdlib merge sort might be sufficiently more cache friendly to use less "wall" time in spite of being O(n*lg n). So, as usual, which is better/faster "just depends". ;-)

As mentioned, the initial membership test makes == potentially much faster for a false result via early exit (distance to first mismatch, as fast as the non-dups case). I thought keeping it "as fast as it ever was except for true results" might appease @narimiran-like opinions, since it is (probably?) rare to expect true results except in tests, which need not be optimized. That said, if true is expected, slowness is already expected (like comparing long strings byte-by-byte). Then again, making the slow worst case 2-3x worse for unique-key tables could bother some.

A whole other possibility, with zero impact on current code and also no need of an instance-level flag, is to provide proc equalsDups[K,V](s, t: Table[K,V]; expect: bool). Then we just document it (& add) to say "client code calling add with maybe-dup pairs and also comparing tables should define ==[K,V](s,t: Table[K,V]) = equalsDups(s,t,expect=true) to both recover correctness and specify output expectation". If the client code cannot statically know whether it expects true or false, then it is just out of luck optimization-wise (often the case!).

We could also have toTable take an allowDups=false so it'd dedup by default but also have an easy way to activate add mode. Of course, code wanting add mode could also define its own toTable -- as well as its own equalsDups or == -- no private field access is needed. This last point potentially reduces this whole issue to a "documentation bug for advanced users".

hashbackup commented 3 years ago

> My approach uses less memory (e.g. avg num dups = 3, uses 3x less memory) [...] the initial membership test makes == potentially much faster (distance to first mismatch, as fast as the non-dups case) for a false result via early exit.

Good points

c-blake commented 3 years ago

The better algorithm here, which I should have thought of immediately, would be to work like the allValues iterator. What I am thinking of is conceptually like:

  for key, val in s:
    for (sv, tv, diffLens) in zip(s.allValues(key), t.allValues(key)):
      if diffLens or sv != tv: return false

but with a zip that takes two iterators instead of the openArrays of sequtils.zip and can indicate when one such iterator is exhausted via the diffLens: bool tuple member -- just so you get my notation. The "real" not conceptual code might A) inline the allValues work for both tables, B) merge the s "lookup"/"iterator start" with the parent loop and also C) use some local mechanism to avoid comparing all the values more than the first time (to avoid quadratic in number of dups costs - worst case a small lg(t.len)-scale HashSet[int] of already cmped slots clearable when an empty slot is found, but there may be something simpler.).
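For concreteness, here is a simplified, runnable take on this idea using the existing allValues iterator (a sketch only: it omits the slot-tracking mechanism from C) above, so duplicate keys get re-checked and the cost is quadratic in the number of dups):

  import tables

  # Sketch: per key, compare the multiset of values under that key.
  proc equalsViaAllValues[K, V](s, t: Table[K, V]): bool =
    if s.len != t.len: return false
    for key in s.keys:
      var sv, tv: seq[V]
      for v in s.allValues(key): sv.add v
      for v in t.allValues(key): tv.add v
      if sv.len != tv.len: return false   # the conceptual diffLens check
      var used = newSeq[bool](tv.len)     # match values as a multiset
      for x in sv:
        var found = false
        for i, y in tv:
          if not used[i] and x == y:
            used[i] = true
            found = true
            break
        if not found: return false
    result = true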

When no duplicate pairs are involved this approach would require no real memory (an empty HashSet[int]) and should be actually a bit faster than the current one since it will be just 1 hash lookup in t not 2 like the current one-for-key and one-for-val impl (yes, the 2nd one is a hot cache execution, but still has to compute hash functions/chase collision clusters). It will also matter more that this approach be in-module since allValues work mucks with .data[].

Now, CountTable has no allValues because it has unique keys by construction. So, we might give it the current equalsImpl and do a new equalsImplDups as part of this fix.

It's a bit of loopy logic to work out the details, but I think a fix along these lines is the way to resolve this issue.

Araq commented 3 years ago

> Another inconsistency is that while Table is a multitable, HashSet is not a multiset. Could be solved with just a HashSet.add & OrderedSet.add. Seems relevant to the general discussion.

There is consensus on phasing out the "multi" aspect of Tables aka deprecating table.add. However, you can always simply use []= instead of add and iirc Python does not support add on dicts anyway so I dunno why newcomers from Python stumble upon it.

hashbackup commented 3 years ago

When I was first looking for the equivalent of a Python dict, Nim's syntax was different enough that I had to read the Tables docs a couple of times to understand how they worked vs Python's dict. For example, map = {} doesn't work, because it needs a type for keys and values. I expected that, but it was unexpected to have to import a module to get dictionaries. And "tables" is a new term for them I hadn't heard except in COBOL, where arrays are called tables.

The first thing that threw me off was that the tables page starts with a lot of examples of how to create populated tables, but no example of how to create an empty table.

Then there was some confusion about whether I had to call initTable. I wanted to set the initial table size (since I read about it), and var map: Table[int, int](100) didn't work. The doc said it is not necessary to call this function explicitly, but actually it is if you want to specify the size.

So I stumbled upon add from reading the tables doc page a few times. If add wasn't there, I would have never known or expected that duplicates could be added to a table.

It's hard to realize how strange new things can seem to newcomers when they are second nature to others who have been using Nim for years. It only takes a few days for them not to be strange, of course, but the initial encounter is usually a little jarring.

c-blake commented 3 years ago

> There is consensus on phasing out the "multi" aspect of Tables

There is? Last I saw, there was a TC RFC that literally no one even engaged with (no emojis even!) besides me and @hlaaftana's name suggestion, where TC was at least partly misarguing the performance implications.

Looking into other channels: as recently as 2019-02-10 12:55:09, @narimiran said on IRC on the topic: "it doesn't matter what somebody thinks, jesus christ. it is part of nim v1.0." Then he reversed himself a year later, on 2020-03-09, in IRC. I guess @Araq made this pronouncement on IRC on 2020-03-09 at 14:47:56: "I consider this solved, we will deprecate the 'add' proc", and you got immediate push back from @disruptek that seemed to simply fade. This consensus was news to me. Decisions/discussion should really be reflected on TC's RFC.

I get that add surprises people. I even said 5.5 years ago that this was an unusual feature and even mentioned 1.0/compatibility concerns. It seemed important to you as a "teachable moment" for wayward developers, but you did raise deprecation as a possibility. Boy, was 1.0 further in the future than we thought, eh? ;-)

In any event, if the plan is now to instead introduce a MultiTable and MultiOrderedTable, then my above comments apply to having a MultiSet and MultiOrderedSet. Therein lies maybe the strongest counterpoint to Multi*: complex class/type diagrams of various sets and tables and what they are all capable of, instead of half as many "superset" types that can do both. If we add Compact variants then we get four more: (Multi|Not)Compact(Set|Table).

This doubling of all related types has not seemed to arise much in prior discussions. It also represents cognitive load on both maintainers and users -- not dissimilar to knowing WTF add does. An obscure/ignored type is pretty similar to an obscure proc add/iterator allValues. If people want a specific ordering among values, then there are also "queuing" discipline options, like delFirst or delLast, add_front|prepend and so on, about where to put the values, as validly raised by TC's RFC (or they can just use Table[K, seq[V]]). I think those ops could also be implemented "inline" in the Table, though, without performance implications for those not using them (but only in-module unless we export .data[]).

A (viable?) alternative to "deprecation & type duplication" might be keeping the superset-capability types with an allowDups: bool flag on instances that controls whether add just raises an exception. If it defaults to true, it is backward compatible, but it could probably also default to false with release notes. Either way, it both "advertises & constrains" the capability better. I think such advertisement has been the main "problem". The words multimap/multiset could also appear in the docs, since this is what the C++ STL/math calls them. If we add queuing order ops we could even add other flags to specify "add defaults to push_front", "del defaults to pop_back", etc. Just an idea to maybe keep type proliferation down/retain backward compat.

narimiran commented 3 years ago

Regarding the confusing tables.add, here's a new PR: https://github.com/nim-lang/Nim/pull/15047 :)

c-blake commented 3 years ago

Anyway, I'm happy to do a PR to fix == (along the lines of my final best idea above), probably next week. Unless, that is, we decide to drop the functionality completely and tell @disruptek and those like him to just use the adix tables (or whatever) instead. If we're dropping it, then it doesn't make sense to fix ==.