jashkenas / underscore

JavaScript's utility _ belt
https://underscorejs.org
MIT License
27.33k stars 5.53k forks source link

Use hash for _.uniq #156

Closed TrevorBurnham closed 13 years ago

TrevorBurnham commented 13 years ago

The performance of _.uniq is unacceptable for large n. For instance,

_ = require 'underscore'
arr = (x % 500000 for x in [1..Math.pow(10, 6)])
console.log arr.length
console.log arr.length is _.uniq(arr).length * 2 # should be true

will run for hours and hours without stopping. It needn't be so; as dvv suggested at #153, a hash can be used instead. The tradeoff is increased memory usage, but in practice, this appears to be outweighed by the performance gains.

Here's an implementation in CoffeeScript:

_.uniq = (array) ->
  result = []
  hash = {}
  for elem in array
    if elem of hash
      if elem not in hash[elem]
        result.push elem
        hash[elem].push elem
    else
      result.push elem
      hash[elem] = [elem]
  result

The hash is simply indexed on the stringified version of elements; each value of the hash is an array for disambiguation (e.g. _.uniq ['1', 1] is still a no-op). This implementation of _.uniq takes care of the above example in about a second rather than hours/days.

Note that the sorted flag is no longer necessary (though I suppose you could use it to reduce memory usage for large, pre-sorted arrays).

dvv commented 13 years ago

Very well. Hope this'll bubble to the master someday

dvv commented 13 years ago

Wonder if we could replace dumb stringification with, say, msgpack "images" of values? This should preserve types thus defeating ambiguities, and should work on hash of any values, i suppose.

Notice, that we need only coder part of any advanced stringification technique, which is generally much faster than decoder

albemuth commented 13 years ago

This looks like a good opportunity to add an optional iterator function to make _.uniq work with objects or other cases just like Ruby's array.uniq. Expanding on TrevorBurnham's implementation, you'd have:

_.uniq = (array, filter) ->
  filter ||= (element) ->
    element 
  result = []
  hash = {}
  for elem in array
    filtered = filter(elem)
    if filtered of hash
      if filtered not in hash[filtered]
        result.push elem
        hash[filtered].push filtered
    else
      result.push elem
      hash[filtered] = [filtered]
  result

arr = [{name:'moe'}, {name:'curly'}, {name:'larry'}, {name:'curly'}]
alert JSON.stringify(_.uniq(arr, (e) -> e.name))
# => [{"name":"moe"},{"name":"curly"},{"name":"larry"}]
jashkenas commented 13 years ago

I'm afraid that hashing just won't cut it for this -- blindly coercing to a string representation only works for a few datatypes ... and there are many objects with distinct values that have the same representation as a string.

I'd be glad to take a patch for a better uniq, but it has to be a more correct implementation, and come with tests.

TrevorBurnham commented 13 years ago

Hmm, you got me to thinking about how this implementation can be improved, given the limitations of JavaScript. Other languages have features—Java's hashCode, Ruby's hash—that allow objects to be uniquely identified (to a high degree of probability) regardless of their contents. JavaScript doesn't; it just has toString. So while the implementation I originally proposed would work fine for a situation like

_.uniq (i for i in [1..Math.pow(10, 6)])

, it wouldn't be able to handle a case like

_.uniq ({} for i in [1..Math.pow(10, 6)])

which is actually a fairly realistic scenario—you could want to verify the uniqueness of a large number of objects that lack distinct toString representations.

So what's the solution? Tag objects that you've seen already. For primitives, use a hash. (You could either have a separate hash for each primitive type, or use a single string-indexed hash where each entry is an array; either approach is practical, since there are only 1-2 primitives that map to any given string, e.g. 1 and '1' or true and 'true'.)

Here's the code:

_.uniq = (array) ->
  result = []
  hash = {}

  # tag all objects in the array with a key that none of them already have
  uniqKey = Math.random()
  for elem in array
    if elem instanceof Object and uniqKey of elem # very unlikely!
      return _.uniq array

  for elem in array
    if elem instanceof Object
      # tag each object as we reach it
      continue if uniqKey of elem
      elem[uniqKey] = 1
      result.push elem
    else
      # store primitives in hash
      elemStr = '' + elem
      if elemStr of hash
        continue if elem in hash[elemStr]
        hash[elemStr].push elem
        result.push elem
      else
        hash[elemStr] = [elem]
        result.push elem

  # restore the objects to their original states
  for elem in array
    delete elem[uniqKey] if elem instanceof Object

  return result

The implementation is slightly more complicated than the existing implementation or the one I previously proposed. However, I believe that this is the only possible approach that's practical for every possible array with length n >= 10^6.

It passes all of the existing tests for _.uniq, as well as each of these (which should be added to the suite):

list = [0, '0']
equals JSON.stringify(_.uniq(list)), '[0,"0"]', 'can distinguish numbers from strings'
list = ['1', {toString: -> '1'}, {toString: -> '1'}, '1']
equals JSON.stringify(_.uniq(list)), '["1",{},{}]', 'can distinguish identical objects'

You could add _.uniq (i for i in [1..Math.pow(10, 5)]) and _.uniq ({} for i in [1..Math.pow(10, 5)]) to the test suite as well, using a timer to ensure that each runs in less than 200 ms; the current _.uniq implementation takes several minutes.

NHQ commented 13 years ago

Instead of changing _.uniq(), why not add a new utility similar to it, but that works for most objects?