Closed TrevorBurnham closed 13 years ago
Very well. Hope this'll bubble to the master someday
Wonder if we could replace dumb stringification with, say, msgpack "images" of values? This should preserve types thus defeating ambiguities, and should work on hash of any values, i suppose.
Notice, that we need only coder part of any advanced stringification technique, which is generally much faster than decoder
This looks like a good opportunity to add an optional iterator function to make _.uniq
work with objects or other cases just like Ruby's array.uniq.
Expanding on TrevorBurnham's implementation, you'd have:
_.uniq = (array, filter) ->
filter ||= (element) ->
element
result = []
hash = {}
for elem in array
filtered = filter(elem)
if filtered of hash
if filtered not in hash[filtered]
result.push elem
hash[filtered].push filtered
else
result.push elem
hash[filtered] = [filtered]
result
arr = [{name:'moe'}, {name:'curly'}, {name:'larry'}, {name:'curly'}]
alert JSON.stringify(_.uniq(arr, (e) -> e.name))
# => [{"name":"moe"},{"name":"curly"},{"name":"larry"}]
I'm afraid that hashing just won't cut it for this -- blindly coercing to a string representation only works for a few datatypes ... and there are many objects with distinct values that have the same representation as a string.
I'd be glad to take a patch for a better uniq, but it has to be a more correct implementation, and come with tests.
Hmm, you got me to thinking about how this implementation can be improved, given the limitations of JavaScript. Other languages have features—Java's hashCode
, Ruby's hash
—that allow objects to be uniquely identified (to a high degree of probability) regardless of their contents. JavaScript doesn't; it just has toString
. So while the implementation I originally proposed would work fine for a situation like
_.uniq (i for i in [1..Math.pow(10, 6)])
, it wouldn't be able to handle a case like
_.uniq ({} for i in [1..Math.pow(10, 6)])
which is actually a fairly realistic scenario—you could want to verify the uniqueness of a large number of objects that lack distinct toString
representations.
So what's the solution? Tag objects that you've seen already. For primitives, use a hash. (You could either have a separate hash for each primitive type, or use a single string-indexed hash where each entry is an array; either approach is practical, since there are only 1-2 primitives that map to any given string, e.g. 1
and '1'
or true
and 'true'
.)
Here's the code:
_.uniq = (array) ->
result = []
hash = {}
# tag all objects in the array with a key that none of them already have
uniqKey = Math.random()
for elem in array
if elem instanceof Object and uniqKey of elem # very unlikely!
return _.uniq array
for elem in array
if elem instanceof Object
# tag each object as we reach it
continue if uniqKey of elem
elem[uniqKey] = 1
result.push elem
else
# store primitives in hash
elemStr = '' + elem
if elemStr of hash
continue if elem in hash[elemStr]
hash[elemStr].push elem
result.push elem
else
hash[elemStr] = [elem]
result.push elem
# restore the objects to their original states
for elem in array
delete elem[uniqKey] if elem instanceof Object
return result
The implementation is slightly more complicated than the existing implementation or the one I previously proposed. However, I believe that this is the only possible approach that's practical for every possible array with length n >= 10^6
.
It passes all of the existing tests for _.uniq
, as well as each of these (which should be added to the suite):
list = [0, '0']
equals JSON.stringify(_.uniq(list)), '[0,"0"]', 'can distinguish numbers from strings'
list = ['1', {toString: -> '1'}, {toString: -> '1'}, '1']
equals JSON.stringify(_.uniq(list)), '["1",{},{}]', 'can distinguish identical objects'
You could add _.uniq (i for i in [1..Math.pow(10, 5)])
and _.uniq ({} for i in [1..Math.pow(10, 5)])
to the test suite as well, using a timer to ensure that each runs in less than 200 ms; the current _.uniq
implementation takes several minutes.
Instead of changing _.uniq(), why not add a new utility similar to it, but that works for most objects?
The performance of
_.uniq
is unacceptable for largen
. For instance,will run for hours and hours without stopping. It needn't be so; as dvv suggested at #153, a hash can be used instead. The tradeoff is increased memory usage, but in practice, this appears to be outweighed by the performance gains.
Here's an implementation in CoffeeScript:
The hash is simply indexed on the stringified version of elements; each value of the hash is an array for disambiguation (e.g.
_.uniq ['1', 1]
is still a no-op). This implementation of_.uniq
takes care of the above example in about a second rather than hours/days.Note that the
sorted
flag is no longer necessary (though I suppose you could use it to reduce memory usage for large, pre-sorted arrays).