jashkenas / underscore

JavaScript's utility _ belt
https://underscorejs.org
MIT License
27.3k stars 5.53k forks source link

Enabled faster algorithm for _.uniq with non-function iteratee #2930

Closed afanasy closed 3 years ago

afanasy commented 3 years ago

The faster algorithm on sorted array will not work for _.uniq with an iteratee, if the iteratee is not a one-to-one function. Because of that it was disabled for all iteratees, even for those producing one-to-one functions, such as String and Array iteratees (handled by property() function). This PR introduces the code that checks the iteratee type and enables faster algorithm if possible.

coveralls commented 3 years ago

Coverage Status

Coverage increased (+0.005%) to 95.223% when pulling 5e5e9b0f8baea87d3b66383430937f224a8400be on afanasy:master into 3d2f17a62b2987c2b7f93880edb0d4e48ce61f10 on jashkenas:master.

afanasy commented 3 years ago

@jgonggrijp Thank you for the detailed and prompt response. I agree that

var faster = isSorted && !isFunction(iteratee);

looks much better. However, I'm failing to understand the main point here - why is it a breaking change and how the _.property iteratees are able to produce not "one-to-one" values, unsuitable for the fast algorithm? Could you provide an example? And just as a side note, introducing Set will "dramatically reduce the performance difference between the fast and the slow algorithms" only for smaller arrays. For larger arrays the fast algorithm O(n) will still be much faster than the slow one O(n^2).

jgonggrijp commented 3 years ago

Thank you for sharing your doubts.

(...) I'm failing to understand the main point here - why is it a breaking change and how the _.property iteratees are able to produce not "one-to-one" values, unsuitable for the fast algorithm? Could you provide an example?

I think this part of the unittests illustrates the one-to-one issue quite well:

https://github.com/jashkenas/underscore/blob/3d2f17a62b2987c2b7f93880edb0d4e48ce61f10/test/arrays.js#L169-L171

The discrepancy arises because the criterion the array was originally sorted by (magnitude) is not the same criterion as the array is being uniqued by (magnitude of the square). You can get into a similar situation with _.property iteratees as follows. Let's say the array was sorted by the rank property and is now being uniqued by the flavor property:

var theArray = [{rank: 1, flavor: 'vanilla'}, {rank: 2, flavor: 'strawberry'}, {rank: 3, flavor: 'vanilla'}];
var result = _.uniq(theArray, true, 'flavor');

My "acceptable change" number 3 above rests on the semantic change of expecting users to only pass isSorted === true when the sort criterion and the unique criterion are the same. Your confusion may stem from the assumption that this is already the case, but it isn't.

And just as a side note, introducing Set will "dramatically reduce the performance difference between the fast and the slow algorithms" only for smaller arrays. For larger arrays the fast algorithm O(n) will still be much faster than the slow one O(n^2).

The standard dictates that Set must provide a faster-than-linear lookup mechanism. Generally, that's either O(k) with small k for hash tables or O(log n) for binary search trees. The slow algorithm then becomes O(nk) or O(n log n) overall, which is not quite as fast as the fast algorithm yet, but way faster than the current slow algorithm, especially for large arrays.

afanasy commented 3 years ago

@jgonggrijp This all makes sense, yes, I assumed users know what they are doing, when passing the isSorted === true, but that may not always be the case. And you are right about Set, it should provide the boost from using internal uniqueness check. Thank you for taking the time to clarify my misunderstanding.