pbeshai / tidy

Tidy up your data with JavaScript, inspired by dplyr and the tidyverse
https://pbeshai.github.io/tidy
MIT License
725 stars 21 forks source link

fix: treat sort comparisons that aren't <,>,or == as 0 #23

Closed veltman closed 3 years ago

veltman commented 3 years ago

arrange() uses d3.ascending and d3.descending to apply each comparator in a sequence.

d3.ascending returns NaN in cases like:

d3.ascending(undefined, undefined)
d3.ascending("x", 5)
d3.ascending(NaN, NaN)

This means that in a flow like:

tidy(data, arrange(["nonexistentkey", "population"]));

it will leave the data in its original order instead of sorting by population because any non-zero, including NaN, stops the comparison chain.

It seems like the expected behavior would be to continue past those comparisons and apply whatever's next in the chain.

pbeshai commented 3 years ago

Thanks Noah, this seems reasonable, but seems to have a side-effect of moving nulls to the top (which is perhaps an improvement over the current behavior which seems to be unpredictable).

For example, with your change:

tidy(
  [
    { str: 'foo', value: null },
    { str: 'foo', value: 9 },
    { str: 'foo', value: 1 },
    { str: 'foo', value: null, o: 1 },
    { str: 'bar', value: 3 },
    { str: 'bar', value: null },
    { str: 'bar', value: 7 },
    { str: 'bar', value: 2 },
  ],
  arrange(['str', desc('doesntexist'), asc('value')]),
)

We get

    ┌─────────┬───────┬───────┬───┐
    │ (index) │  str  │ value │ o │
    ├─────────┼───────┼───────┼───┤
    │    0    │ 'bar' │ null  │   │
    │    1    │ 'bar' │   2   │   │
    │    2    │ 'bar' │   3   │   │
    │    3    │ 'bar' │   7   │   │
    │    4    │ 'foo' │ null  │   │
    │    5    │ 'foo' │ null  │ 1 │
    │    6    │ 'foo' │   1   │   │
    │    7    │ 'foo' │   9   │   │
    └─────────┴───────┴───────┴───┘

vs before your change:

    ┌─────────┬───────┬───────┬───┐
    │ (index) │  str  │ value │ o │
    ├─────────┼───────┼───────┼───┤
    │    0    │ 'bar' │   3   │   │
    │    1    │ 'bar' │ null  │   │
    │    2    │ 'bar' │   7   │   │
    │    3    │ 'bar' │   2   │   │
    │    4    │ 'foo' │ null  │   │
    │    5    │ 'foo' │   9   │   │
    │    6    │ 'foo' │   1   │   │
    │    7    │ 'foo' │ null  │ 1 │
    └─────────┴───────┴───────┴───┘

I think it would be nicer to have nulls at the bottom. I have some old code I borrowed from some react-table examples that we can modify to do this, which produces the output:

    ┌─────────┬───────┬───────┬───┐
    │ (index) │  str  │ value │ o │
    ├─────────┼───────┼───────┼───┤
    │    0    │ 'bar' │   2   │   │
    │    1    │ 'bar' │   3   │   │
    │    2    │ 'bar' │   7   │   │
    │    3    │ 'bar' │ null  │   │
    │    4    │ 'foo' │   1   │   │
    │    5    │ 'foo' │   9   │   │
    │    6    │ 'foo' │ null  │   │
    │    7    │ 'foo' │ null  │ 1 │
    └─────────┴───────┴───────┴───┘

We can configure the order of the empty values to something like NaN, '', null, undefined and get output:

[
  { str: 'foo', value: null },
  { str: 'foo', value: NaN },
  { str: 'foo', value: 9 },
  { str: 'foo', value: 1 },
  { str: 'foo', value: '' },
  { str: 'foo', value: undefined },
  { str: 'foo', value: null, o: 1 },
  { str: 'bar', value: 3 },
  { str: 'bar', value: null },
  { str: 'bar', value: 7 },
  { str: 'bar', value: 2 },
]

    ┌─────────┬───────┬───────┬───┐
    │ (index) │  str  │ value │ o │
    ├─────────┼───────┼───────┼───┤
    │    0    │ 'bar' │   2   │   │
    │    1    │ 'bar' │   3   │   │
    │    2    │ 'bar' │   7   │   │
    │    3    │ 'bar' │ null  │   │
    │    4    │ 'foo' │   1   │   │
    │    5    │ 'foo' │   9   │   │
    │    6    │ 'foo' │  NaN  │   │
    │    7    │ 'foo' │  ''   │   │
    │    8    │ 'foo' │ null  │   │
    │    9    │ 'foo' │ null  │ 1 │
    └─────────┴───────┴───────┴───┘

and for descending value:

    ┌─────────┬───────┬───────┬───┐
    │ (index) │  str  │ value │ o │
    ├─────────┼───────┼───────┼───┤
    │    0    │ 'bar' │   7   │   │
    │    1    │ 'bar' │   3   │   │
    │    2    │ 'bar' │   2   │   │
    │    3    │ 'bar' │ null  │   │
    │    4    │ 'foo' │   9   │   │
    │    5    │ 'foo' │   1   │   │
    │    6    │ 'foo' │  NaN  │   │
    │    7    │ 'foo' │  ''   │   │
    │    8    │ 'foo' │ null  │   │
    │    9    │ 'foo' │ null  │ 1 │
    └─────────┴───────┴───────┴───┘

What do you think? I can modify this PR to include it. Maybe emptys at start vs emptys at end should be an option?

veltman commented 3 years ago

Agree that nulls last is more intuitive by default but one wrinkle is that (per your example above) I'd generally want nulls last whether the sort is asc or desc, and that conflicts with my sense that sort("x") should consistently produce the reverse of sort(desc("x")).

I'd also vote for not lumping empty string in with NaN/null/undefined.

I can think of a variety of possibilities, some of which could complement each other:

  1. asc and desc both automatically sort null/NaN/undefined last (your example)
  2. asc automatically sorts null/NaN/undefined last, desc automatically sorts them first (or vice versa)
  3. separate emptyLast() and emptyFirst() fns, so you can do sort(emptyLast("x"),"x")
  4. separate emptyLast() and emptyFirst() fns that wrap the underlying sort, so you can do sort(emptyLast("x")) or sort(emptyLast(desc("x")))
  5. Introduce an asc() function so that people can always pass a second options arg to control the behavior: asc("x", { sortEmpty: "first" })
  6. Make everyone use fixedOrder in a convoluted way to sort their nulls last

I think #1 is the behavior I'd actually want in my usage, but the other options maybe feel more consistent and explicit - what do you think?

veltman commented 3 years ago

Update: I have learned that there is already a secret undocumented asc function so #5 is halfway done!

pbeshai commented 3 years ago

Ok we can exclude '' from the "empty" list.

I agree it is weird that desc doesn't do the opposite of asc, but I basically always want it to put nulls at the bottom in practice.

As you've noted, we do already have both asc and desc... so I suggest we go with my preferred magic nulls-at-bottom sort, not including ''. Then if we get enough requests, we can add options to asc/desc that allow configuring alternative empty handling.

pbeshai commented 3 years ago

Up in v2.0.8