pbeshai / tidy

Tidy up your data with JavaScript, inspired by dplyr and the tidyverse
https://pbeshai.github.io/tidy
MIT License
725 stars 21 forks source link

Feedback on unexpected behavior in groupBy #34

Closed mhkeller closed 3 years ago

mhkeller commented 3 years ago

Hi, Thanks for making this library. I've wanted it to exist for years. I've been trying to incorporate into some of my projects and I came across something surprising today.

Here's a REPL reproduction: https://svelte.dev/repl/3d3126f8ea994d3d866427cab0642e3b?version=3.38.2

I wanted to group a list of states based on a value, in this case count. I was getting a really weird output, though. After poking around a bit, I discovered that I fixed the problem by setting { addGroupKeys: false } in the group export. The issue seems to be that because my input data is a list of strings, the default behavior of adding a key onto the element is converting it to an object.

I'm not sure what the best solution is for it – maybe a change in the docs, a console warning or possibly a change in the default behavior so that it doesn't mutate the data by default. My expectation was definitely that it wouldn't mutate the original object, for what it's worth.

pbeshai commented 3 years ago

Ah yes this is a bit of a strange situation. tidy has been designed to work primarily on objects and generally it is unexpected for group keys to disappear on them. For example:

tidy([
  { group: 'group1', b: 2 },
  { group: 'group1', b: 3 },
  { group: 'group2', b: 4 },
],
  groupBy(['group'], [
    summarize({ 
      n: n(),
    }),
  ]),
);

// output: 
[{"n": 2, "group": "group1"}, {"n": 1, "group": "group2"}]

vs

tidy([
  { group: 'group1', b: 2 },
  { group: 'group1', b: 3 },
  { group: 'group2', b: 4 },
],
  groupBy(['group'], [
    summarize({ 
      n: n(),
    }),
  ], { addGroupKeys: false }),
);

// output: 
[{"n": 2}, {"n": 1}]

Using on non-object inputs produces these strange results you're seeing, which can be circumvented (as you've discovered) by not adding group keys. I'm not sure there is a reasonable default behavior to do here. I could potentially inspect each object to see if it is an object before adding in the group keys. This may be a bit expensive (probably would need to verify it is not an array too and then what about Map or Set or something else, but maybe they don't matter as much). Might be worth it though!

Another idea may be to not add group keys by default when using a function as the accessor (e.g. d => d.count === 0). Might be a good choice here. What do you think?

Some alternative approaches to what you did that involve first converting to objects (for the most part):

tidy(
  Object.entries(states).map((d) => ({ id: d[0], ...d[1] })),
  groupBy([(d) => d.count === 0], [], groupBy.entries())
);
// output:
[
  [true, [{ id: '01', name: 'Alabama', count: 0, group_0: true }]],
  [
    false,
    [
      { id: '02', name: 'Alaska', count: 1, group_0: false },
      { id: '04', name: 'Arizona', count: 2, group_0: false },
      { id: '05', name: 'Arkansas', count: 20, group_0: false },
    ],
  ],
];
tidy(
  Object.entries(states).map((d) => ({ id: d[0], data: d[1] })),
  groupBy(
    [(d) => d.data.count === 0],
    [],
    groupBy.entries({ addGroupKeys: false })
  )
);
// output:
[
  [true, [{ id: '01', data: { name: 'Alabama', count: 0 } }]],
  [
    false,
    [
      { id: '02', data: { name: 'Alaska', count: 1 } },
      { id: '04', data: { name: 'Arizona', count: 2 } },
      { id: '05', data: { name: 'Arkansas', count: 20 } },
    ],
  ],
];
tidy(
  Object.entries(states),
  groupBy(
    [(d) => d[1].count === 0],
    [],
    groupBy.entries({
      mapLeaves: (leaves) => Object.fromEntries(leaves),
    })
  )
);
// output:
[
  [true, { '01': { name: 'Alabama', count: 0 } }],
  [
    false,
    {
      '02': { name: 'Alaska', count: 1 },
      '04': { name: 'Arizona', count: 2 },
      '05': { name: 'Arkansas', count: 20 },
    },
  ],
];
tidy(
  Object.entries(states),
  groupBy(
    [(d) => d[1].count === 0],
    [],
    groupBy.entries({
      mapLeaves: (leaves) => Object.fromEntries(leaves),
    })
  )
);
// output:
[
  { '01': { name: 'Alabama', count: 0 } },
  {
    '02': { name: 'Alaska', count: 1 },
    '04': { name: 'Arizona', count: 2 },
    '05': { name: 'Arkansas', count: 20 },
  },
];
pbeshai commented 3 years ago

The main problem really is that summarize removes the group keys. Perhaps an alternative is to have addGroupKeys false by default and somehow detect summarize and have it add them back in?

mhkeller commented 3 years ago

Thanks for the quick and thorough response! Could be just a little warning in the docs is the best solution for now such as:

Note: groupBy assumes each item is an objects. If it is another type, such a a String or Array, you'll likely get unexpected behavior. Setting { addGroupKeys: false } may mitigate issues in some scenarios.

That's a bit vague and can be improved.

To understand the problem more generally, do you think that setting { addGroupKeys: false } is an okay solution if one wanted to use groupBy on a list of strings? Or would that likely cause other problems internally?

mhkeller commented 3 years ago

The main problem really is that summarize removes the group keys. Perhaps an alternative is to have addGroupKeys false by default and somehow detect summarize and have it add them back in?

Interesting. That may be related to something I was seeing where the order of the keys was unexpected. I don't have a sanitized example at the moment but let me see if I can put something together.

Edit: Actually, I see the default playground shows the workaround:

// {T.*}    - all Tidy.js functions are available directly and as T.* 
// {input}  - `input` variable matches mtcars or iris based on input selection
// {output} - put your output in the predefined output variable

output = tidy(input,
  groupBy(['cyl', 'gear'], [
    summarize({ 
      n: n(),
      mpg: mean('mpg'), 
      hp: mean('hp'), 
      wt: mean('wt'), 
    }),
  ]),
  select(['cyl', 'gear', everything()]),
  arrange([desc('n'), desc('mpg')])
);

I thought it was surprising that the original columns cyl and gear were at the end of the spreadsheet instead of at the beginning, where they would be easier to read. Not sure if this is the same root cause, though.

pbeshai commented 3 years ago

Thanks yea at the very least I'll try to make the docs more clear. I do think setting is to false is an okay solution and it's what I've done a couple of times as proof of concept when working on flat arrays of values. Each function is completely independent, so I wouldn't expect you to have a cascade of problems... but each function is designed to work with objects. To get around this, you can often pass an accessor that is an identity function e.g. d => d instead of a string key, but your mileage may vary. (e.g., tidy([4,0,5,5,1,4], distinct(d => d)) gives [4,0,5,1]).

I tend to keep things as objects until I'm done wrangling them.

And yes, right now the group keys are appended, which ensures they overwrite an identically named property on the object, but has the side effect of moving them to the end of the object's key list. Perhaps they should be pre-pended which would keep them at the front or in the original location if they were not removed from the object.

FWIW if you're working with just simple values and not doing much wrangling, you may find it convenient enough to just use the d3-array groups or rollups commands. e.g.

d3.groups(Object.entries(states), d => d[1].count === 0)
// output:
[
  [true, [["01", {"name": "Alabama", "count": 0}]]],
  [
    false,
    [
      ["02", {"name": "Alaska", "count": 1}],
      ["04", {"name": "Arizona", "count": 2}],
      ["05", {"name": "Arkansas", "count": 20}]
    ]
  ]
]
ritchieking commented 3 years ago

oh yeah ... interesting. okay here's a thought:

the concept of "adding group keys" is a little weird since the "adding" that generally happens is really just a side effect of the fact that summarize strips them away and they need to be reintroduced (as you point out @pbeshai). From a user's perspective, it looks like they never left.

But accessors are different, and I think it's a good idea to treat them differently. While I would expect keys to be maintained, I would want to opt in to adding the result of an accessor as a new field.

SO what if addGroupKeys went away ... and was replaced with an option called addAccessorKeys which is false by default. Turn it on, and you get the accessor results in your data. Either way, any non-accessor keys are maintained (if you really want to get rid of them, you can always do select(['-groupKey1', '-groupKey2']))

Related to the appending vs prepending question: I've wondered if something to assist with exporting for download could be a good addition. Something like:

const csvData = tidy(input,
  groupBy(['cyl', 'gear'], [
    summarize({ 
      n: n(),
      mpg: mean('mpg'), 
      hp: mean('hp'), 
      wt: mean('wt'), 
    }),
  ]),
  arrange([desc('n'), desc('mpg')]),
  toCSV([
    { 'cyl': 'cylinders'},
    'gear',
    everything(),
    { n: 'total'}
  ])
);

^^ that array would both order and rename columns so you can have easy control over the output.

veltman commented 3 years ago

Maybe this is my d3.nestbrain talking but I think the default behavior I would expect is not to add the group keys back. I feel like once I've nested the objects and started the flow, if I want to transform the leaf array into something fun and cool that loses the keys in the process, that's my business!

Even when I "lose" the key by summarizing I don't really lose it, because I will typically export into my preferred shape with .entries(), .object(), etc. If I really just want an array of summary objects that keeps the keys, I do have options like adding cyl: first('cyl') to the summarize spec.

To me adding the keys should feel more like a type of export you choose, i.e. groupBy.arrayWithKeys(), rather than something that happens always. If I specify nothing I want whatever the output of the last function in the flow is to be the output.

This also avoids accessor vs. string problems, and makes it so summarize() produces the same summary object both inside and outside a groupBy flow. Also, in an example like:

tidy(input,
  groupBy(['cyl'], [
    summarize({ 
      n: n(),
      mpg: mean('mpg'), 
    }),
  ], groupBy.object({ single: true })),
);

I don't want the value objects in my dict to have redundant cyl keys, I already have the keys... as the keys.


In a case where the group keys are going to get added back, I'm team prepend for two reasons:

1) if the object already has a key with the same name at the end of the groupBy flow, I would consider it a bug if my value gets overwritten. 2) Aesthetically groupKeys being first instead of last feels more natural. If I were going to console.table() some grouped data I would want to see the grouping on the left side, not the right. Counterpoint: maybe this is pure LTR provincialism!

pbeshai commented 3 years ago

Thanks all for your input on this. I think this is the course of action I will take:

To add in now

  1. switch to prepend group keys when assigning them back into objects
  2. check whether or not something is an object before trying to spread it when assigning keys back
  3. never assign back in accessor (function) keys. if someone wants them, they can do a mutate before grouping.

For later

These need more discussion and won't happen for a while, maybe never... but I'm thinking:

  1. [breaking] addGroupKeys is always false by default? Note current behavior is to assign group keys after each function in the groupBy flow. Maybe it should only be an option for at the very end and default to false? Default to false for exports but not for the default ungrouped behavior (flat array)?
  2. summarize could take an option to addGroupKeys back in instead of relying on cyl: first('cyl')

Noah kinda convinced me I shouldn't be adding them in by default, but it would be a big breaking change for them to disappear, and I'm not sure it will ever be worth causing that kind of havoc.

pbeshai commented 3 years ago

@mhkeller your svelte repl now... works? ;) v2.4.0 has the fixes described above. I'm going to close this issue. I'll open another one to handle discussion about switching to never adding group keys.

And now you can just write:

tidy(
  Object.keys(states), 
  groupBy(d => states[d].count === 0, groupBy.entries())
);