saulpw / visidata

A terminal spreadsheet multitool for discovering and arranging data
http://visidata.org
GNU General Public License v3.0
7.78k stars 276 forks source link

Merge has surprising result #1843

Closed reagle closed 2 days ago

reagle commented 1 year ago

I'm creating toy examples for an Anki deck and find unexpected behavior for merge, which I thought might mean: "keep all rows from first sheet and matching rows from other sheets, updating the first sheet's data with the first non-False value." (The intro says, "Merges differences from other sheets into first sheet"; the quick reference says, "merge mostly keep all rows from first selected sheet, except prioritise cells with non-null/non-error values.")

ID  Name    Age
1   Alice   25
2   Bob 30
3   Carol   
4   David   40
ID  Name    Age
2       35
3       30
5   Emily   28

After setting ID as the key column, when making the first sheet the uppermost (shift+k) and merging (& merge), I'd expect:

ID  Name    Age
1   Alice   25
2   Bob 35
3   Carol   30
4   David   40

But I get:

ID  Name    Age
1   Alice   25
2   Bob 30
3   Carol   30
4   David   40
5   Emily   28

I'm surprised Emily is included and that Bob's age is not 35.

The text files are in this zip. Using v2.11.

test.zip

reagle commented 1 year ago

BTW: the fact that Bob and Carol's name are not removed is intuitive: I imagine because empty/null value isNull or Falsey. So that's good/expected.

yphillip commented 1 year ago

I'm surprised Emily is included

I was looking at the example data set from the original feature request in #405. It seems the implementation matches the example provided the feature request: rows 8, 10, and 11 are not present in the initial table, are present in the new table, and then are present in the merged table.

I'm curious to hear what you think the correct intended behavior should be for new rows. Or maybe the documentation should be updated to clarify this behavior?

... and that Bob's age is not 35

I agree and also expected that Bob's age to be 35 instead of 30 in the merged table. I would be interested in taking a stab at this bugfix.

saulpw commented 1 year ago

@yphillip If you wanted to take a stab at it, that would be much appreciated! I'd be happy to review at any point.

reagle commented 1 year ago

@yphillip I don't feel strongly what the behavior should be; my concern is that it do what a clear description says it will do. I think it makes sense to follow well-established behaviors/conventions (e.g., from SQL).

reagle commented 10 months ago

I’m glad to see this addressed, but I’m not sure what the result is. Does the documentation need to be updated?

saulpw commented 10 months ago

I'm relying on @yphillip's PR #1923 making a reasonable fix, but I haven't taken a deep look into it. Would you be up for checking out the join-merge test and seeing if it does as you expect now? It looks like the fix was to just reverse the order of columns when calculating the merged values.

yphillip commented 10 months ago

I'm surprised Emily is included and that Bob's age is not 35.

Post-bugfix, the result is that Bob's age is now 35 (as you were expecting) and Emily is included (which is not what you were expecting).

ID  Name    Age
1   Alice   25
2   Bob 35
3   Carol   30
4   David   40
5   Emily   28

Emily is included because it seems like that's the behavior that the original feature request described: https://github.com/saulpw/visidata/issues/405 (the join-merge test case)

Left table does not have rows with colA 8, 10, and 11, Right table does have rows with colA 8, 10, and 11 join-merged table does have rows with colA 8, 10, and 11

To summarize, this is the behavior now:

"keep all rows from first sheet and matching rows from other sheets, updating the first sheet's data with the first non-False value. Also, add new rows from second sheet that were not present in first sheet."

Does the documentation need to be updated?

Probably a good idea so that users know exactly what merge does without having to read these issues and PR conversations :) Will leave it up to you all on how/when to update the documentation.

saulpw commented 10 months ago

@jreagle Would you be willing to update the documentation for this issue? I think you might have some good ideas about how to communicate this, especially since you stumbled on it originally and might remember how you looked for the information.

reagle commented 10 months ago

@saulpw, I was originally pulling from two pieces of documentation:

  1. I believe @anjakefala created/maintains "quick reference"; I thought the doc was maintained in groff, though I see a a file here with a markdown extension and whose content is mostly HTML here. Perhaps they can speak to how to update that doc?
  2. @jsvine maintains the introduction, which says it takes its direction from the quick reference. Perhaps once the quick reference is addressed, they might be willing to give a bit more of a description using the new understanding?
jsvine commented 10 months ago

@reagle: Thanks and yes, I'd be happy to tweak the tutorial's description of the merge-join once the quick reference language is finalized.

saulpw commented 10 months ago

@reagle yes, the quick reference is old-school manpage groff format, though it's such a pain to maintain that we should probably migrate to a simpler if less featureful source format. For now though, if we know what we want it to say, @anjakefala should be able to update the actual source file in short order. So it sounds like we basically just need the wording that should go in the manpage/quickref guide, and then that will trickle down into the tutorial.

anjakefala commented 10 months ago

@reagle Just to add that the .md you found is actually created from the groff file (https://github.com/saulpw/visidata/blob/develop/dev/mkman.sh). I put that together 6 years ago, it was my first major VisiData contribution. If I built it today, I would not use groff, and would have tried to find a way to avoid checking in generated files. =)

But yes! Let me know the text, and I will add it to the manpage.

reagle commented 10 months ago

@anjakefala, @yphillip wrote "keep all rows from first sheet and matching rows from other sheets, updating the first sheet's data with the first non-False value. Also, add new rows from second sheet that were not present in first sheet." I know that's more than twice as long as the current prose, but I'm hesitant to edit for fear of changing it. If you need it shorter, I'd defer to @yphillip .

anjakefala commented 10 months ago

@reagle @jsvine Updated! https://github.com/saulpw/visidata/commit/1599b9193eac4e3c1f3d67e700235b7ca1551f8d

jsvine commented 10 months ago

Thanks! Just to make sure I'm translating correctly for the tutorial: What counts as "False-y"? As I understand it, different languages/contexts define this in different ways, yeah? E.g., is the integer 0 False-y?

saulpw commented 10 months ago

@jsvine It's currently the Python False-y (https://github.com/saulpw/visidata/blob/develop/visidata/features/join.py#L159). We did because we didn't want an empty string to take precedent; this may have been before options.null_value, but that still seems unwieldy for this join. I'm open to discussion, but for now let's document the 0/False/empty string/empty list behavior.

jsvine commented 9 months ago

Thanks, and now updated: https://github.com/jsvine/intro-to-visidata/commit/c32818517c31ade31d05eef116bff4baa96da376

reagle commented 1 month ago

BTW: I noticed in my Anki tutorial that the description did not match the examples: it showed Bob's age was 35, not 30. I've fixed this, but I think the in-program hint still needs help.

current: "merge - merge differences from other sheets into first sheet (including new rows)"

"difference" is too inclusive; I wrestled with something more accurate, and I think this is, though it is also a little longer:

proposed: "merge - all rows from all sheets; where keys match, update first sheet's Falsey values"

saulpw commented 3 days ago

This is a tricky one to explain. The merge join uses the most recent truthy value; if muiltiple sheets are joined, the latest truthy value is used, even if earlier sheets had truthy values. So, more than just 'False-y' values are updated.

I propose "'all rows from all sheets; where keys match, use latest truthy value". What do we think about this?

reagle commented 3 days ago

Very good! BTW: Looking around, it seems that Falsy (without an 'e') is the common spelling.

midichef commented 2 days ago

While we're on the topic of joins, there is another surprising behavior that I'd like to see documented: how key columns are matched for inner/outer joins.

When I first used joins, I expected that key column matches would be based on column names. For example, if a sheet has two key columns:

keyYear | keyMonth || TransactionCount

and I am joining it to another, I thought the join would be made by finding key columns called keyYear and keyMonth..

Instead, it seems like visidata just matches by the key columns by lining them up in order, regardless of name: the 1st key column value in sheet A must match the value under 1st key column in sheet B; the value in the 2nd column in sheet A has to match the value in the 2nd column in sheet B, etc.

I think these comparisons should be explained somewhere more prominently.

In particular, there is a join that I naively expected would work: joining two sheets with key columns that have identical names, but in different order in each sheet.

saulpw commented 2 days ago

I wanted people to be able to join without having to rename columns. We could make it check for same-named columns first, and only fall-back to column order if all names don't match. Does that seem reasonable?

reagle commented 2 days ago

That surprises me too @midichef, but I’ve only done single column joins so didn’t encounter this.