Closed reagle closed 2 days ago
BTW: the fact that Bob and Carol's name are not removed is intuitive: I imagine because empty/null value isNull or Falsey. So that's good/expected.
I'm surprised Emily is included
I was looking at the example data set from the original feature request in #405. It seems the implementation matches the example provided the feature request: rows 8, 10, and 11 are not present in the initial table, are present in the new table, and then are present in the merged table.
I'm curious to hear what you think the correct intended behavior should be for new rows. Or maybe the documentation should be updated to clarify this behavior?
... and that Bob's age is not 35
I agree and also expected that Bob's age to be 35 instead of 30 in the merged table. I would be interested in taking a stab at this bugfix.
@yphillip If you wanted to take a stab at it, that would be much appreciated! I'd be happy to review at any point.
@yphillip I don't feel strongly what the behavior should be; my concern is that it do what a clear description says it will do. I think it makes sense to follow well-established behaviors/conventions (e.g., from SQL).
I’m glad to see this addressed, but I’m not sure what the result is. Does the documentation need to be updated?
I'm relying on @yphillip's PR #1923 making a reasonable fix, but I haven't taken a deep look into it. Would you be up for checking out the join-merge test and seeing if it does as you expect now? It looks like the fix was to just reverse the order of columns when calculating the merged values.
I'm surprised Emily is included and that Bob's age is not 35.
Post-bugfix, the result is that Bob's age is now 35 (as you were expecting) and Emily is included (which is not what you were expecting).
ID Name Age
1 Alice 25
2 Bob 35
3 Carol 30
4 David 40
5 Emily 28
Emily is included because it seems like that's the behavior that the original feature request described: https://github.com/saulpw/visidata/issues/405 (the join-merge test case)
Left table does not have rows with colA 8, 10, and 11, Right table does have rows with colA 8, 10, and 11 join-merged table does have rows with colA 8, 10, and 11
To summarize, this is the behavior now:
"keep all rows from first sheet and matching rows from other sheets, updating the first sheet's data with the first non-False value. Also, add new rows from second sheet that were not present in first sheet."
Does the documentation need to be updated?
Probably a good idea so that users know exactly what merge does without having to read these issues and PR conversations :) Will leave it up to you all on how/when to update the documentation.
@jreagle Would you be willing to update the documentation for this issue? I think you might have some good ideas about how to communicate this, especially since you stumbled on it originally and might remember how you looked for the information.
@saulpw, I was originally pulling from two pieces of documentation:
@reagle: Thanks and yes, I'd be happy to tweak the tutorial's description of the merge-join once the quick reference language is finalized.
@reagle yes, the quick reference is old-school manpage groff format, though it's such a pain to maintain that we should probably migrate to a simpler if less featureful source format. For now though, if we know what we want it to say, @anjakefala should be able to update the actual source file in short order. So it sounds like we basically just need the wording that should go in the manpage/quickref guide, and then that will trickle down into the tutorial.
@reagle Just to add that the .md
you found is actually created from the groff file (https://github.com/saulpw/visidata/blob/develop/dev/mkman.sh). I put that together 6 years ago, it was my first major VisiData contribution. If I built it today, I would not use groff, and would have tried to find a way to avoid checking in generated files. =)
But yes! Let me know the text, and I will add it to the manpage.
@anjakefala, @yphillip wrote "keep all rows from first sheet and matching rows from other sheets, updating the first sheet's data with the first non-False value. Also, add new rows from second sheet that were not present in first sheet." I know that's more than twice as long as the current prose, but I'm hesitant to edit for fear of changing it. If you need it shorter, I'd defer to @yphillip .
@reagle @jsvine Updated! https://github.com/saulpw/visidata/commit/1599b9193eac4e3c1f3d67e700235b7ca1551f8d
Thanks! Just to make sure I'm translating correctly for the tutorial: What counts as "False-y"? As I understand it, different languages/contexts define this in different ways, yeah? E.g., is the integer 0
False-y?
@jsvine It's currently the Python False-y (https://github.com/saulpw/visidata/blob/develop/visidata/features/join.py#L159). We did because we didn't want an empty string to take precedent; this may have been before options.null_value
, but that still seems unwieldy for this join. I'm open to discussion, but for now let's document the 0/False/empty string/empty list behavior.
Thanks, and now updated: https://github.com/jsvine/intro-to-visidata/commit/c32818517c31ade31d05eef116bff4baa96da376
BTW: I noticed in my Anki tutorial that the description did not match the examples: it showed Bob's age was 35, not 30. I've fixed this, but I think the in-program hint still needs help.
current: "merge - merge differences from other sheets into first sheet (including new rows)"
"difference" is too inclusive; I wrestled with something more accurate, and I think this is, though it is also a little longer:
proposed: "merge - all rows from all sheets; where keys match, update first sheet's Falsey
values"
This is a tricky one to explain. The merge
join uses the most recent truthy value; if muiltiple sheets are joined, the latest truthy value is used, even if earlier sheets had truthy values. So, more than just 'False-y' values are updated.
I propose "'all rows from all sheets; where keys match, use latest truthy value". What do we think about this?
Very good! BTW: Looking around, it seems that Falsy
(without an 'e') is the common spelling.
While we're on the topic of joins, there is another surprising behavior that I'd like to see documented: how key columns are matched for inner/outer joins.
When I first used joins, I expected that key column matches would be based on column names. For example, if a sheet has two key columns:
keyYear | keyMonth || TransactionCount
and I am joining it to another, I thought the join would be made by finding key columns called keyYear
and keyMonth.
.
Instead, it seems like visidata just matches by the key columns by lining them up in order, regardless of name: the 1st key column value in sheet A must match the value under 1st key column in sheet B; the value in the 2nd column in sheet A has to match the value in the 2nd column in sheet B, etc.
I think these comparisons should be explained somewhere more prominently.
In particular, there is a join that I naively expected would work: joining two sheets with key columns that have identical names, but in different order in each sheet.
I wanted people to be able to join without having to rename columns. We could make it check for same-named columns first, and only fall-back to column order if all names don't match. Does that seem reasonable?
That surprises me too @midichef, but I’ve only done single column joins so didn’t encounter this.
I'm creating toy examples for an Anki deck and find unexpected behavior for merge, which I thought might mean: "keep all rows from first sheet and matching rows from other sheets, updating the first sheet's data with the first non-False value." (The intro says, "Merges differences from other sheets into first sheet"; the quick reference says, "merge mostly keep all rows from first selected sheet, except prioritise cells with non-null/non-error values.")
After setting ID as the key column, when making the first sheet the uppermost (
shift+k
) and merging (& merge
), I'd expect:But I get:
I'm surprised Emily is included and that Bob's age is not 35.
The text files are in this zip. Using v2.11.
test.zip