Feedback on tidyup 2: stringr <-> tidyr realignment

hadley commented 2 years ago

@tidyverse/authors, @AmeliaMN, @rpruim, @hardin47, @rundel, @beanumber, @gagolews, @rjpat, @jminnier:

We’d love to get your thoughts on https://github.com/tidyverse/tidyups/blob/main/002-tidyr-stringr.md, a proposal to align the functions that break apart strings into variables/observations in stringr and tidyr. The goal is to come up with a cohesive family of string manipulation functions that tackle common problems that arise during data tidying. You'd first encounter the high level tidyr functions that work with variables in data frames, and later, when you learn more about programming and string manipulation, you'd learn about the lower level stringr functions that work with character vectors.

Please feel free to contribute however you feel comfortable — you're welcome to make small fixes via PR, comment here, or open bigger discussion topics in an new issue. If there are things you’d prefer to discuss in private, please feel free to email me. I’ll plan to close the discussion on Jan 17, so we can start on implementation early next year.

eutwt commented 2 years ago

Edit: I should have waited a day to think about this, because I'm now convinced I was wrong and the names are good :). Previous comment is below -------- One small change I'd suggest to consider is making the names more explicit. I think "separate at" and "separate by" each apply just as well to "separate at/by" delimiters or "separate at/by" location, so I'd have to check which one is which. The downside is this would require abbreviations (or unreasonably long function names), and would come across less like "regular speech".

For example:

str_separate_loc_[wider/longer] - separate at fixed integer locations ("loc" replaces "at")
str_separate_delim_[wider/longer] - separate at specified delimiters ("delim" replaces "by")

I think it would be pretty easy to choose names that are almost completely unambiguous if you're willing to throw away the constraint that the name sounds like an actual spoken phrase, and I think that tradeoff would be worth it.

In common speech, I'd say I want to "separate by commas" and would of course never literally say "separate delim", so in that sense it's definitely less natural. But the more natural phrasing is only possible because it's part of a sentence/conversation providing additional context which you can't capture in a simple function name.

hadley commented 2 years ago

Worth considering if str_longer_*() all needs some arguments to help with inconsistent lengths: https://github.com/tidyverse/tidyr/issues/1166

Need to consider what the default separator should be: https://github.com/tidyverse/tidyr/issues/1197

DavisVaughan commented 2 years ago

It may be that str_separate_by_longer() doesn't have the inconsistent length arguments, but separate_by_longer() does.

With str_separate_by_wider() and separate_by_wider(), both take 1 vector/column and expand it into a data frame of many columns.

With str_separate_by_longer(), you take 1 vector and expand it long-ways, which never has inconsistent lengths. But with separate_by_longer() you'll probably take multiple columns (like separate-rows), and that is where the inconsistent lengths would come in.

Alternatively, str_separate_by_longer() could have some way to take multiple vectors at once, but that seems difficult to come up with.

AmeliaMN commented 2 years ago

I'm glad you're thinking through this! It relates to my aphorism of "most times I think I want a str_ function I actually want separate."

I've read through the .md and I think I get the broad idea-- there will still be two categories of functions, one that works on vectors (the ones starting with just str_separate_) and one that works on dataframes (starting with separate_). One improvement is having all the ones that work on dataframes start with separate_ instead of having separate(), extract() and separate_rows(). Another improvement is that the ones starting with str_ will... play nicely with things like across()? And the final improvement (as I see it) is you're filling in some gaps where functions should have existed, but didn't.

I find the argument for the "One function for each case" solution compelling. My only hesitation with the table of suggested function names is that I might still start typing separate_ when I really want str_separate_ and vice versa. Obviously, having good documentation and error messages when you mess it up will help a lot. But I wonder if one or the other should not have "separate" in the name at all. I see your point about the other synonyms having somewhat negative connotations, but I think "divide" is pretty neutral, and stringr already has str_split. I assume the stringr functions will be lesser-used than the tidyr functions, so maybe stringr gets the worse name? That would also allow tidyr to continue having functions that start separate.

My final thought is that it would be easier for me to think about each of these functions with an example of each one in standard use (like the examples that would be at the bottom of the function documentation) at the bottom of the md. I assume you're working on these examples somewhere, but early-January-me doesn't have the wherewithal to go find them.

rundel commented 2 years ago

Some quick thoughts in no particular order:

I'm not a huge fan of the group naming in seperate_group_longer based on mostly the same concerns listed in the document. Coming in with the baggage of heavy stringr usage I would prefer something like separate_match for the sake of consistency but then str_match and str_extract has definitely been a point of confusion for myself and students.
My initial reaction is that having three clauses in a function name seems potentially burdensome when working through which variant to use (e.g. all 5 functions show up when using tab completion). If one clause could be dropped I think that the wider / longer variants may not be necessary and could be replaced with function arguments that determine the unnesting. My thinking is based on the following:
- I think the wider case is the most common usecase, followed by accidentally ragged data (due to either data errors or regex errors), and then finally truly ragged data as the least common. Supporting wider as the default with errors or warnings on varying lengths indicating the problematic inputs and advising correcting the data / pattern or using unnest = "longer" potentially.
- I may be an edge case but it would be nice to have the option to create a list column with no unnesting in some cases while not having to resort back to the original stringr functions. This would not seem to fit very well with the proposed functions.
- I have not thought carefully about what the necessary arguments for the proposed functions would be, but it seems like most of the arguments between wider and longer would be shared other than col_names and related for the wider case. To me this means there is potentially less need to separate the functions than in case of pivot_wider and pivot_longer.
Adding stringr as a dependency for the sake of consistency is fine by me - I've spent more time than I would like waiting on stringi to compile but its a relatively fixed cost that package manager makes even lower.

hadley commented 2 years ago

@eutwt ok, great, glad you've come to like the names 😄

@DavisVaughan @rundel Maybe the stringr functions shouldn't have the wider/longer distinction and just return lists of characters vectors? Then it's the job of tidyr to assemble into a data frame by putting each entry in either either a row or a column? Then str_separate_by() would just be str_split() and we'd only need a new function for str_split_by()?

@AmeliaMN I feel like str_separate() already is the worser name, but maybe (based on the previous paragraph) stringr wouldn't actually need any separate functions? You're right that the main benefit of stringr functions is to work nicely with across(), but alternatively we could make the tidyr functions take multiple columns so you didn't need to use across().

@AmeliaMN good point about the lack of examples. I'll add a few imaginary cases when I next update the doc.

rundel commented 2 years ago

@hadley that makes sense to me, since stringr has never explicitly been designed around dealing with data frames it seems fine to not add that functionality. One thing that comes to mind is that the new functions may have some quality of life improvements that might be worth backporting to the stringr functions via optional arguments.

For example with str_match having the ability to exclude the complete match and only return the capture groups would be useful. Similarly having the optional ability to return a list of named vectors or a matrix with column names when using names capture groups would also be occasionally useful.

hadley commented 2 years ago

@rundel I've thought about that too, but any argument would be longer than [-1, ] so it doesn't seem that worthwhile. And your second wish has already been granted in the dev version 😉

rundel commented 2 years ago

@hadley sounds good. For the first example the use case I had in mind was actually for str_match_all since the [-1,] needs to be lapply'd or map'd over the results adding an extra step. I mostly bring it up because it came up a handful of times with advent of code this year and so is fresh in my mind, but then the proposed tidyr functions are the better solution to all of this overall.

hadley commented 2 years ago

You can see a preliminary implementation of the tidyr side at https://github.com/tidyverse/tidyr/pull/1304. The names aren't final, but one thing that @mine-cetinkaya-rundel and I realised when looking at real functions is that we put the wider/longer in the wrong place — the problem decomposes better if we have separate_wider_() for making columns and separate_longer_() for making rows, and then the prefixes describe exactly how we break down the string.

tidyverse / tidyups

Feedback on tidyup 2: stringr <-> tidyr realignment #23