tidyverse / funs

Collection of low-level functions for working with vctrs
Other
34 stars 7 forks source link

Add `.ptype` and `.size` arguments, adjust `NULL` and empty handling #80

Closed DavisVaughan closed 2 years ago

DavisVaughan commented 2 years ago

Closes #48 Closes #64

A few changes to coalesce() as I prep to port this over to dplyr.

For #48, over slack we decided that we'd rather keep coalesce() simple and always use vctrs invariants, so when given data frames as input it should only ever update entirely missing rows (that is what vec_equal_na() thinks is "missing"). If you want to coalesce by column then you should really just map2() or pmap() over the data frames, calling coalesce() as the function to use on each column. I have added a few tests to reflect this expectation.

For #64, I had suggested we switch to coalesce(x, ...) and always cast to the type of x and recycle to the size of x. I did some research, and the SQL standards state that coalesce() should cast to the common type of all of its inputs. See this link for one example where that is documented https://docs.microsoft.com/en-us/sql/t-sql/language-elements/coalesce-transact-sql?view=sql-server-ver15#return-types. So I think we should keep that behavior by default. It sort of makes sense if you have >=2 full vectors and you don't really have a "primary" input, like coalesce(x, y, z). My original use case was something like coalesce(x, 0), where you'd want to retain the type of x.

hadley commented 2 years ago

If coalesce() errors with 0 inputs, wouldn't you expect coalesce(NULL) to error? I think our general philosophy is that foo(x, NULL,NULL) should be equivalent to foo(x, ,), which is generally equivalent to foo(). I'd prefer to stick to that principle here, if we can.

DavisVaughan commented 2 years ago

If coalesce() errors with 0 inputs, wouldn't you expect coalesce(NULL) to error?

They both error, just with different errors:

coalesce()
#> Error in `coalesce()`:
#> ! `...` must contain at least one input.

coalesce(NULL)
#> Error in `coalesce()`:
#> ! `..1` must be a vector, not NULL.

I think our general philosophy is that foo(x, NULL,NULL) should be equivalent to foo(x, ,)

Yea this is what I was alluding to at the end of the second bullet point: "An alternative to this is to drop all NULL inputs at the beginning". I can do that, that would make coalesce(NULL) fail "automatically" because there would be 0 inputs after removing the nulls.

Are you ok with coalesce(NULL, 1:5) returning 1:5? I guess that is the main question here. (Seems consistent with the idea that NULL inputs are basically equivalent to not providing anything, as you suggested)

DavisVaughan commented 2 years ago

Okay we now stick to this principle:

philosophy is that foo(x, NULL,NULL) should be equivalent to foo(x, ,), which is generally equivalent to foo(x)

Giving us:

coalesce()
#> Error in `coalesce()`:
#> ! `...` must contain at least one input.

coalesce(NULL)
#> Error in `coalesce()`:
#> ! `...` must contain at least one input.

coalesce(c(1, NA), NULL, 2)
#> [1] 1 2

I am fairly certain this is what you wanted, so I'll merge this. Feel free to follow up if not.

hadley commented 2 years ago

Yeah, that's what I was thinking, thanks!

But reading through #64 again, I do find your arguments there compelling, even if it would make coalesce() less consistent with SQL. It's hard to imagine a case where you'd want to recycle the first argument to match the length of the second, and then you might as well apply that principle to the types as well. (And I think the only place where that makes a significant difference is whether coalesce(c(NA, 1L), 1)) returns an integer or double)

DavisVaughan commented 2 years ago

It does mean you can't splice in all of the inputs at once anymore

coalesce2 <- function(x, ...) {
  funs::coalesce(x, ...)
}

vecs <- list(
  c(1, 2, NA, NA, 5),
  c(NA, NA, 3, 4, 5)
)

coalesce2(!!!vecs)
#> Error in !vecs: invalid argument type

funs::coalesce(!!!vecs)
#> [1] 1 2 3 4 5

I think the main question is deciding which of these two is the main use case:

The first favors a signature of coalesce(x, ...) and the second favors coalesce(...).


The other way it is described by most of the SQL docs I found is as syntactic sugar for case_when(), which is why it takes the common type in SQL:

COALESCE(expression1, expression2, ..., expressionN)
CASE  
WHEN (expression1 IS NOT NULL) THEN expression1  
WHEN (expression2 IS NOT NULL) THEN expression2  
...  
ELSE expressionN  
END  

https://docs.microsoft.com/en-us/sql/t-sql/language-elements/coalesce-transact-sql?view=sql-server-ver15#comparing-coalesce-and-case

DavisVaughan commented 2 years ago

Assuming you have GitHub code search: https://cs.github.com/?scopeName=All+repos&scope=&q=language%3Ar+coalesce%28%21%21%21

Approximately 20-ish distinct uses of coalesce(!!!<stuff>) that would break with a coalesce(x, ...) signature

hadley commented 2 years ago

Oh I forgot about the !!! case 😞. We do have a hack from aes() that we could apply here, but I don't think it's worth it.

FWIW I do think in coalesce(x, y, z) x is still primary, since that's the vector that you're filling in.

I guess we could still make the first argument of ... primary, even if we don't give it its own argument. But that starts to maybe feel maybe inconsistent? What other functions do we have that are like coalesce()? pmin() and pmax() are similar, which connects us to the set of functions suggested in #5. I guess it's also related to case_when() and if_else(). if_else() uses the size of the first argument, but the common type of all four.

DavisVaughan commented 2 years ago

Base R's pmin() and pmax() pull attributes from the first input, but I don't think we should let that behavior sway us.

I'd argue that ideally pmin(), pmax(), case_when(), and if_else() should all take the common type by default. I do think all 4 should have optional ptype and size arguments to override the default common type/size computation. That was an improvement I was planning on doing when I rewrite case_when() and if_else() with vctrs.

The main difference between those 4 and coalesce() is that coalesce() seems to have a primary input, and those don't.

However, if we stick with computing the common type for the 4 functions mention above, then this comment would indeed feel inconsistent, as you suggested:

make the first argument of ... primary

hadley commented 2 years ago

Ok, that reasoning sounds solid to me.

One more thought: we should probably add a check_dots_unnamed() here.

DavisVaughan commented 2 years ago

Re: check_dots_unnamed()

There are also a number of cases that code search brought up where people are splicing in data frames. Like they used select() to tidyselect some columns and then spliced those in. So this would break that :/

Here is one example: https://cs.github.com/CorrelAid/projectutils/blob/a2984cc11ca1c88aeb3436c8b661f16356b1903d/R/surveymonkey.R#L127

hadley commented 2 years ago

Oh bummer 😞