tidyverse / dbplyr

Database (DBI) backend for dplyr
https://dbplyr.tidyverse.org
Other
474 stars 173 forks source link

MSSQL slice_sample() translation always returns the same rows #1503

Open thomashulst opened 4 months ago

thomashulst commented 4 months ago

Using slice_sample() on an MSSQL backend always return the same rows:

df <- tbl(con, "mtcars")
df |> slice_sample(n = 3)
#> # Source:   SQL [3 x 12]
#> # Database: Microsoft SQL Server 14.00.2052[***]
#>   car            mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
df |> slice_sample(n = 3)
#> # Source:   SQL [3 x 12]
#> # Database: Microsoft SQL Server 14.00.2052[***]
#>   car            mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
df |> slice_sample(n = 3)
#> # Source:   SQL [3 x 12]
#> # Database: Microsoft SQL Server 14.00.2052[***]
#>   car            mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1

Created on 2024-05-16 with reprex v2.1.0

The issue is with the RAND() function. RAND() on an MSSQL backend will always return the same value for each row in the result set. As such, ordering the rows by the return value of RAND() and selecting the top n rows will just return the data in its initial order:

dbGetQuery(con, "SELECT TOP(3) *, RAND() as rand FROM mtcars")
#>             car  mpg cyl disp  hp drat    wt  qsec vs am gear carb      rand
#> 1     Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.1899416
#> 2 Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.1899416
#> 3    Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.1899416

Created on 2024-05-16 with reprex v2.1.0

An alternative approach to generating random numbers on an MSSQL backend is given here. It works by using NEWID() to return a globally unique identifer for each row, hashing the GUID with CHECKSUM() and using this as a seed for RAND():

dbGetQuery(con, "SELECT TOP(3) *, RAND(CHECKSUM(NEWID())) as rand FROM mtcars")
#>             car  mpg cyl disp  hp drat    wt  qsec vs am gear carb       rand
#> 1     Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.78626466
#> 2 Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.31999640
#> 3    Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.03604049

Created on 2024-05-16 with reprex v2.1.0

Changing the runif() function translation in the MSSQL backend should be straightforward:

      runif = function(n = n(), min = 0, max = 1) {
        sql_runif(RAND(CHECKSUM(NEWID())), n = {{ n }}, min = min, max = max)
      }