tidyverse / dbplyr

Database (DBI) backend for dplyr
https://dbplyr.tidyverse.org

"ORDER BY is ignored in subqueries without LIMIT" is shown too often, and often when it doesn't make sense #611

Closed: nathaneastwood closed this issue 3 years ago

nathaneastwood commented 3 years ago

Ever since upgrading to the latest version of dbplyr, our code output has been riddled with the warning

1: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead? 

A simple, reproducible example would be:

r$> sc <- sparklyr::spark_connect(master = "local")
r$> mtcars_spark <- dplyr::copy_to(dest = sc, df = mtcars)
r$> mtcars_spark %>% dplyr::distinct(am)                                                                                                                        
# Source: spark<?> [?? x 1]
     am
  <dbl>
1     1
2     0
Warning messages:
1: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead? 
2: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead? 
3: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead? 
4: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead? 
5: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?

But we now see these warnings on the order of 50+ per call for most of our functionality. We purposefully avoid arrange() in our code due to old dbplyr behaviour where you couldn't un-arrange, so any code of ours which needs an ordering uses dplyr::mutate(dbplyr::win_over()) and specifies the ordering in there (sketched below). One might think this is the cause of the problem, but it doesn't always make sense to use an ordering there; you might just want to perform some logic over groups, for example. Coupled with the example above, I think this warning may need reviewing.
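A minimal sketch of that win_over() pattern, reusing the mtcars_spark table from above (the ROW_NUMBER() expression and column choices are illustrative, not our production code):

# Ordering is scoped to the window function itself, so no arrange()
# appears anywhere in the pipeline.
mtcars_spark %>%
  dplyr::mutate(
    rn = dbplyr::win_over(
      dbplyr::sql("ROW_NUMBER()"),
      partition = "am",  # compute the logic per group
      order = "mpg"      # ordering lives only inside the OVER clause
    )
  )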

Looking at the query that this generates, it seems a little more complex than is needed. I am not sure about other backends, but in Spark it is enough to do:

sparklyr::sdf_sql(sc = sc, "SELECT DISTINCT(mpg) from mtcars")
yitao-li commented 3 years ago

Sparklyr had to implement some custom query generation for distinct to support the .keep_all parameter and also to conform to the ordering behavior of dplyr::distinct on R dataframes.

If you can have results returned in no particular order and won't need .keep_all = TRUE, then "SELECT DISTINCT(...)" is indeed a simpler alternative.

nathaneastwood commented 3 years ago

Thanks for getting back to me @yitao-li. I am wondering what the best approach for distinct() is moving forwards, thinking about performance. Presumably Spark's query optimiser will do a lot of the work, but just from a general user perspective, seeing the output of mtcars_spark %>% dplyr::distinct() can be quite confusing. I would assume that the order should be specified by the user, and that dbplyr shouldn't be warning me about the order otherwise.

Maybe the sparklyr implementation can include some logic which checks whether the data are ordered, looks at the value of .keep_all, and decides how best to generate the query from there?

Also, putting distinct() to one side, this warning is still popping up a lot in our code since the update. I am not sure why and it is difficult to debug when and where it occurs when working with large data.

yitao-li commented 3 years ago

@nathaneastwood I think the sdf_distinct() interface implemented by Jozef is the perfect solution. It will be shipped as part of sparklyr 1.6.1 (a follow-up release after the 1.6.0 release)

The reason sdf_distinct() makes sense: in sparklyr, the family of functions prefixed with sdf_ generally accesses the Scala Spark DataFrame API directly, as opposed to the dplyr interface, which uses Spark SQL.
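For reference, usage is a one-liner (assuming sparklyr >= 1.6.1):

# Goes through the Scala DataFrame API directly, bypassing dbplyr's
# SQL generation (and hence this warning).
mtcars_spark %>% sparklyr::sdf_distinct()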

Also, dplyr::distinct() for Spark dataframes became a bit more complicated than a simple SQL DISTINCT operation, because the output of dplyr::distinct() must satisfy the following requirements (quoting the dplyr documentation):

    • Rows are a subset of the input but appear in the same order.
    • Columns are not modified if '...' is empty or '.keep_all' is 'TRUE'.
      Otherwise, 'distinct()' first calls 'mutate()' to create new columns.
    • Groups are not modified.
    • Data frame attributes are preserved.

The first two requirements were what made the implementation slightly less simple for Spark dataframes.
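To make the first requirement concrete, here is a rough sketch of the kind of query it forces. This is not the SQL sparklyr actually generates, and the id column standing in for the original row order (and the mtcars_with_id table carrying it) is hypothetical:

sparklyr::sdf_sql(sc = sc, "
  SELECT am
  FROM (
    SELECT am, MIN(id) AS first_seen  -- id: hypothetical row-order column
    FROM mtcars_with_id
    GROUP BY am
  ) t
  ORDER BY first_seen                 -- restore first-occurrence order
")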

nathaneastwood commented 3 years ago

Hi @yitao-li. I understand the difference between the dplyr implementation (tbl_spark) and the Scala methods (sdf_). Personally, I think restricting sparklyr so that it has to match the dplyr API is the wrong way to go; in my opinion it should be performance > features, since that is presumably why people are using Spark. However, I appreciate that a lot of people know and use the dplyr API, and these features are useful. We are using sdf_distinct() internally for now, but one issue is that you lose the ability to see the full SQL output with dbplyr::sql_render(), since it of course utilises the Scala API.
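For example, the dplyr path still supports inspecting the generated query:

# The tbl_spark path renders its SQL for inspection...
mtcars_spark %>% dplyr::distinct(am) %>% dbplyr::sql_render()

# ...whereas the result of sparklyr::sdf_distinct() is produced via the
# Scala API, so there is no equivalent dbplyr-generated query to render.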

yitao-li commented 3 years ago

@nathaneastwood I have implemented an option for disabling the current tbl_spark implementation in https://github.com/sparklyr/sparklyr/pull/2979

Example usage:

options(sparklyr.dplyr_distinct.impl = "tbl_lazy")

sdf %>% dplyr::distinct()  # will simply run select distinct * from <input tbl>
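
One way to confirm the simpler query is being generated (the exact rendering below is an assumption and may differ by version):

sdf %>% dplyr::distinct() %>% dbplyr::sql_render()
# expect something along the lines of: SELECT DISTINCT * FROM `sdf`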
nathaneastwood commented 3 years ago

Thanks a lot @yitao-li, this looks good!