sparklyr / sparklyr

R interface for Apache Spark
https://spark.rstudio.com/
Apache License 2.0
949 stars 308 forks source link

possible issue with sdf %>% dplyr::distinct(x) followed by dplyr::arrange(x) in `sparklyr` #2778

Open yitao-li opened 3 years ago

yitao-li commented 3 years ago

appears to be related to changes in dbplyr 2.0 -- dplyr::pull(x) is no longer order-preserving after dplyr::arrange(x)

mzorko commented 3 years ago

Hi, what is the status with this?

library(sparklyr)
library(dplyr)
sc <- sparklyr::spark_connect("local", method = "shell", version = "3.0.0")
iris_tbl <- iris %>% sparklyr::sdf_copy_to(sc, ., sparklyr::random_string())
iris_tbl %>% sparklyr::distinct(Species)
# Source: spark<?> [?? x 1]
  Species   
  <chr>     
1 setosa    
2 versicolor
3 virginica 
Warning messages:
1: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead? 
2: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead? 
3: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead? 
4: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead? 
5: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?

We are getting some weird warnings. Also, this warning is shown everywhere multiple times.

mzorko commented 3 years ago

In addition to last post, this is the query:

image

Note:

also, in addition, functions like sparklyr::fill() are not working with arrange() function (ignoring it).

yitao-li commented 3 years ago

@mzorko I don't think the warning is valid for distinct().

The version of distinct() implemented by sparklyr meets the following requirements for dplyr::distinct() (see `?dplyr::distinct for details):

    • Rows are a subset of the input but appear in the same order.
    • Columns are not modified if ‘...’ is empty or ‘.keep_all’ is
      ‘TRUE’. Otherwise, ‘distinct()’ first calls ‘mutate()’ to
      create new columns.
    • Groups are not modified.
    • Data frame attributes are preserved.

So, it does more than just inserting the distinct keyword into the Spark SQL query, which explains why you saw a moderately complicated query for distinct()

However, in sparklyr 1.6, you can disable the ordering behavior with the following: options(sparklyr.dplyr_distinct.impl = "tbl_lazy"). Doing so will cause the version of distinct() implemented by sparklyr to not stick to the dplyr::distinct() constract and just output unique rows in arbitrary order (which could be OK in many use cases), and also you won't see those warnings once this option is set.

yitao-li commented 3 years ago

Also see https://github.com/tidyverse/dbplyr/issues/611 for discussion re: this warning.

mzorko commented 3 years ago

Thanks a lot for quick answer, will look through that right away.

yitao-li commented 3 years ago

@mzorko Hey thanks for catching the ordering bug in fill.tbl_spark(). I think https://github.com/sparklyr/sparklyr/pull/2989 should fix it.