Open yitao-li opened 3 years ago
Hi, what is the status with this?
library(sparklyr)
library(dplyr)
sc <- sparklyr::spark_connect("local", method = "shell", version = "3.0.0")
iris_tbl <- iris %>% sparklyr::sdf_copy_to(sc, ., sparklyr::random_string())
iris_tbl %>% sparklyr::distinct(Species)
# Source: spark<?> [?? x 1]
Species
<chr>
1 setosa
2 versicolor
3 virginica
Warning messages:
1: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
2: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
3: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
4: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
5: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
We are getting some weird warnings. Also, this warning is shown everywhere multiple times.
In addition to last post, this is the query:
Note:
also, in addition, functions like sparklyr::fill()
are not working with arrange()
function (ignoring it).
@mzorko I don't think the warning is valid for distinct()
.
The version of distinct()
implemented by sparklyr
meets the following requirements for dplyr::distinct()
(see `?dplyr::distinct for details):
• Rows are a subset of the input but appear in the same order.
• Columns are not modified if ‘...’ is empty or ‘.keep_all’ is
‘TRUE’. Otherwise, ‘distinct()’ first calls ‘mutate()’ to
create new columns.
• Groups are not modified.
• Data frame attributes are preserved.
So, it does more than just inserting the distinct
keyword into the Spark SQL query, which explains why you saw a moderately complicated query for distinct()
However, in sparklyr 1.6, you can disable the ordering behavior with the following: options(sparklyr.dplyr_distinct.impl = "tbl_lazy")
. Doing so will cause the version of distinct()
implemented by sparklyr to not stick to the dplyr::distinct()
constract and just output unique rows in arbitrary order (which could be OK in many use cases), and also you won't see those warnings once this option is set.
Also see https://github.com/tidyverse/dbplyr/issues/611 for discussion re: this warning.
Thanks a lot for quick answer, will look through that right away.
@mzorko Hey thanks for catching the ordering bug in fill.tbl_spark()
. I think https://github.com/sparklyr/sparklyr/pull/2989 should fix it.
appears to be related to changes in dbplyr 2.0 --
dplyr::pull(x)
is no longer order-preserving afterdplyr::arrange(x)