Closed jonkeane closed 2 years ago
@jonkeane Is there somewhere specific to document these or is that part of the task?
We could upstream them as issues to duckdb if the problem is duckdb, tho it might take a decent amount of digging to confirm that our dplyr implementations aren't doing something maliciously wrong (though they all work with arrow!)
So the following TPC-H queries fail for DuckDB: 2, 9, 13, 14, 16, 20 and 21.
For everything except query 21, the issue is the use of `grepl`, which works in arrow but not in duckdb (and indeed most DBI backends).
For 21 it is this use of `any` here.
So for this, one fix is an upstream dbplyr/duckdb implementation. Another (shorter term) idea is to modify the valid parameters to exclude running the above queries.
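A minimal sketch of the shorter-term idea, in base R. The `valid_queries()` helper and the `1:22` range are assumptions for illustration (TPC-H defines 22 queries, but this is not how arrowbench declares its valid parameters):

```r
# Query ids reported as failing on DuckDB in this thread
duckdb_failing <- c(2, 9, 13, 14, 16, 20, 21)

# Hypothetical helper: which query ids to run for a given engine
valid_queries <- function(engine, all_queries = 1:22) {
  if (identical(engine, "duckdb")) {
    # Skip the known-failing queries for DuckDB only
    setdiff(all_queries, duckdb_failing)
  } else {
    all_queries
  }
}

duckdb_ids <- valid_queries("duckdb")
arrow_ids <- valid_queries("arrow")
```

Keeping the skip list in one place would make it easy to shrink as upstream translations land.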
Oh, this is fantastic, thanks for all this detail!
I wonder if there's something other than `grepl` we could use that would get translated correctly?
And for the `any`, could we do `sum(...) > 0`? That might work just fine in arrow|dplyr and duckdb too
> And for the `any`, could we do `sum(...) > 0`? That might work just fine in arrow|dplyr and duckdb too
This was pretty easy to solve.
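For reference, a small base R check of the `any` to `sum` rewrite. The flag vectors are made up, and it assumes no `NA` values. The detail to watch is the threshold: `any(x)` matches `sum(x) > 0`, while `sum(x) > 1` would need at least two `TRUE` values.

```r
# Hypothetical per-group logical flags
flags <- list(
  none = c(FALSE, FALSE, FALSE),
  one  = c(FALSE, TRUE, FALSE),
  many = c(TRUE, TRUE, FALSE)
)

# Original aggregation
with_any <- vapply(flags, any, logical(1))

# Rewrites using sum(): TRUE counts as 1, FALSE as 0
with_sum_gt0 <- vapply(flags, function(x) sum(x) > 0, logical(1))
with_sum_gt1 <- vapply(flags, function(x) sum(x) > 1, logical(1))

# Side-by-side comparison; only the "> 0" column agrees with any()
comparison <- data.frame(with_any, with_sum_gt0, with_sum_gt1)
```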
I actually don't think we have anything easy for `grepl` that is available in both R and arrow. I opened an issue in dbplyr but so as not to assume anything there, I wonder if the best path is to skip those benchmarks for now?
The other path that seems clunky is to run either `grepl` or `%like%` depending on the engine. But that would add quite a bit of additional code and potential maintenance.
nods yeah. This might be a bit odd, but one thing we could try temporarily is to add a `grep_func` argument to the query functions https://github.com/ursacomputing/arrowbench/blob/4022c6e37a8aa2eebd579a8d710ee5e05c1c7372/R/tpch-queries.R#L8 and then have that be `grep_func = grepl` for non-duckdb cases and ``grep_func = `%like%` `` for duckdb and see if that works?
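A rough sketch of what that could look like, in base R so it runs without a database. Everything here is hypothetical: `pick_matcher()` and `brass_parts()` are illustrative names, and the local `%like%` is only a stand-in for SQL `LIKE` (it assumes the pattern uses just the `%` and `_` wildcards). Note that `grepl(pattern, x)` and `x %like% pattern` differ in both argument order and pattern syntax, so the pattern probably has to be swapped along with the function.

```r
# Stand-in for SQL LIKE: map % and _ to glob wildcards, then to a regex
`%like%` <- function(x, pattern) {
  grepl(utils::glob2rx(chartr("%_", "*?", pattern)), x)
}

# Hypothetical helper: both matchers share the signature (x, pattern),
# so the query body does not need to know which engine it targets
pick_matcher <- function(engine) {
  if (identical(engine, "duckdb")) {
    list(fun = `%like%`, pattern = "%BRASS")
  } else {
    list(fun = function(x, pattern) grepl(pattern, x), pattern = "BRASS$")
  }
}

# Hypothetical query fragment: keep values ending in "BRASS"
brass_parts <- function(p_type, engine = "arrow") {
  m <- pick_matcher(engine)
  p_type[m$fun(p_type, m$pattern)]
}

p_type <- c("LARGE BRASS", "SMALL COPPER", "BRASS PLATED TIN")
arrow_result <- brass_parts(p_type, engine = "arrow")
duckdb_result <- brass_parts(p_type, engine = "duckdb")
```

Both calls should select the same rows, which is the property the per-engine switch would need to preserve.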
Ultimately, it's ok if we know the issue is grepl and translating that into duckdb's sql in some way — we can run the SQL-based queries for duckdb, which is better anyway!
The queries all should work (since they use dplyr pipelines), though some of them error:
One can do so with: