Splittable SAS (.sas7bdat) Input Format for Hadoop and Spark SQL
Hive SQL functions not registered when called through sparklyr #39

I'm trying to use Hive date functions in sparklyr::sdf_sql to manipulate some data, but some of them fail with an error saying the function is not registered in the database. This only occurs after spark-sas7bdat is installed on the cluster. Note that I've also filed this issue with sparklyr, as I'm not sure which team would own it. Reproducible example below:

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

>sc <- spark_connect(method="databricks")
#Tables already on the cluster before copy_to:
[1] "netprice_092018"        "netprice_42018"         "test_netprice_external"
[4] "test_table"  

>dat <- data.frame(person=rep(c(1:3),3), measure=rnorm(9))
>dat_sparkly <- copy_to(sc, dat, "dat_sparkly") #Gives error, but "dat_sparkly" is sent to Spark (see next command). Same root cause as other errors below?
Error : org.apache.spark.sql.AnalysisException: Undefined function: 'count'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7 (NOTE: If you wish to use SparkR, import it by calling 'library(SparkR)'.)

#copy_to gives error, however table is correctly sent to Spark:
[1] "dat_sparkly"            "netprice_092018"        "netprice_42018"        
[4] "test_netprice_external" "test_table"

>sdf_sql(sc, "select * from dat_sparkly") #Works
# Source: spark [?? x 2]
  person measure
1      1  -0.354
2      2  -0.197
3      3  -0.747
4      1   0.118
5      2  -0.742
6      3  -0.430
7      1  -2.55 
8      2   0.886
9      3  -0.713

>sdf_sql(sc, "select current_date from dat_sparkly") #Works
# Source: spark [?? x 1]
1 2018-10-12      
2 2018-10-12      
3 2018-10-12      
4 2018-10-12      
5 2018-10-12      
6 2018-10-12      
7 2018-10-12      
8 2018-10-12      
9 2018-10-12 

>sdf_sql(sc, "select date_format(current_date,'E') as week from dat_sparkly") #FAILS

Error : org.apache.spark.sql.AnalysisException: Undefined function: 'date_format'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
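
One way to narrow this down is to ask Spark directly which functions the session can resolve, and compare that against the ones that fail above. The following is only a sketch: `is_registered` and `check_spark_functions` are hypothetical helper names, and it assumes that `DBI::dbGetQuery` accepts the sparklyr connection `sc`, that `SHOW FUNCTIONS` itself still runs in the affected session, and that its first column holds the function names.

```r
# Pure helper: case-insensitive membership check against a vector of names.
is_registered <- function(fn, registered_names) {
  tolower(fn) %in% tolower(registered_names)
}

# Asks Spark which functions the current session can resolve.
# Assumption: `sc` is an open sparklyr connection (as created by
# spark_connect() above), so sparklyr's DBI methods are already loaded.
# Assumption: the first column of SHOW FUNCTIONS holds the function names.
check_spark_functions <- function(sc, fns) {
  registered <- DBI::dbGetQuery(sc, "SHOW FUNCTIONS")[[1]]
  stats::setNames(is_registered(fns, registered), fns)
}

# Usage on the cluster from this report (not run here):
# check_spark_functions(sc, c("count", "current_date", "date_format"))
```

If built-ins such as `count` and `date_format` come back as `FALSE`, that would suggest the session's function registry has been replaced or is incomplete, rather than pointing to a problem with a single query.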

Session info below:

thesuperzapper commented 5 years ago

@JordanCuevas can you confirm that you still have this issue with 2.1?

JordanCuevas commented 5 years ago

Interestingly, after we uninstalled a BigQuery package that was installed on the same cluster, sparklyr has been working as expected, even with spark-sas7bdat still installed.