saurfang / spark-sas7bdat

Splittable SAS (.sas7bdat) Input Format for Hadoop and Spark SQL
http://spark-packages.org/package/saurfang/spark-sas7bdat
Apache License 2.0

Hive SQL functions not registered when called through sparklyr #39

Closed JordanCuevas closed 5 years ago

JordanCuevas commented 5 years ago

I'm trying to use Hive date functions in sparklyr::sdf_sql to manipulate some data; however, some of them return errors saying that the function is not registered in the database. This only occurs after spark-sas7bdat is installed on the cluster. Note that I've also filed this issue with sparklyr, as I'm not sure which team would own it. Reproducible example below:


>library(sparklyr)
>library(dplyr)
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

>sc <- spark_connect(method="databricks")
>dat <- data.frame(person=rep(c(1:3),3), measure=rnorm(9))
>src_tbls(sc)
[1] "netprice_092018"        "netprice_42018"         "test_netprice_external"
[4] "test_table"  

>dat <- data.frame(person=rep(c(1:3),3), measure=rnorm(9))
>dat_sparkly <- copy_to(sc, dat, "dat_sparkly") #Gives error, but "dat_sparkly" is sent to Spark (see next command). Same root cause as other errors below?
Error : org.apache.spark.sql.AnalysisException: Undefined function: 'count'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7 (NOTE: If you wish to use SparkR, import it by calling 'library(SparkR)'.)

#copy_to gives error, however table is correctly sent to Spark:
>src_tbls(sc)
[1] "dat_sparkly"            "netprice_092018"        "netprice_42018"        
[4] "test_netprice_external" "test_table"

>sdf_sql(sc, "select * from dat_sparkly") #Works
# Source: spark [?? x 2]
  person measure
*  <int>   <dbl>
1      1  -0.354
2      2  -0.197
3      3  -0.747
4      1   0.118
5      2  -0.742
6      3  -0.430
7      1  -2.55 
8      2   0.886
9      3  -0.713

>sdf_sql(sc, "select current_date from dat_sparkly") #Works
# Source: spark [?? x 1]
  `current_date()`
* <date>
1 2018-10-12      
2 2018-10-12      
3 2018-10-12      
4 2018-10-12      
5 2018-10-12      
6 2018-10-12      
7 2018-10-12      
8 2018-10-12      
9 2018-10-12 

>sdf_sql(sc, "select date_format(current_date,'E') as week from dat_sparkly") #FAILS

Error : org.apache.spark.sql.AnalysisException: Undefined function: 'date_format'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7

Session info below:

>devtools::session_info()
Session info ------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.4 (2018-03-15)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       Etc/UTC                     
 date     2018-10-12                  

Packages ----------------------------------------------------------------------
 package       * version date       source        
 assertthat      0.2.0   2017-04-11 CRAN (R 3.4.4)
 backports       1.1.2   2017-12-13 CRAN (R 3.4.4)
 base          * 3.4.4   2018-03-16 local         
 base64enc       0.1-3   2015-07-28 CRAN (R 3.4.4)
 bindr           0.1.1   2018-03-13 CRAN (R 3.4.4)
 bindrcpp        0.2.2   2018-03-29 CRAN (R 3.4.4)
 broom           0.4.4   2018-03-29 CRAN (R 3.4.4)
 cli             1.0.0   2017-11-05 CRAN (R 3.4.4)
 compiler        3.4.4   2018-03-16 local         
 config          0.3     2018-03-27 CRAN (R 3.4.4)
 crayon          1.3.4   2017-09-16 CRAN (R 3.4.4)
 datasets      * 3.4.4   2018-03-16 local         
 DBI             0.8     2018-03-02 CRAN (R 3.4.4)
 dbplyr          1.2.2   2018-07-25 CRAN (R 3.4.4)
 devtools        1.13.5  2018-02-18 CRAN (R 3.4.4)
 digest          0.6.15  2018-01-28 CRAN (R 3.4.4)
 dplyr         * 0.7.4   2017-09-28 CRAN (R 3.4.4)
 foreign         0.8-70  2018-04-23 CRAN (R 3.4.4)
 forge           0.1.0   2018-08-31 CRAN (R 3.4.4)
 glue            1.2.0   2017-10-29 CRAN (R 3.4.4)
 graphics      * 3.4.4   2018-03-16 local         
 grDevices     * 3.4.4   2018-03-16 local         
 grid            3.4.4   2018-03-16 local         
 htmltools       0.3.6   2017-04-28 CRAN (R 3.4.4)
 htmlwidgets     1.3     2018-09-30 CRAN (R 3.4.4)
 httpuv          1.4.5   2018-07-19 CRAN (R 3.4.4)
 httr            1.3.1   2017-08-20 CRAN (R 3.4.4)
 hwriter         1.3.2   2014-09-10 CRAN (R 3.4.4)
 hwriterPlus     1.0-3   2015-01-05 CRAN (R 3.4.4)
 jsonlite        1.5     2017-06-01 CRAN (R 3.4.4)
 later           0.7.5   2018-09-18 CRAN (R 3.4.4)
 lattice         0.20-35 2017-03-25 CRAN (R 3.3.3)
 lazyeval        0.2.1   2017-10-29 CRAN (R 3.4.4)
 magrittr        1.5     2014-11-22 CRAN (R 3.4.4)
 memoise         1.1.0   2017-04-21 CRAN (R 3.4.4)
 methods       * 3.4.4   2018-03-16 local         
 mime            0.5     2016-07-07 CRAN (R 3.4.4)
 mnormt          1.5-5   2016-10-15 CRAN (R 3.4.4)
 nlme            3.1-137 2018-04-07 CRAN (R 3.4.4)
 parallel        3.4.4   2018-03-16 local         
 pillar          1.2.1   2018-02-27 CRAN (R 3.4.4)
 pkgconfig       2.0.1   2017-03-21 CRAN (R 3.4.4)
 plyr            1.8.4   2016-06-08 CRAN (R 3.4.4)
 promises        1.0.1   2018-04-13 CRAN (R 3.4.4)
 psych           1.8.3.3 2018-03-30 CRAN (R 3.4.4)
 purrr           0.2.4   2017-10-18 CRAN (R 3.4.4)
 r2d3            0.2.2   2018-05-30 CRAN (R 3.4.4)
 R6              2.2.2   2017-06-17 CRAN (R 3.4.4)
 Rcpp            0.12.16 2018-03-13 CRAN (R 3.4.4)
 reshape2        1.4.3   2017-12-11 CRAN (R 3.4.4)
 rlang           0.2.0   2018-02-20 CRAN (R 3.4.4)
 rprojroot       1.3-2   2018-01-03 CRAN (R 3.4.4)
 Rserve          1.7-3   2013-08-21 CRAN (R 3.4.4)
 rstudioapi      0.7     2017-09-07 CRAN (R 3.4.4)
 shiny           1.1.0   2018-05-17 CRAN (R 3.4.4)
 sparklyr      * 0.9.1   2018-09-27 CRAN (R 3.4.4)
 SparkR          2.3.1   2018-10-12 local         
 stats         * 3.4.4   2018-03-16 local         
 stringi         1.1.7   2018-03-12 CRAN (R 3.4.4)
 stringr         1.3.0   2018-02-19 CRAN (R 3.4.4)
 TeachingDemos   2.10    2016-02-12 CRAN (R 3.4.4)
 tibble          1.4.2   2018-01-22 CRAN (R 3.4.4)
 tidyr           0.8.0   2018-01-29 CRAN (R 3.4.4)
 tools           3.4.4   2018-03-16 local         
 utf8            1.1.3   2018-01-03 CRAN (R 3.4.4)
 utils         * 3.4.4   2018-03-16 local         
 withr           2.1.2   2018-03-15 CRAN (R 3.4.4)
 xtable          1.8-3   2018-08-29 CRAN (R 3.4.4)
 yaml            2.2.0   2018-07-25 CRAN (R 3.4.4)

thesuperzapper commented 5 years ago

@JordanCuevas can you confirm that you still have this issue with 2.1?

JordanCuevas commented 5 years ago

> @JordanCuevas can you confirm that you still have this issue with 2.1?

Interestingly, after we uninstalled a BigQuery package that was installed on the same cluster, sparklyr has been working as expected, even with spark-sas7bdat still installed.