mlr-org / farff

a faster arff parser

readARFF with a high cardinality factor does not work if the labels are too long #37

Open FlorianPargent opened 6 years ago

FlorianPargent commented 6 years ago
library(farff)
library(stringi)

set.seed(1)

n = 2000000
n_levels = 25000
label_length = 30

fac_levels = stri_rand_strings(n = n_levels, length = label_length)

# a high cardinality factor with "long" labels
dat1 = data.frame(huge_factor = factor(sample(fac_levels, size = n, replace = TRUE)))

# make the labels as short as possible
dat2 = dat1
levels(dat2$huge_factor) = abbreviate(fac_levels, minlength = 1)

# write arff files (successful for both!)
writeARFF(dat1, path = "datafile1.arff")
writeARFF(dat2, path = "datafile2.arff")

# reading the long label version takes a very long time and breaks in a strange way
dat3 = readARFF("datafile1.arff")
all.equal(dat1, dat3)

# the short label version works fine
dat4 = readARFF("datafile2.arff")
all.equal(dat2, dat4)

Side note: this leads to hard-to-debug errors when working with OpenML. The dataset can be uploaded through the R interface without error, but the download then fails (or, in one case I had, seems to get stuck in an infinite loop).
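
The upload-succeeds-but-download-fails situation could be caught locally with a round-trip check before uploading. The following is a minimal sketch, not part of farff or OpenML: `roundtrip_ok` is a hypothetical helper name, and the comparison rule (same dimensions, then `all.equal` without attributes) is an assumption about what "equal" should mean here. Only `writeARFF` and `readARFF` are farff functions.

```r
# Hypothetical helper: write a data.frame to a temporary file, read it back,
# and report whether the round trip preserved it.
# The writer/reader default to farff, but any pair of functions with the
# signatures writer(df, path) and reader(path) can be supplied.
roundtrip_ok = function(df, writer = farff::writeARFF, reader = farff::readARFF) {
  path = tempfile(fileext = ".arff")
  on.exit(unlink(path), add = TRUE)   # clean up the temporary file
  writer(df, path)
  back = reader(path)
  # Compare dimensions first (catches added or lost rows), then contents;
  # attributes such as a tibble class or row names are ignored.
  identical(dim(df), dim(back)) &&
    isTRUE(all.equal(as.data.frame(df), as.data.frame(back),
                     check.attributes = FALSE))
}

# Usage with the objects from this report (requires farff):
# roundtrip_ok(dat2)
# roundtrip_ok(dat1)
```

With the data from this report, `roundtrip_ok(dat2)` should pass, while `roundtrip_ok(dat1)` would be expected to return `FALSE` or raise an error for as long as the problem persists.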

> devtools::session_info()
Session info ----------------------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, darwin15.6.0        
 ui       RStudio (1.1.456)           
 language (EN)                        
 collate  de_DE.UTF-8                 
 tz       Europe/Berlin               
 date     2018-11-08                  

Packages --------------------------------------------------------------------------------------------------------------------------------------------
 package    * version date       source                             
 assertthat   0.2.0   2017-04-11 CRAN (R 3.5.0)                     
 backports    1.1.2   2017-12-13 CRAN (R 3.5.0)                     
 base       * 3.5.1   2018-07-05 local                              
 BBmisc       1.11    2018-11-07 Github (berndbischl/BBmisc@a5a4e45)
 checkmate    1.8.5   2017-10-24 CRAN (R 3.5.0)                     
 cli          1.0.1   2018-09-25 CRAN (R 3.5.0)                     
 compiler     3.5.1   2018-07-05 local                              
 crayon       1.3.4   2017-09-16 CRAN (R 3.5.0)                     
 data.table   1.11.8  2018-09-30 CRAN (R 3.5.0)                     
 datasets   * 3.5.1   2018-07-05 local                              
 devtools     1.13.6  2018-06-27 CRAN (R 3.5.0)                     
 digest       0.6.18  2018-10-10 CRAN (R 3.5.0)                     
 fansi        0.4.0   2018-10-05 CRAN (R 3.5.0)                     
 farff      * 1.0     2018-10-30 Github (mlr-org/farff@2e911b7)     
 graphics   * 3.5.1   2018-07-05 local                              
 grDevices  * 3.5.1   2018-07-05 local                              
 hms          0.4.2   2018-03-10 CRAN (R 3.5.0)                     
 memoise      1.1.0   2017-04-21 CRAN (R 3.5.0)                     
 methods    * 3.5.1   2018-07-05 local                              
 pillar       1.3.0   2018-07-14 CRAN (R 3.5.0)                     
 pkgconfig    2.0.2   2018-08-16 CRAN (R 3.5.0)                     
 R6           2.3.0   2018-10-04 CRAN (R 3.5.0)                     
 Rcpp         0.12.19 2018-10-01 CRAN (R 3.5.0)                     
 readr      * 1.1.1   2017-05-16 CRAN (R 3.5.0)                     
 rlang        0.3.0.1 2018-10-25 cran (@0.3.0.1)                    
 rstudioapi   0.8     2018-10-02 CRAN (R 3.5.0)                     
 stats      * 3.5.1   2018-07-05 local                              
 stringi    * 1.2.4   2018-07-20 CRAN (R 3.5.0)                     
 tibble       1.4.2   2018-01-22 CRAN (R 3.5.0)                     
 tools        3.5.1   2018-07-05 local                              
 utf8         1.1.4   2018-05-24 CRAN (R 3.5.0)                     
 utils      * 3.5.1   2018-07-05 local                              
 withr        2.1.2   2018-03-15 CRAN (R 3.5.0)                     
 yaml         2.2.0   2018-07-25 CRAN (R 3.5.0)
FlorianPargent commented 5 years ago

Unfortunately, the last fix does not seem to be enough:

> dat3 = readARFF("datafile1.arff")
Parse with reader=readr : datafile1.arff
Loading required package: readr
Warning: 1 parsing failure.
# A tibble: 1 x 5
    row col   expected  actual      file
  <int> <chr> <chr>     <chr>       <chr>
1     1 NA    1 columns 759 columns '/var/folders/t5/8s0vv3w545v7x5j0_pqtc8wr0000gp/T//Rtmpcmkgef/file475535af75d5'

header: 114.905000; preproc: 0.504000; data: 0.845000; postproc: 0.096000; total: 116.350000
Warning messages:
1: Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two. 
2: In rbind(names(probs), probs_f) :
  number of columns of result is not a multiple of vector length (arg 2)
> all.equal(dat1, dat3)
 [1] "Attributes: < Names: 1 string mismatch >"                                                    
 [2] "Attributes: < Length mismatch: comparison on first 2 components >"                           
 [3] "Attributes: < Component 2: Modes: numeric, list >"                                           
 [4] "Attributes: < Component 2: Lengths: 2000000, 5 >"                                            
 [5] "Attributes: < Component 2: names for current but not for target >"                           
 [6] "Attributes: < Component 2: Attributes: < target is NULL, current is list > >"
 [7] "Attributes: < Component 2: target is numeric, current is tbl_df >"                           
 [8] "Component “huge_factor”: Lengths: 2000000, 2000002"                                          
 [9] "Component “huge_factor”: Lengths (2000000, 2000002) differ (string compare on first 2000000)"
[10] "Component “huge_factor”: 'is.NA' value mismatch: 2 in current 0 in target"                   

A data frame is now returned, but the number of rows does not match. Two all-NA rows appear to have been added at the beginning of the data frame:

> dim(dat1)
[1] 2000000       1
> dim(dat3)
[1] 2000002       1
>
> head(dat1)
                     huge_factor
1 6GwiqtKZwCEVtO4wpTeqK58HKKsgMc
2 9jc6lV3by0tkHv8UUBtv1p30baKu6z
3 rpF65yg5DY3sHk5mnRbWKVHR03lA3S
4 8uZpJsDm7WI13zFYoUD6obcLeG0I1Z
5 KZti0i9paE3iB0umaC46x1pN3GPzQ7
6 7xfDZa1ug3we4cKNmE5p6JwUZwdmSg
>
> head(dat3)
                     huge_factor
1                           <NA>
2                           <NA>
3 6GwiqtKZwCEVtO4wpTeqK58HKKsgMc
4 9jc6lV3by0tkHv8UUBtv1p30baKu6z
5 rpF65yg5DY3sHk5mnRbWKVHR03lA3S
6 8uZpJsDm7WI13zFYoUD6obcLeG0I1Z
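
The leading NA rows can be counted, and stripped as a stopgap, with base R. This is a sketch only: both helper names are hypothetical (nothing here is part of farff), and it assumes the spurious rows are entirely NA and sit at the top of the data frame, as in the `head(dat3)` output above. It works around the symptom and does not address the parsing problem itself.

```r
# Hypothetical helpers, not part of farff.

# Number of consecutive all-NA rows at the top of a data.frame.
count_leading_na_rows = function(df) {
  all_na = rowSums(!is.na(df)) == 0     # TRUE for rows without any value
  first_ok = which(!all_na)[1]          # index of the first row holding data
  if (is.na(first_ok)) nrow(df) else first_ok - 1L
}

# Drop that leading run; NA rows further down are left alone.
drop_leading_na_rows = function(df) {
  k = count_leading_na_rows(df)
  out = df[seq_len(nrow(df)) > k, , drop = FALSE]
  rownames(out) = NULL                  # renumber rows from 1
  out
}

# Toy data mimicking head(dat3): two NA rows followed by real labels.
toy = data.frame(huge_factor = factor(c(NA, NA, "a", "b", "a")))
count_leading_na_rows(toy)
dim(drop_leading_na_rows(toy))
```

Applied to the objects in this report, `all.equal(dat1, drop_leading_na_rows(as.data.frame(dat3)))` would show whether the remaining rows match once the extra ones are removed.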
FlorianPargent commented 5 years ago
> devtools::session_info()
Session info ----------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, darwin15.6.0        
 ui       RStudio (1.1.456)           
 language (EN)                        
 collate  de_DE.UTF-8                 
 tz       Europe/Berlin               
 date     2018-11-20                  

Packages --------------------------------------------------------------------------------------------------------------------------------
 package    * version date       source                             
 assertthat   0.2.0   2017-04-11 CRAN (R 3.5.0)                     
 backports    1.1.2   2017-12-13 CRAN (R 3.5.0)                     
 base       * 3.5.1   2018-07-05 local                              
 BBmisc       1.11    2018-11-07 Github (berndbischl/BBmisc@a5a4e45)
 checkmate    1.8.5   2017-10-24 CRAN (R 3.5.0)                     
 cli          1.0.1   2018-09-25 CRAN (R 3.5.0)                     
 compiler     3.5.1   2018-07-05 local                              
 crayon       1.3.4   2017-09-16 CRAN (R 3.5.0)                     
 data.table   1.11.8  2018-09-30 CRAN (R 3.5.0)                     
 datasets   * 3.5.1   2018-07-05 local                              
 devtools     1.13.6  2018-06-27 CRAN (R 3.5.0)                     
 digest       0.6.18  2018-10-10 CRAN (R 3.5.0)                     
 fansi        0.4.0   2018-10-05 CRAN (R 3.5.0)                     
 farff      * 1.0     2018-11-20 Github (mlr-org/farff@8221efb)     
 graphics   * 3.5.1   2018-07-05 local                              
 grDevices  * 3.5.1   2018-07-05 local                              
 hms          0.4.2   2018-03-10 CRAN (R 3.5.0)                     
 memoise      1.1.0   2017-04-21 CRAN (R 3.5.0)                     
 methods    * 3.5.1   2018-07-05 local                              
 pillar       1.3.0   2018-07-14 CRAN (R 3.5.0)                     
 pkgconfig    2.0.2   2018-08-16 CRAN (R 3.5.0)                     
 R6           2.3.0   2018-10-04 CRAN (R 3.5.0)                     
 Rcpp         0.12.19 2018-10-01 CRAN (R 3.5.0)                     
 readr      * 1.1.1   2017-05-16 CRAN (R 3.5.0)                     
 rlang        0.3.0.1 2018-10-25 cran (@0.3.0.1)                    
 rstudioapi   0.8     2018-10-02 CRAN (R 3.5.0)                     
 stats      * 3.5.1   2018-07-05 local                              
 stringi    * 1.2.4   2018-07-20 CRAN (R 3.5.0)                     
 tibble       1.4.2   2018-01-22 CRAN (R 3.5.0)                     
 tools        3.5.1   2018-07-05 local                              
 utf8         1.1.4   2018-05-24 CRAN (R 3.5.0)                     
 utils      * 3.5.1   2018-07-05 local                              
 withr        2.1.2   2018-03-15 CRAN (R 3.5.0)                     
 yaml         2.2.0   2018-07-25 CRAN (R 3.5.0)