openml / openml-r

R package to interface with OpenML
http://openml.github.io/openml-r/

when downloading KDD98 then uploading it without changes, parsing the new dataset fails #426

Closed FlorianPargent closed 5 years ago

FlorianPargent commented 5 years ago

I encountered this problem when trying to upload another version of the KDD98 dataset (id = 23513), in which the binary target is correctly coded as a factor instead of numeric. Interestingly, downloading the dataset and uploading it again without changes works, but downloading the newly uploaded dataset fails because the ARFF file cannot be parsed.

library(OpenML)

loadOMLConfig(path = "~/.openml/config", assign = TRUE)

# download KDD98
oml_dat = getOMLDataSet(23513)
dat = oml_dat$data

# fix target variable
# dat$TARGET_B = factor(dat$TARGET_B)

new_desc = makeOMLDataSetDescription(
  name = "KDD98-test-bug",
  description = "this is just to test a bug",
  default.target.attribute = oml_dat$desc$default.target.attribute
)

new_oml_dat = makeOMLDataSet(
  desc = new_desc,
  data = dat,
  colnames.old = colnames(dat),
  colnames.new = colnames(dat),
  target.features = new_desc$default.target.attribute
)

id = uploadOMLDataSet(new_oml_dat, confirm.upload = FALSE)
getOMLDataSet(id)
deleteOMLObject(id, object = "data")

I get the following error:

Downloading from 'http://www.openml.org/api/v1/data/41288' to '/var/folders/t5/8s0vv3w545v7x5j0_pqtc8wr0000gp/T//RtmpiZA78C/cache/datasets/41288/description.xml'.
Downloading from 'https://www.openml.org/data/v1/download/20649269/KDD98-test-bug.arff' to '/var/folders/t5/8s0vv3w545v7x5j0_pqtc8wr0000gp/T//RtmpiZA78C/cache/datasets/41288/dataset.arff'
Warning in getOMLDataSetById(data.id = data.id, cache.only = cache.only, 
  Data set is in preparation and will be activated soon.
Warning: 191261 parsing failures.
row # A tibble: 5 x 5
col     row col   expected  actual      file
expected   <int> <chr> <chr>     <chr>       <chr>
actual 1     1 X1    a double  @data       '/var/folders/t5/8s0vv3w545v7x5j0_pqtc8wr0000gp/T//RtmpK26iPr/file276d88c0482'
file 2     2 NA    1 columns 479 columns '/var/folders/t5/8s0vv3w545v7x5j0_pqtc8wr0000gp/T//RtmpK26iPr/file276d88c0482'
row 3     3 NA    1 columns 479 columns '/var/folders/t5/8s0vv3w545v7x5j0_pqtc8wr0000gp/T//RtmpK26iPr/file276d88c0482'
col 4     4 NA    1 columns 479 columns '/var/folders/t5/8s0vv3w545v7x5j0_pqtc8wr0000gp/T//RtmpK26iPr/file276d88c0 [... truncated]
Error in names(x) <- value : 
  'names' attribute [479] must be the same length as the vector [1]
In addition: Warning messages:
1: Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two. 
2: In rbind(names(probs), probs_f) :
  number of columns of result is not a multiple of vector length (arg 1)

To me, the arff files of both datasets look very similar but I am not an expert on that.
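
One way to check how similar two ARFF files really are is to compare only their headers, i.e. everything up to the `@data` line. The following is a minimal sketch in base R; `read_arff_header` is a hypothetical helper (not part of OpenML or farff), and the two toy files merely stand in for the original and the re-uploaded KDD98 files.

```r
# Hypothetical helper: return the non-blank header lines of an ARFF file,
# i.e. all lines up to and including the @data marker.
read_arff_header = function(path) {
  lines = readLines(path, warn = FALSE)
  data_idx = grep("^[[:space:]]*@data", lines, ignore.case = TRUE)[1]
  if (is.na(data_idx)) stop("no @data line found in ", path)
  header = lines[seq_len(data_idx)]
  header[nzchar(trimws(header))]  # drop blank lines
}

# Two toy ARFF files that differ only in how one attribute name is quoted.
old_path = tempfile(fileext = ".arff")
new_path = tempfile(fileext = ".arff")
writeLines(c("@relation toy", "@attribute x numeric",
             "@attribute y {0,1}", "@data", "1,0"), old_path)
writeLines(c("@relation toy", "@attribute 'x' numeric",
             "@attribute y {0,1}", "", "@data", "1,0"), new_path)

# Header lines that appear only in the new file.
only_in_new = setdiff(read_arff_header(new_path), read_arff_header(old_path))
print(only_in_new)
```

Applied to the two real files, any line listed here would point at a header difference (quoting, type declaration, spacing) that a stricter parser might trip over.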

Also, I downloaded the arff and csv files from the OpenML homepage and tried to read them manually. It works for the csv but not for the arff.

My first intuition was that this might be related to my other issue here https://github.com/mlr-org/farff/issues/37, but I do not really see how...


> devtools::session_info()
Session info -------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, darwin15.6.0        
 ui       RStudio (1.1.456)           
 language (EN)                        
 collate  de_DE.UTF-8                 
 tz       Europe/Berlin               
 date     2018-11-16                  

Packages -----------------------------------------------------------------------------------------------------
 package      * version   date       source                                   
 assertthat     0.2.0     2017-04-11 CRAN (R 3.5.0)                           
 backports      1.1.2     2017-12-13 CRAN (R 3.5.0)                           
 base         * 3.5.1     2018-07-05 local                                    
 BBmisc         1.11      2018-11-07 Github (berndbischl/BBmisc@a5a4e45)      
 bindr          0.1.1     2018-03-13 CRAN (R 3.5.0)                           
 bindrcpp       0.2.2     2018-03-29 CRAN (R 3.5.0)                           
 checkmate      1.8.5     2017-10-24 CRAN (R 3.5.0)                           
 colorspace     1.3-2     2016-12-14 CRAN (R 3.5.0)                           
 compiler       3.5.1     2018-07-05 local                                    
 crayon         1.3.4     2017-09-16 CRAN (R 3.5.0)                           
 curl           3.2       2018-03-28 CRAN (R 3.5.0)                           
 data.table     1.11.8    2018-09-30 CRAN (R 3.5.0)                           
 datasets     * 3.5.1     2018-07-05 local                                    
 devtools       1.13.6    2018-06-27 CRAN (R 3.5.0)                           
 digest         0.6.18    2018-10-10 CRAN (R 3.5.0)                           
 dplyr          0.7.7     2018-10-16 CRAN (R 3.5.0)                           
 farff          1.0       2018-10-30 Github (mlr-org/farff@2e911b7)           
 fastmatch      1.1-0     2017-01-28 CRAN (R 3.5.0)                           
 ggplot2        3.1.0     2018-10-25 cran (@3.1.0)                            
 glue           1.3.0     2018-07-17 CRAN (R 3.5.0)                           
 graphics     * 3.5.1     2018-07-05 local                                    
 grDevices    * 3.5.1     2018-07-05 local                                    
 grid           3.5.1     2018-07-05 local                                    
 gtable         0.2.0     2016-02-26 CRAN (R 3.5.0)                           
 hms            0.4.2     2018-03-10 CRAN (R 3.5.0)                           
 httr           1.3.1     2017-08-20 CRAN (R 3.5.0)                           
 jsonlite       1.5       2017-06-01 CRAN (R 3.5.0)                           
 lattice        0.20-35   2017-03-25 CRAN (R 3.5.1)                           
 lazyeval       0.2.1     2017-10-29 CRAN (R 3.5.0)                           
 magrittr       1.5       2014-11-22 CRAN (R 3.5.0)                           
 Matrix         1.2-14    2018-04-13 CRAN (R 3.5.1)                           
 memoise        1.1.0     2017-04-21 CRAN (R 3.5.0)                           
 methods      * 3.5.1     2018-07-05 local                                    
 mlr          * 2.13.9000 2018-11-15 Github (mlr-org/mlr@ae82e77)             
 munsell        0.5.0     2018-06-12 CRAN (R 3.5.0)                           
 OpenML       * 1.9       2018-11-15 Github (openml/openml-r@e07e1a1)         
 parallel       3.5.1     2018-07-05 local                                    
 parallelMap    1.4       2018-11-07 Github (berndbischl/parallelMap@101b91d) 
 ParamHelpers * 1.11      2018-11-07 Github (berndbischl/ParamHelpers@0516926)
 pillar         1.3.0     2018-07-14 CRAN (R 3.5.0)                           
 pkgconfig      2.0.2     2018-08-16 CRAN (R 3.5.0)                           
 plyr           1.8.4     2016-06-08 CRAN (R 3.5.0)                           
 purrr          0.2.5     2018-05-29 CRAN (R 3.5.0)                           
 R6             2.3.0     2018-10-04 CRAN (R 3.5.0)                           
 Rcpp           0.12.19   2018-10-01 CRAN (R 3.5.0)                           
 readr        * 1.1.1     2017-05-16 CRAN (R 3.5.0)                           
 rlang          0.3.0.1   2018-10-25 cran (@0.3.0.1)                          
 rstudioapi     0.8       2018-10-02 CRAN (R 3.5.0)                           
 scales         1.0.0     2018-08-09 CRAN (R 3.5.0)                           
 splines        3.5.1     2018-07-05 local                                    
 stats        * 3.5.1     2018-07-05 local                                    
 stringi        1.2.4     2018-07-20 CRAN (R 3.5.0)                           
 survival       2.42-3    2018-04-16 CRAN (R 3.5.1)                           
 tibble         1.4.2     2018-01-22 CRAN (R 3.5.0)                           
 tidyselect     0.2.5     2018-10-11 CRAN (R 3.5.0)                           
 tools          3.5.1     2018-07-05 local                                    
 utils        * 3.5.1     2018-07-05 local                                    
 withr          2.1.2     2018-03-15 CRAN (R 3.5.0)                           
 XML            3.98-1.16 2018-08-19 CRAN (R 3.5.0)                           
 yaml           2.2.0     2018-07-25 CRAN (R 3.5.0)
FlorianPargent commented 5 years ago

Just to make it clear: it also does not work when I change the uploaded dataset by uncommenting the originally intended line: dat$TARGET_B = factor(dat$TARGET_B)

giuseppec commented 5 years ago

Does this also happen if you use the ARFF reader from RWeka instead of farff? And could you please try another ARFF reader, e.g. the read.arff function from the foreign package? The data set is so big that it takes too long on my laptop to do a "quick check" here. Also: what happens if you upload just a smaller subset of the data? Could you try this out (without deleting the data sets you upload, so that I can check this more quickly)?

FlorianPargent commented 5 years ago

41290 is the big version while 41291 is a small version with only 10000 rows. Both fail with readARFF. Will try the other functions next.
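
A small version like the one mentioned above could be produced roughly as follows. This is only a sketch that reuses the calls from the original report; `upload_subset`, its default name `"KDD98-test-bug-small"`, and the description text are hypothetical, and actually running it needs the OpenML package, network access, and a configured API key.

```r
# Hypothetical helper: upload the first n rows of an OML data set under a
# new name. The OpenML calls mirror the ones used in the original report.
upload_subset = function(oml_dat, n = 10000, name = "KDD98-test-bug-small") {
  dat_small = head(oml_dat$data, n)

  desc = OpenML::makeOMLDataSetDescription(
    name = name,
    description = "small subset to test a parsing bug",
    default.target.attribute = oml_dat$desc$default.target.attribute
  )

  small = OpenML::makeOMLDataSet(
    desc = desc,
    data = dat_small,
    colnames.old = colnames(dat_small),
    colnames.new = colnames(dat_small),
    target.features = desc$default.target.attribute
  )

  # Returns the id of the uploaded data set. The data set is deliberately
  # not deleted afterwards so that others can inspect it.
  OpenML::uploadOMLDataSet(small, confirm.upload = FALSE)
}
```

Usage would be along the lines of `id = upload_subset(getOMLDataSet(23513))`, followed by `getOMLDataSet(id)` to see whether parsing still fails.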

FlorianPargent commented 5 years ago

OK, sorry, this seems to be a problem with farff: read.arff from the foreign package works, even for the big one.
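
A check like this can be scripted so that the same file is passed to more than one reader. The sketch below uses a tiny toy ARFF file instead of the downloaded KDD98 file, so it does not reproduce the failure; it only shows the shape of the comparison. The farff part runs only if farff is installed, and the variable names are made up for illustration.

```r
library(foreign)  # recommended package that ships with R

# Toy stand-in for the downloaded data set.
toy = data.frame(
  x = c(1.5, 2.5, 3.5),
  target = factor(c("0", "1", "0"))
)

# Write the toy data to a temporary ARFF file.
arff_path = tempfile(fileext = ".arff")
write.arff(toy, file = arff_path)

# Reader 1: foreign::read.arff
df_foreign = read.arff(arff_path)

# Reader 2: farff::readARFF, only attempted if farff is installed.
# Errors are caught and reported instead of stopping the script.
df_farff = NULL
if (requireNamespace("farff", quietly = TRUE)) {
  df_farff = tryCatch(
    farff::readARFF(arff_path),
    error = function(e) {
      message("farff failed: ", conditionMessage(e))
      NULL
    }
  )
}

str(df_foreign)
if (!is.null(df_farff)) str(df_farff)
```

Replacing `arff_path` with the path of the downloaded file would show directly which reader fails on it.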

giuseppec commented 5 years ago

Will close this here since using RWeka instead of farff with the OpenML package seems to work. Just reopen if you encounter any other problem or think that I should include the read.arff function from the foreign package as a third arff reader option in the OpenML package. But for now it seems to work if you do this:

setOMLConfig(arff.reader = "RWeka") # you can also set RWeka as default in your config file
d = getOMLDataSet(41291)
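
If switching readers by hand becomes tedious, the workaround could be wrapped in a small fallback function. This is a sketch, not something the OpenML package provides: `get_with_fallback` is a hypothetical name, and it assumes that `"farff"` and `"RWeka"` are the accepted values of `arff.reader`.

```r
# Hypothetical helper: try each ARFF reader in turn and return the first
# data set that can be parsed. Assumes "farff" and "RWeka" are valid
# values for the arff.reader config option.
get_with_fallback = function(data.id, readers = c("farff", "RWeka")) {
  for (reader in readers) {
    OpenML::setOMLConfig(arff.reader = reader)
    res = tryCatch(
      OpenML::getOMLDataSet(data.id),
      error = function(e) {
        message("reader '", reader, "' failed: ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(res)) return(res)
  }
  stop("none of the ARFF readers could parse data set ", data.id)
}
```

A call such as `d = get_with_fallback(41291)` would then try farff first and only fall back to RWeka when parsing raises an error. Note that the reader set last stays active in the session config.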

Still, it would be great to pin down what exactly causes farff to fail, since farff was designed to behave exactly like RWeka. Could you please open an issue in the farff tracker for this? If I have time and understand the problem, I could maybe make a fix in farff. Otherwise you have to force Bernd to look at this (or try to fix it yourself and make a PR on farff, which is probably easier than forcing Bernd ;) ).

FlorianPargent commented 5 years ago

Just for completeness: with the latest change in farff, this example now works.