Some OpenML datasets can't be parsed

larskotthoff commented 8 years ago

I get

    Rscript -e 'library(farff); readARFF("1112.arff")'
    Parse with reader=readr : 1112.arff
    Loading required package: readr
    Warning: 50001 parsing failures.
    row col  expected      actual
      1 X1   a double      @data
      2 --   1 columns     231 columns
      3 --   1 columns     231 columns
      4 --   1 columns     231 columns
      5 --   1 columns     231 columns
    ... ... ......... ...........
    .See problems(...) for more details.
    Error in colnames<-(*tmp*, value = header$col.names) :
      'names' attribute [231] must be the same length as the vector [1]
    Calls: readARFF -> colnames<-
    In addition: Warning message:
    Unnamed col_types should have the same length as col_names. Using smaller of the two.
    Execution halted

Joaquin says:

Alright, after some experiments I found that the problem goes away if I remove the features which have more than 15000 (nominal) values.

Maybe farff raises an internal error when it encounters such cases and skips them, and hence the feature count won't match, which would explain the error we see.

It happens for datasets 1111, 1112 and 1114.

jakobbossek commented 8 years ago

We do the preprocessing in C. There we skip the header lines to get to the @data section. The line buffer reserved for the skipped lines is too small for the very long header lines in the reported ARFF files. Going to fix this.
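
For illustration, here is a minimal sketch of how header skipping can avoid a fixed-size line buffer. This is not the actual farff code: `read_full_line`, `is_data_marker` and `skip_to_data` are hypothetical names, and the sketch assumes the header lines contain no embedded NUL bytes. The buffer doubles whenever a line does not fit, so an @attribute line with tens of thousands of nominal values is consumed as a single line.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read one complete line of any length into a heap
 * buffer that grows on demand. *buf and *cap persist between calls so the
 * buffer is reused. Returns the line length, or -1 on EOF with nothing
 * read or on allocation failure. */
static long read_full_line(FILE *fp, char **buf, size_t *cap) {
    size_t len = 0;
    if (*buf == NULL || *cap == 0) {
        *cap = 128;
        *buf = malloc(*cap);
        if (*buf == NULL) return -1;
    }
    while (fgets(*buf + len, (int)(*cap - len), fp) != NULL) {
        len += strlen(*buf + len);
        /* Line is complete once we have seen its newline. */
        if (len > 0 && (*buf)[len - 1] == '\n') return (long)len;
        /* Buffer not full and no newline: last line of the file. */
        if (len + 1 < *cap) return (long)len;
        /* Buffer full: double it and keep reading the same line. */
        size_t new_cap = *cap * 2;
        char *tmp = realloc(*buf, new_cap);
        if (tmp == NULL) return -1;
        *buf = tmp;
        *cap = new_cap;
    }
    return len > 0 ? (long)len : -1;
}

/* Hypothetical helper: does the line start with "@data"? ARFF keywords are
 * case-insensitive, and leading blanks are ignored here. */
static int is_data_marker(const char *line) {
    static const char marker[] = "@data";
    while (*line == ' ' || *line == '\t') line++;
    for (size_t i = 0; marker[i] != '\0'; i++) {
        if (tolower((unsigned char)line[i]) != marker[i]) return 0;
    }
    return 1;
}

/* Skip everything up to and including the @data line. Returns the number
 * of lines consumed, or -1 if no @data line was found. */
long skip_to_data(FILE *fp) {
    char *buf = NULL;
    size_t cap = 0;
    long lines = 0;
    long result = -1;
    while (read_full_line(fp, &buf, &cap) >= 0) {
        lines++;
        if (is_data_marker(buf)) {
            result = lines;
            break;
        }
    }
    free(buf);
    return result;
}
```

After `skip_to_data` returns, the stream is positioned at the first data row, so the reader that follows sees the 231 columns instead of the header remainder.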

jakobbossek commented 8 years ago

Fixed.

mlr-org/farff#23