mlr-org / farff

a faster arff parser

columns containing question marks #18

Closed by giuseppec 8 years ago

giuseppec commented 8 years ago

There seems to be a problem when a column contains question marks (this may also occur with other special characters). In OpenML, missing values in an ARFF file are encoded as a question mark; see, for example, the prediction object in http://www.openml.org/r/506373, which looks like:

@relation 'run$predictions'
@attribute 'repeat' numeric
@attribute 'fold' numeric
@attribute 'row_id' numeric
@attribute 'prediction' {'good','bad'}
@attribute 'truth' {'good','bad'}
@attribute 'confidence.good' {FALSE, TRUE}
@attribute 'confidence.bad' {FALSE, TRUE}
@data
0 0 490 ? "good" ? ?
0 0 406 ? "good" ? ?
0 0 139 ? "good" ? ?
0 0 482 ? "good" ? ?

RWeka is able to read this, but farff fails.

library(OpenML)

# fails
setOMLConfig(arff.reader = "farff")
d = getOMLRun(506373)

# works
setOMLConfig(arff.reader = "RWeka")
d = getOMLRun(506373)

The error seems to happen in c_rd_preproc, which produces the following for the first four lines:

00490NA"good"NANA
00406NA"good"NANA
00139NA"good"NANA
00482NA"good"NANA
berndbischl commented 8 years ago

Isn't the problem that data values are supposed to be separated by commas?

http://www.cs.waikato.ac.nz/ml/weka/arff.html

"Each instance is represented on a single line, with carriage returns denoting the end of the instance. Attribute values for each instance are delimited by commas."

berndbischl commented 8 years ago

@joaquinvanschoren

giuseppec commented 8 years ago

OK, if this is the case, I have to check how the prediction file is created (it might be that no commas are written if there are NA columns). However, we need at least a warning in farff (although RWeka would still be able to read ARFF files without commas).
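
To make the warning idea concrete, here is a minimal base R sketch of such a check. The function name `warn_if_no_commas` is hypothetical and not part of farff; it assumes the file has already been read into a character vector of lines, and it only looks for data rows without any comma when more than one attribute is declared.

```r
# Hypothetical helper, NOT part of farff: warn when the @data section
# does not look comma-delimited.
warn_if_no_commas = function(lines) {
  lines = trimws(lines)
  data.start = grep("^@data", lines, ignore.case = TRUE)
  if (length(data.start) == 0L)
    return(invisible(FALSE))
  n.attr = length(grep("^@attribute", lines, ignore.case = TRUE))
  rows = lines[seq_along(lines) > data.start[1L]]
  # Skip empty lines, comment lines (%) and sparse rows ({...}).
  rows = rows[nzchar(rows) & !grepl("^[%{]", rows)]
  # A row is suspicious if several attributes are declared but it has no comma.
  bad = n.attr > 1L & !grepl(",", rows, fixed = TRUE)
  if (any(bad)) {
    warning(sprintf(
      "%i data row(s) contain no comma although %i attributes are declared",
      sum(bad), n.attr))
  }
  invisible(any(bad))
}

# Usage on a small made-up file with space-separated data rows.
arff = c("@relation r", "@attribute a numeric", "@attribute b numeric",
  "@data", "0 ?", "1 2")
tryCatch(warn_if_no_commas(arff),
  warning = function(w) cat("warning:", conditionMessage(w), "\n"))
```

This is only a heuristic: it does not parse quoting, so a row whose only comma sits inside a quoted string would pass unnoticed.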

joaquinvanschoren commented 8 years ago

Yes, it should be comma-separated.
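
For reference, the four data rows from the report can be turned into comma-separated rows with a short base R sketch like the one below. It assumes that no value contains embedded whitespace, which holds for these rows but not for ARFF files in general; a real converter would have to respect quoting.

```r
# Data rows as they appear in the report (space-separated).
rows = c(
  '0 0 490 ? "good" ? ?',
  '0 0 406 ? "good" ? ?',
  '0 0 139 ? "good" ? ?',
  '0 0 482 ? "good" ? ?'
)

# Replace each run of whitespace with a single comma.
# Assumption: no quoted value contains spaces.
fixed = gsub("[[:space:]]+", ",", trimws(rows))
writeLines(fixed)
```

The missing-value marker `?` is left untouched; only the delimiter changes.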