mlr-org / farff

a faster arff parser

currently we cannot parse sparse files. #4

Open berndbischl opened 9 years ago

berndbischl commented 9 years ago

We need to

a) throw an error now if a sparse file is detected

b) figure out which sparse matrix reader on CRAN parses something close to sparse ARFF

Here is a list of data IDs (DIDs) of sparse files: 350, 386, 391, 397, 401
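
A minimal sketch of what the check in a) could look like, assuming the file has already been read into a character vector of lines and that a sparse file can be recognized by its first data row starting with `{`. The helper names `is_sparse_arff` and `assert_dense_arff` are hypothetical and not part of farff.

```r
# Hypothetical helper (not farff's actual code): TRUE if the first data row
# after the @data line starts with "{", as rows in sparse ARFF files do.
is_sparse_arff = function(lines) {
  data.start = grep("^\\s*@data\\b", lines, ignore.case = TRUE, perl = TRUE)
  if (length(data.start) == 0L)
    stop("No @data section found")
  # keep only the lines after the @data line
  body = lines[-seq_len(data.start[1L])]
  # drop empty lines and '%' comment lines
  body = body[!grepl("^\\s*(%.*)?$", body, perl = TRUE)]
  length(body) > 0L && grepl("^\\s*\\{", body[1L], perl = TRUE)
}

# Hypothetical wrapper: fail early instead of mis-parsing a sparse file.
assert_dense_arff = function(lines) {
  if (is_sparse_arff(lines))
    stop("Sparse ARFF files are currently not supported")
  invisible(TRUE)
}
```

Only the first data row is inspected, so the check stays cheap even for large files.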

berndbischl commented 9 years ago

Maybe this helps,

http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/manual.pdf

https://cran.r-project.org/web/packages/slam/index.html

berndbischl commented 8 years ago

I have added some code that detects sparse files and throws an error.

berndbischl commented 8 years ago

It is also documented in readARFF.

jakobbossek commented 8 years ago

Some more data IDs of sparse files:

sparse.data.ids = c(273, 292, 293, 350, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401)

jakobbossek commented 8 years ago

ad b) The data section in sparse ARFF files is similar to the CLUTO data format for sparse matrices. However, the two formats differ slightly. In CLUTO:

1) The first line contains three integers n m l, where n is the number of rows, m is the number of columns and l is the total number of non-zero entries in the matrix.

2) The entries are separated by whitespace and not by commas.

3) Single lines are not wrapped in curly braces.

In order to use, e.g., slam::read_stm_CLUTO(file), we need to preprocess the file. 1) We get n by counting the data lines and m by counting the attributes in the header; l is harder, but feasible. 2) Should not be that hard. 3) Trivial.
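
The preprocessing could be sketched roughly as follows. This assumes numeric values only (no quoted or nominal values containing commas or spaces) and one row per line. It also shifts the 0-based ARFF indices by one, on the assumption that CLUTO numbers columns starting at 1; that should be checked against the CLUTO manual. The function name `arff_sparse_to_cluto` is hypothetical.

```r
# Hypothetical helper: convert the lines of a sparse ARFF file into the lines
# of a CLUTO sparse matrix file. Assumes numeric data without quoting.
arff_sparse_to_cluto = function(lines) {
  lines = trimws(lines)
  # m: number of attributes declared in the header
  m = sum(grepl("^@attribute\\b", lines, ignore.case = TRUE, perl = TRUE))
  # n: number of data rows, i.e. lines wrapped in curly braces
  rows = grep("^\\{.*\\}$", lines, value = TRUE, perl = TRUE)
  n = length(rows)
  # 3) strip the braces, then split each row into its "index value" entries
  entries = strsplit(gsub("[{}]", "", rows), ",", fixed = TRUE)
  entries = lapply(entries, function(e) {
    e = trimws(e)
    e[nzchar(e)]
  })
  # l: total number of stored entries over all rows
  l = sum(lengths(entries))
  # 2) re-join entries with whitespace, shifting indices (assumed 1-based in CLUTO)
  body = vapply(entries, function(e) {
    parts = strsplit(e, "[[:space:]]+")
    idx = as.integer(vapply(parts, `[`, character(1L), 1L)) + 1L
    val = vapply(parts, `[`, character(1L), 2L)
    paste(idx, val, collapse = " ")
  }, character(1L))
  c(paste(n, m, l), body)
}
```

The result can be written to a temporary file with `writeLines()` and then passed to `slam::read_stm_CLUTO()`. Counting l this way is cheap because the entries have to be split for the conversion anyway.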

mllg commented 8 years ago

Note that slam just reads the complete file into memory, converts strings to double and then feeds the dense data into a sparse matrix.

If you can do this on a file, you can just use fread.

mllg commented 8 years ago

Oh damn, I am wrong. Never mind. But just look at the implementation, it is easy to adapt.

giuseppec commented 8 years ago

Our ARFF sparse data looks like

{0 -1, 1 1, 10 1, 14 1, 19 1, 39 1, 40 1, 52 1, 61 1, 67 1, 72 1, 74 1, 76 1, 82 1, 83 1}
{0 -1, 1 1, 10 1, 14 1, 19 1, 39 1, 40 1, 52 1, 63 1, 67 1, 73 1, 74 1, 76 1, 82 1, 83 1}
{0 1, 1 1, 10 1, 14 1, 19 1, 39 1, 40 1, 52 1, 63 1, 67 1, 73 1, 74 1, 77 1, 80 1, 83 1}
{0 -1, 1 1, 10 1, 14 1, 19 1, 39 1, 42 1, 52 1, 62 1, 67 1, 72 1, 74 1, 76 1, 78 1, 83 1}

Maybe another possibility is the read.matrix.csr function from the e1071 package, which is able to read files that look like:

1:-1 2:1 11:1 15:1 20:1 40:1 41:1 53:1 62:1 68:1 73:1 75:1 77:1 83:1 84:1 
1:-1 2:1 11:1 15:1 20:1 40:1 41:1 53:1 64:1 68:1 74:1 75:1 77:1 83:1 84:1 
1:1 2:1 11:1 15:1 20:1 40:1 41:1 53:1 64:1 68:1 74:1 75:1 78:1 81:1 84:1 
1:-1 2:1 11:1 15:1 20:1 40:1 43:1 53:1 63:1 68:1 73:1 75:1 77:1 79:1 84:1 

So we just have to

  1. replace whitespace between two numbers with :
  2. remove the commas , and curly brackets { and }
  3. increment the number before the : by one (maybe this helps http://stackoverflow.com/questions/12941362/is-it-possible-to-increment-numbers-using-regex-substitution)
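
The three steps above can be sketched in base R as follows, assuming numeric values only and one row per line. Splitting each entry and incrementing the index with ordinary arithmetic avoids the regex-increment trick. The function name `arff_sparse_to_csr_lines` is hypothetical.

```r
# Hypothetical helper: rewrite sparse ARFF data rows ("{0 -1, 1 1}") into the
# "index:value" lines shown above. Assumes numeric data without quoting.
arff_sparse_to_csr_lines = function(rows) {
  # step 2: remove the curly brackets and split at the commas
  entries = strsplit(gsub("[{}]", "", rows), ",", fixed = TRUE)
  vapply(entries, function(e) {
    e = trimws(e)
    e = e[nzchar(e)]
    # step 3: the index is everything before the first whitespace, plus one
    idx = as.integer(sub("[[:space:]].*$", "", e)) + 1L
    val = sub("^[^[:space:]]+[[:space:]]+", "", e)
    # step 1: join index and value with ":" instead of whitespace
    paste(idx, val, sep = ":", collapse = " ")
  }, character(1L))
}

arff_sparse_to_csr_lines("{0 -1, 1 1, 10 1}")
# "1:-1 2:1 11:1"
```

The converted lines could then be written out with `writeLines()` and read with `read.matrix.csr` from e1071, as suggested above.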

berndbischl commented 8 years ago

hmm this indeed looks doable?