Open berndbischl opened 9 years ago
I have added code that detects sparse files and throws an error when one is found.
This is also documented in readARFF.
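For illustration, a minimal sketch of how such a check could look. This is not the actual readARFF code: the helper name isSparseARFF is hypothetical, and it assumes a file is sparse exactly when the first non-comment row after the @data marker starts with a curly brace.

```r
# Hypothetical helper, not the actual readARFF implementation.
# lines: the file content, e.g. from readLines(path)
isSparseARFF = function(lines) {
  # locate the @data marker (case-insensitive, leading whitespace allowed)
  data.start = grep("^[[:space:]]*@data", lines, ignore.case = TRUE)
  if (length(data.start) == 0L)
    stop("No @data section found")
  body = lines[seq_along(lines) > data.start[1L]]
  # drop blank lines and comment lines starting with %
  body = body[!grepl("^[[:space:]]*(%.*)?$", body)]
  # sparse rows are wrapped in curly braces
  length(body) > 0L && grepl("^[[:space:]]*\\{", body[1L])
}

dense = c("@relation r", "@attribute a numeric", "@data", "1,2")
sparse = c("@relation r", "@attribute a numeric", "@DATA", "% comment", "{0 1, 3 2}")
isSparseARFF(dense)
isSparseARFF(sparse)
```

A reader could then call stop() with an informative message whenever this returns TRUE.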
Some more data IDs of sparse files:
sparse.data.ids = c(273, 292, 293, 350, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401)
ad b) the data section in sparse ARFF files is fairly similar to the CLUTO data format for sparse matrices. However, the ARFF format deviates slightly from CLUTO:
1) the first line in CLUTO contains three integers n m l, where n is the number of rows, m is the number of columns and l is the total number of non-zero entries in the matrix.
2) the entries are separated by whitespace and not by comma.
3) single lines are not wrapped in curly braces
In order to use, e.g., slam::read_stm_CLUTO(file), we need to preprocess.
1) We get n by counting lines, m by counting the attributes in the header. l is harder, but feasible.
2) Should not be that hard.
3) Trivial
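The three preprocessing steps above could be sketched roughly as follows. The helper name arffSparseToCLUTO is hypothetical. The sketch assumes purely numeric values (no quoted strings containing commas or spaces) and that m has already been counted from the header. It also shifts the indices by one, on the assumption that CLUTO columns are numbered from 1 while sparse ARFF starts at 0; that should be checked against the CLUTO and slam documentation.

```r
# Hypothetical helper: convert sparse ARFF data rows to CLUTO-style lines.
# rows: character vector such as "{0 -1, 1 1, 10 1}"
# m:    number of attributes, counted from the ARFF header
arffSparseToCLUTO = function(rows, m) {
  # 3) strip the curly braces
  inner = gsub("[{}]", "", rows)
  # split every row into its "index value" entries
  entries = strsplit(inner, ",", fixed = TRUE)
  entries = lapply(entries, trimws)
  entries = lapply(entries, function(e) e[nzchar(e)])
  # 1) n = number of rows, l = number of stored entries
  #    (explicitly stored zeros would be counted as well)
  n = length(rows)
  l = sum(lengths(entries))
  # 2) whitespace instead of commas; indices shifted to 1-based (assumption)
  body = vapply(entries, function(e) {
    parts = strsplit(e, "[[:space:]]+")
    idx = as.integer(vapply(parts, `[`, character(1L), 1L)) + 1L
    val = vapply(parts, `[`, character(1L), 2L)
    paste(idx, val, collapse = " ")
  }, character(1L))
  c(paste(n, m, l), body)
}

rows = c("{0 -1, 1 1, 10 1}", "{0 1, 5 2.5}")
out = arffSparseToCLUTO(rows, m = 84)
# untested follow-up: writeLines(out, tmp); slam::read_stm_CLUTO(tmp)
```

Counting l turns out to be cheap once the rows are split, since it is just the total number of entries.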
Note that slam just reads the complete file into memory, converts strings to double and then feeds the dense data into a sparse matrix.
If you can do this on a file, you can just use fread.
Oh damn, I am wrong. Never mind. But just look at the implementation, it is easy to adapt.
Our ARFF sparse data looks like
{0 -1, 1 1, 10 1, 14 1, 19 1, 39 1, 40 1, 52 1, 61 1, 67 1, 72 1, 74 1, 76 1, 82 1, 83 1}
{0 -1, 1 1, 10 1, 14 1, 19 1, 39 1, 40 1, 52 1, 63 1, 67 1, 73 1, 74 1, 76 1, 82 1, 83 1}
{0 1, 1 1, 10 1, 14 1, 19 1, 39 1, 40 1, 52 1, 63 1, 67 1, 73 1, 74 1, 77 1, 80 1, 83 1}
{0 -1, 1 1, 10 1, 14 1, 19 1, 39 1, 42 1, 52 1, 62 1, 67 1, 72 1, 74 1, 76 1, 78 1, 83 1}
Maybe another possibility is the read.matrix.csr function from the e1071 package, which is able to read files that look like:
1:-1 2:1 11:1 15:1 20:1 40:1 41:1 53:1 62:1 68:1 73:1 75:1 77:1 83:1 84:1
1:-1 2:1 11:1 15:1 20:1 40:1 41:1 53:1 64:1 68:1 74:1 75:1 77:1 83:1 84:1
1:1 2:1 11:1 15:1 20:1 40:1 41:1 53:1 64:1 68:1 74:1 75:1 78:1 81:1 84:1
1:-1 2:1 11:1 15:1 20:1 40:1 43:1 53:1 63:1 68:1 73:1 75:1 77:1 79:1 84:1
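So we just have to replace the space between index and value with :, remove the commas , and the curly brackets { and }, and increment the index before each : by one (maybe this helps: http://stackoverflow.com/questions/12941362/is-it-possible-to-increment-numbers-using-regex-substitution).
Hmm, this indeed looks doable?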
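A minimal sketch of that rewrite in base R, avoiding the regex-increment trick by converting the index to an integer. The helper name arffSparseToCSR is hypothetical, and the sketch assumes numeric values only.

```r
# Hypothetical helper: rewrite sparse ARFF rows as "index:value" lines.
arffSparseToCSR = function(rows) {
  # remove the curly brackets
  inner = gsub("[{}]", "", rows)
  # commas separate the entries of one row
  entries = strsplit(inner, ",", fixed = TRUE)
  vapply(entries, function(e) {
    e = trimws(e)
    e = e[nzchar(e)]
    # first token is the 0-based index, so shift it by one
    idx = as.integer(sub("[[:space:]].*$", "", e)) + 1L
    # everything after the first token is the value
    val = trimws(sub("^[^[:space:]]+", "", e))
    paste(idx, val, sep = ":", collapse = " ")
  }, character(1L))
}

row = "{0 -1, 1 1, 10 1, 14 1, 19 1, 39 1, 40 1, 52 1, 61 1, 67 1, 72 1, 74 1, 76 1, 82 1, 83 1}"
converted = arffSparseToCSR(row)
```

The result could be written to a temporary file with writeLines() and passed to e1071::read.matrix.csr(); whether that function accepts lines without a leading response column for these data sets would still need to be tested.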
we need to
a) throw an error now, if a sparse file is detected
b) figure out which sparse matrix reader on CRAN parses something close to sparse ARFF
Here is a list of sparse file DIDs: 350, 386, 391, 397, 401