openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

SparseARFF support #79

Closed berndbischl closed 10 years ago

berndbischl commented 10 years ago

We are currently converting some of our large-scale files from the libsvm sparse format to sparse ARFF.

The hope is that OpenML can support this on the server side natively.

Can I get some feedback on this?

janvanrijn commented 10 years ago

I assume you mean sparse ARFF as described in http://www.cs.waikato.ac.nz/ml/weka/arff.html ?
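(For readers following along: a sparse ARFF file lists only the non-zero entries of each instance as `{index value, …}` pairs, with 0-based attribute indices. A minimal made-up example:)

```
@RELATION example
@ATTRIBUTE f1 NUMERIC
@ATTRIBUTE f2 NUMERIC
@ATTRIBUTE f3 NUMERIC
@DATA
{0 1.5, 2 3.0}
{1 2.0}
```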

berndbischl commented 10 years ago

Yes.

berndbischl commented 10 years ago

I have forwarded the thread to Aydin, who might work on this on our side.

berndbischl commented 10 years ago

This would also help to resolve #42

joaquinvanschoren commented 10 years ago

FYI, I have already added an 'original data url' field to the dataset upload form. Thus, you can use that to point to the original version of the libsvm dataset somewhere online.

You can use the same field for your derived imbalanced datasets. Use, for instance, http://openml.org/d/1 as the url and we'll link the two.


berndbischl commented 10 years ago

Ok, great. This "original ref URL" seems a good idea in general.

aydindemircioglu commented 10 years ago

sorry, if i missed it: is the sparse arff format supported on the serverside or not?

@bernd: the RWeka reader supports sparse arff, at least i could read back my arff files. but reading larger datasets crashes with an out-of-memory-error. so a proper reader might indeed be necessary.

joaquinvanschoren commented 10 years ago

You didn't miss it. My guess is that it will upload fine, the question is whether the data qualities will be computed correctly.

Give it a try, and upload it through the website. If the data qualities fail we can fix that later.


aydindemircioglu commented 10 years ago

ok, i will try later today. for now the qualities are probably not that important, as our data is numeric, regardless of what it was 'in reality'.

janvanrijn commented 10 years ago

As Joaquin says, it should be supported (since we use the Weka ARFF loader). I have never tested it yet, though. Will do so ASAP and let you know.

berndbischl commented 10 years ago

@aydin: Ok, I don't understand how RWeka can support it, but still fail with out-of-memory... But whatever. Or is the problem that the data gets expanded to dense in R? Which we could never work around?

Regarding the data qualities: It is important that we have them on the server. We need them to find / select data sets for studies.

berndbischl commented 10 years ago

Another addon to the qualities:

In OpenML / R I kinda have two different data quality groups: a) Simple stuff that is also cheap to compute, e.g. number of features, number of observations, and so on.

b) Complex Meta-Learning stuff.

Note that b) might take LONG to compute for large data. And also note that I always look at a), but have never-ever needed b) until now. So maybe it would be a good idea to separate them on the server as well? Because sometimes computing b) might not be feasible.
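(The proposed split could be sketched roughly like this; a hypothetical illustration only, the actual OpenML quality names and thresholds may differ:)

```python
# Hypothetical sketch of separating data qualities into a cheap group a)
# that is always computed, and an expensive meta-learning group b) that
# is skipped when the dataset is too large. Names and the size threshold
# are illustrative only, not OpenML's actual implementation.

def compute_qualities(n_rows, n_cols, max_cells_for_expensive=10_000_000):
    qualities = {
        # group a: simple, cheap qualities
        "NumberOfInstances": n_rows,
        "NumberOfFeatures": n_cols,
    }
    if n_rows * n_cols <= max_cells_for_expensive:
        # group b: placeholder for expensive meta-learning qualities
        # (e.g. landmarkers); omitted here for brevity
        qualities["ExpensiveQualitiesComputed"] = True
    else:
        qualities["ExpensiveQualitiesComputed"] = False
    return qualities
```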

aydindemircioglu commented 10 years ago

@bernd: for me 'support' is a weak term: it works with small datasets (so it works in principle), but crashes with larger ones. as the error messages hint at some cryptic java heap problem, i do not think there is anything to be done in R. weka itself crashes on larger sets with the same memory problem, and the hint there is to increase the heap size. i assume RWeka just wraps this function, so if one can control the default heap size of RWeka, it will probably also read larger sparse arff files.

btw, the foreign package does not seem to support sparse format.
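(For reference, the libsvm sparse format the files are being converted from stores one instance per line as a label followed by `index:value` pairs with 1-based indices, listing only non-zero entries. A minimal made-up example:)

```
+1 1:0.5 3:1.2 7:0.3
-1 2:0.9 4:1.0
```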

berndbischl commented 10 years ago

Can you post the error message here (although I guess I already know it)?

You do indeed need to increase the java memory of the process. IIRC you can do this via rJava, which RWeka builds upon.

berndbischl commented 10 years ago

I mean the point is: We need to figure out whether the RWeka parser is "good enough" for this. I can live with setting the java heap mem option, if it works then. But if the parser is so bad it just wastes memory in general, we might need to rewrite it. Would be good if you could add an informed opinion on that topic ;-)

aydindemircioglu commented 10 years ago

RWeka:

> t = read.arff("../arthrosis/arthrosis.arff")
Error in .jnew("weka/core/Instances", .jcast(reader, "java/io/Reader")) :
  java.lang.OutOfMemoryError: GC overhead limit exceeded

weka:

Not enough memory (less than 50MB left on heap). Please load a smaller dataset or use a larger heap size.

Note: The Java heap size can be specified with the -Xmx option. E.g., to use 128MB as heap size, the command line looks like this:
java -Xmx128m -classpath ...
This does NOT work in the SimpleCLI, the above java command refers to the one with which Weka is started. See the Weka FAQ on the web for further info.
Exception in thread "Thread-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
  java.util.Arrays.copyOfRange(Arrays.java:2694)

berndbischl commented 10 years ago

How big is the data file in expanded mem? How many rows and cols?

berndbischl commented 10 years ago

Try to set the Xmx option via options(java.parameters=...). For example, this line is from my Rprofile (and probably useful for you too):

# so rJava does not kill the session on CTRL-C
options(java.parameters="-Xrs")

Maybe you have to do this before you load RWeka at all.

aydindemircioglu commented 10 years ago

sorry, forgot this: arthrosis has 262,142 points, and 178 features.

the sparse file format can be read in R with the read.matrix.csr() function, and it does so without problems. if i know how to measure memory in R i can report the memory footprint too

will try your java hints now.

berndbischl commented 10 years ago

if i know how to measure memory in R i can report the memory footprint too

Run gc() after the call and post the print out.

Also call object.size on the parsed data.

aydindemircioglu commented 10 years ago

uploading of a smaller dataset (australian) did work. http://openml.org/d/292

i will now try to upload arthrosis, while my system microbenchmarks read.arff.

aydindemircioglu commented 10 years ago

oops, i just realized that i am not allowed to upload arthrosis. i will pick another dataset, covtype, instead, with 581,012 points and 54 dimensions.

aydindemircioglu commented 10 years ago

uploading covtype also completed successfully. so far the webinterface seems pretty able to parse sparse arff with binary features.

aydindemircioglu commented 10 years ago

i'm giving up on microbenchmark; a simple system.time would have been better. here are the numbers:

Type 'q()' to quit R.

> options(java.parameters="-Xrs")
> options(java.parameters="-Xmx2048m")
> library(RWeka)
> library(microbenchmark)
> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 320803 17.2     467875 25.0   407500 21.8
Vcells 555131  4.3    1031040  7.9   905725  7.0
> d = read.arff("arthrosis.arff")
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   328204  17.6     597831  32.0   409792  21.9
Vcells 47358988 361.4   55453861 423.1 50112709 382.4
> object.size(d)
374358432 bytes
> system.time(read.arff("arthrosis.arff"))
   user  system elapsed
 51.285   0.240  50.449
> system.time(read.arff("arthrosis.arff"))
   user  system elapsed
 43.223   0.196  42.343
> system.time(read.arff("arthrosis.arff"))
   user  system elapsed
 41.244   0.248  39.862

this seems quite efficient-- here is the same test for the sparse data format, read using e1071, which always felt like being programmed in c64 basic:

Type 'q()' to quit R.

> library(e1071)
> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 185644 10.0     407500 21.8   350000 18.7
Vcells 379684  2.9     905753  7.0   864972  6.6
> d = read.matrix.csr("arthrosis.combined.scaled")
> gc()
            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    309813  16.6   38747714 2069.4  46090137 2461.5
Vcells 104266256 795.5  326016344 2487.4 407181076 3106.6
> system.time(read.matrix.csr("arthrosis.combined.scaled"))
   user  system elapsed
397.630   2.423 405.001
> object.size(d)
561983168 bytes

so from that point of view i would not want to rewrite the arff reader. but i need a much better sparse format reader. is there one available? last year i could only find e1071.

berndbischl commented 10 years ago

Just quickly "googled" via rseek, have you checked this?

http://www.rdocumentation.org/packages/futile.matrix/functions/read.matrix

aydindemircioglu commented 10 years ago

i had checked it back then, and just rechecked by reading arthrosis -- but it has been working on that file for more than an hour now, with no sign of finishing. reading a smaller file like australian raises questions: it seems that read.matrix reads whole lines as one string instead of parsing the entries. the manual talks about triplet form; i do not know what that is, but it does not seem to be the libsvm format.

berndbischl commented 10 years ago

the e1071 parser code is barely half a page of R code. glancing at it, it does not really seem to be good code. is a libsvm sparse parser important for you? should we reimplement it in C? we should also discuss this elsewhere, not here :)

aydindemircioglu commented 10 years ago

probably it's easier to wrap the original libsvm code, it should be c anyway. for now i'm set; if i am bored in the near future, i will look into the problem. but yes, we should discuss it somewhere else.

from my side, this issue can be closed, numeric features do work with the web-interface. (for other feature types like strings the issue should be reopened if necessary.)
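(For anyone who later picks up the parser question: the core of a libsvm-format reader is small. A rough illustration in Python, not the OpenML, mlr, or e1071 implementation; the function names here are made up:)

```python
def parse_libsvm_line(line):
    """Parse one libsvm-format line into (label, {index: value}).

    Indices in the file are 1-based, as in the libsvm format.
    Illustrative sketch only, not a production parser.
    """
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        idx, val = token.split(":")
        features[int(idx)] = float(val)
    return label, features


def parse_libsvm(lines):
    """Parse an iterable of lines, skipping blank and comment lines."""
    return [parse_libsvm_line(l) for l in lines
            if l.strip() and not l.lstrip().startswith("#")]
```

A real reader for files the size of covtype would stream the input and fill a sparse matrix structure directly instead of building one dict per row, which is where a C implementation (or wrapping libsvm's own reader) would pay off.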

berndbischl commented 10 years ago

@aydindemircioglu

1) If you want to discuss parsing of such machine learning files in R (independent of OpenML), open an issue in mlr later.

2) Please upload our files to OpenML then.