szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
MIT License

Spark logistic regression issues #17

Closed - szilard closed this issue 8 years ago

szilard commented 9 years ago

Splitting #5 in two: logistic regression here and random forest in a different issue.

Summary: Logistic regression has lower AUC in Spark.

For n = 1M, Spark gets AUC = 0.703 while R/Python etc. get AUC = 0.711.

Code: https://github.com/szilard/benchm-ml/blob/master/1a-spark-logistic/spark.txt
Train data: https://s3.amazonaws.com/benchm-ml--spark/spark-train-1m.csv
Test data: https://s3.amazonaws.com/benchm-ml--spark/spark-test-1m.csv

Spark version used: 1.3.0
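
For reference, a minimal sketch of computing AUC with the 1.3 RDD-based API (the linked spark.txt has the actual benchmark code; the `train`/`test` RDD[LabeledPoint] values are assumptions here):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// train, test: RDD[LabeledPoint] over the 1-hot encoded data (assumed to exist)
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(train)
// clearThreshold() makes predict() return raw scores rather than 0/1 labels,
// which is what a meaningful ROC curve (and hence AUC) requires
model.clearThreshold()
val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
val auc = new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
```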

szilard commented 9 years ago

Thanks @mengxr for comments and code in https://github.com/szilard/benchm-ml/issues/5.

In fact, I never complained about Spark's speed or memory footprint for logistic regression. A few points:

  1. "The linear models are not the primary focus of this study because of their not so great accuracy vs the more complex models (on this type of data). They are analysed here only to get some sort of baseline." "The main conclusion here is that it is trivial to train linear models even for n = 10M rows virtually in any of these tools on a single machine in a matter of seconds." So, I think all tools are doing great in LR for n=10M, including Spark. So if something runs for 5 sec or 30sec, it does not matter a lot (to me in this benchmark). Also if it uses 2GB or 10GB, since both are ridiculously small.
  2. The only thing I noticed/complained about regarding Spark logistic regression is that it has lower AUC than the other tools: https://github.com/szilard/benchm-ml/blob/master/1-linear/x-plot-auc.png. I've been using 1.3 and LogisticRegressionWithLBFGS, so hopefully using 1.4 and your code (LogisticRegression; see the sketch of the 1.4 API after this list) will fix the issue. Gonna try soon.
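
For reference, a rough sketch of the 1.4 DataFrame-based API mentioned above (the parameter values and the `trainDF` DataFrame are assumptions here, not @mengxr's exact code):

```scala
import org.apache.spark.ml.classification.LogisticRegression

// DataFrame-based ml pipeline API, new in Spark 1.4; trainDF is assumed to
// have a "label" column (0/1) and a "features" vector column
val lr = new LogisticRegression()
  .setMaxIter(100)  // assumed iteration cap for the L-BFGS solver
  .setRegParam(0.0) // assumed: no regularization, to match the other tools
val mlModel = lr.fit(trainDF)
```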

szilard commented 9 years ago

@mengxr Re memory footprint:

It's not only the size of RDDs.

The way it should be measured is: what is the lowest amount of RAM on which the code can run, read the data, train etc. without ever spilling to disk (counting the whole RAM used by the Java process)? The way I could measure that is by lowering the --driver-memory and --executor-memory options to the minimum values at which the code does not crash or spill to disk.

Instead, I did what I do with the other tools: I set a large value for those parameters. I know Spark will be greedy and use more RAM this way and will not GC, so this will overestimate. I've been looking at RAM usage with htop or ps aux. As I said before, it's not a big deal for LR, but to be more precise, here is some experimentation (for the n = 1M dataset as in your comments, Spark 1.4 and your code):

On a 60GB box:

- `--driver-memory 30G --executor-memory 30G`: ~10G RAM after reading/caching the data, about the same after training; LR runtime 8.15 sec.
- `--driver-memory 1G --executor-memory 1G`: using ~1.3G; LR runtime 7.39 sec.
- `--driver-memory 500m --executor-memory 1G`: `WARN MemoryStore: Not enough space to cache rdd_39_22 in memory`; LR runtime 30.86 sec.
- `--driver-memory 500m --executor-memory 500m`: crash while reading the data with `java.lang.OutOfMemoryError: Java heap space`.

So, RAM usage is not 200MB, but more like ~1.3GB (that is, ~1.3GB is the minimum free RAM it can run on - but then I'd need to measure the other tools the same way as well). Again, for LR it's no big deal, all tools work nicely on 10M records :)

szilard commented 9 years ago

@mengxr When running your LR code on 1.4 the training goes fine, but the AUC calculation crashes:

```
scala> evaluator.evaluate(model.transform(d_test))
[Stage 112:>                                                       (0 + 0) / 32]Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: PermGen space
        at sun.misc.Unsafe.defineClass(Native Method)
        at sun.reflect.ClassDefiner.defineClass(ClassDefiner.java:63)
```

Any idea how to fix this? I cannot get the AUC :(
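
For context, the `evaluator` above presumably comes from something like the following sketch of the 1.4 ml evaluation step (`model` and `d_test` are assumed). A general workaround for `OutOfMemoryError: PermGen space` on pre-Java-8 JVMs is to raise the PermGen cap when launching spark-shell, e.g. with `--driver-java-options "-XX:MaxPermSize=512m"`, though that is a generic suggestion, not something verified in this thread.

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// minimal sketch of the AUC step, assuming the 1.4 ml pipeline API;
// d_test needs "label" and "features" columns; areaUnderROC is the default metric
val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderROC")
val auc = evaluator.evaluate(model.transform(d_test))
```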

szilard commented 9 years ago

@mengxr I managed to run your LR code on spark-1.4.0-bin-hadoop2.4 (I previously tried spark-1.4.0-bin-hadoop2.6 and it was crashing, see above).

Good news: the accuracy issue with logistic regression is fixed with your code and 1.4.

I also changed the memory footprint results along the lines discussed above in this issue.

One thing to note: the code you provided runs slower than my original code for n=10M: previously 15 sec, now 35 sec. I used LogisticRegressionWithLBFGS, while your code uses LogisticRegression; I'm not sure what that calls under the hood.

Let me know if you want to discuss this (speed) or the memory footprint further; otherwise I'll close the issue (the accuracy is fixed).

petro-rudenko commented 8 years ago

Where could I get spark-train-10m.csv and spark-test-10m.csv to reproduce the Spark linear model benchmark? How did you do the preprocessing, since it differs from the R-generated data?

szilard commented 8 years ago

@petro-rudenko All of the above (in this thread/issue) has been done with the 1M-row training set, not the 10M one. You can find the 1-hot encoded 1M-row training set and the 100K-row test set via the links in the first comment of this thread/issue.

However, I think it would be better if you started with the original dataset and did the 1-hot encoding in Spark (AFAIK there are helpers for this in 1.4 now, but I have not tried them; see the sketch below). So, how about using this 1M training set https://s3.amazonaws.com/benchm-ml--main/train-1m.csv and this 100K test set https://s3.amazonaws.com/benchm-ml--main/test.csv ? Then you can do the 1-hot encoding whichever way you see fit. Let's see how it performs AUC- and runtime-wise, and if you submit code that works directly with the original dataset, I can switch to your code in the repo.
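
For reference, a rough sketch of those 1.4 helpers (`StringIndexer` + `OneHotEncoder` from ml.feature); the column name "Month" and the `df` DataFrame are illustrative assumptions, not tested benchmark code:

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// map string categories to numeric indices, then 1-hot encode the indices;
// "Month" is an assumed example column from the original airline dataset
val indexer = new StringIndexer().setInputCol("Month").setOutputCol("MonthIdx")
val indexed = indexer.fit(df).transform(df)
val encoder = new OneHotEncoder().setInputCol("MonthIdx").setOutputCol("MonthVec")
val encoded = encoder.transform(indexed)
```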

szilard commented 8 years ago

@petro-rudenko The 1-hot encoding is now done in Spark: https://github.com/szilard/benchm-ml/blob/master/z-other-tools/5xa-spark-1hot.txt (thanks to Joseph Bradley of Databricks).