szilard closed this issue 8 years ago
Thanks @mengxr for comments and code in https://github.com/szilard/benchm-ml/issues/5.
In fact I never complained about Spark speed/memory footprint concerning logistic regression. A few points:
The accuracy issue seems to be with `LogisticRegressionWithLBFGS`, so hopefully using 1.4 and your code (`LogisticRegression`) will fix the issue. Gonna try soon.

@mengxr Re memory footprint:
It's not only the size of the RDDs. The way it should be measured is: what is the lowest RAM with which the code can run, read the data, train etc. and never spill to disk (the whole RAM used by the Java process). The way I could measure it is by lowering the `--driver-memory` and `--executor-memory` options to the minimum values for which the code does not crash/spill to disk.

Instead, I did what I do with the other tools: I set a large value for these parameters. I know Spark will be greedy, use more RAM this way and not GC, so this will overestimate. I've been looking at RAM usage with `htop` or `ps aux`.
It's not a big deal for LR as I said before, but to be more precise, here is some experimentation (for the `n = 1M` dataset as in your comments, Spark 1.4 and your code), on a 60GB box:

- `--driver-memory 30G --executor-memory 30G`: RAM ~10G after reading/caching the data, about the same after training; LR runtime 8.15 sec.
- `--driver-memory 1G --executor-memory 1G`: using ~1.3G; LR runtime 7.39 sec.
- `--driver-memory 500m --executor-memory 1G`: `WARN MemoryStore: Not enough space to cache rdd_39_22 in memory`; LR runtime 30.86 sec.
- `--driver-memory 500m --executor-memory 500m`: crash while reading the data: `java.lang.OutOfMemoryError: Java heap space`.

So RAM usage is not 200MB, but more like ~1.3GB (that is, the minimum free RAM it can run with on a box; though then I'd need to measure the other tools the same way as well). Again, for LR no big deal: all tools work nicely on 10M records :)
@mengxr When running your LR code on 1.4 the training goes fine, but the AUC calculation crashes:

```
scala> evaluator.evaluate(model.transform(d_test))
[Stage 112:> (0 + 0) / 32]Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: PermGen space
        at sun.misc.Unsafe.defineClass(Native Method)
        at sun.reflect.ClassDefiner.defineClass(ClassDefiner.java:63)
```

Any idea how to fix this? I cannot get the AUC :(
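(Not discussed in the thread, but a common workaround for `PermGen space` OOMs on Java 7 is to enlarge the permanent generation when launching the shell; the 512m value below is an arbitrary illustration:)

```shell
# Hypothetical workaround sketch: raise the PermGen limit for driver and executors.
spark-shell \
  --driver-java-options "-XX:MaxPermSize=512m" \
  --conf spark.executor.extraJavaOptions="-XX:MaxPermSize=512m"
```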
@mengxr I managed to run your LR code on spark-1.4.0-bin-hadoop2.4 (I previously tried spark-1.4.0-bin-hadoop2.6 and it was crashing, see above).
Good news: the accuracy issue with logistic regression is fixed with your code and 1.4.
I also changed the memory footprint results along the lines discussed above in this issue.
One thing to note: the code you provided runs slower than my original code for `n=10M`: previously 15 sec, now 35 sec. I used `LogisticRegressionWithLBFGS`; your code uses `LogisticRegression`, and I'm not sure what that is calling under the hood.
Let me know if you want to discuss this (speed) or the memory footprint more, otherwise I'll close the issue (the accuracy is fixed).
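For reference, the two entry points compared above look roughly like this (a sketch against the Spark 1.4 APIs; `trainingRDD` and `trainingDF` are made-up names for an `RDD[LabeledPoint]` and a DataFrame with label/features columns):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.ml.classification.LogisticRegression

// Old RDD-based API (spark.mllib): runs L-BFGS directly on an RDD.
val mllibModel = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(trainingRDD)

// New DataFrame-based API (spark.ml): a Pipeline-style estimator;
// in 1.4 this is also L-BFGS-based under the hood.
val mlModel = new LogisticRegression()
  .setMaxIter(100)
  .fit(trainingDF)
```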
Where could I get spark-train-10m.csv and spark-test-10m.csv to reproduce the Spark linear model benchmark? How did you do the preprocessing, since it differs from the R-generated data?
@petro-rudenko All the above (in this thread/issue) has been done with the 1M row training set, not 10M. You can find the 1-hot encoded 1M rows training set and the 100K row test set with the links in the first comment in this thread/issue.
However, I think it would be better if you start with the original dataset and do the 1-hot encoding in Spark (AFAIK there are helpers for that now in 1.4, but I have not tried them). So, how about using this 1M training set https://s3.amazonaws.com/benchm-ml--main/train-1m.csv and this 100K test set https://s3.amazonaws.com/benchm-ml--main/test.csv ? Then you can do the 1-hot encoding whichever way you think fits best. Let's see how it performs AUC- and runtime-wise, and if you submit code that works directly with the original dataset, I can switch to your code in the repo.
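The 1.4 helpers alluded to here are `StringIndexer` and `OneHotEncoder` in `spark.ml`; a sketch (the column name `Dest` and the DataFrame `df` are assumptions, adapt to the actual train-1m.csv schema):

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Map a categorical string column to indices, then expand to a sparse 1-hot vector.
val indexer = new StringIndexer()
  .setInputCol("Dest")       // hypothetical categorical column
  .setOutputCol("DestIndex")
val encoder = new OneHotEncoder()
  .setInputCol("DestIndex")
  .setOutputCol("DestVec")

val encoded = encoder.transform(indexer.fit(df).transform(df))
```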
@petro-rudenko The 1-hot encoding is now done in Spark: https://github.com/szilard/benchm-ml/blob/master/z-other-tools/5xa-spark-1hot.txt (thanks to Joseph Bradley of Databricks).
Splitting #5 in two: logistic regression here and random forest in a different issue.
Summary: Logistic regression has lower AUC in Spark.
For `n=1M`, Spark gets AUC = 0.703 while R/Python etc. get AUC = 0.711.

Code here: https://github.com/szilard/benchm-ml/blob/master/1a-spark-logistic/spark.txt
Train data here: https://s3.amazonaws.com/benchm-ml--spark/spark-train-1m.csv
Test data here: https://s3.amazonaws.com/benchm-ml--spark/spark-test-1m.csv
Spark version used: 1.3.0
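The AUC evaluation that appears at the `scala>` prompt earlier in the thread (`evaluator.evaluate(model.transform(d_test))`) corresponds to something like the following sketch with the `spark.ml` evaluator (`model` and `d_test` assumed from the thread's code):

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Default metric is areaUnderROC; expects the rawPrediction column
// that LogisticRegressionModel.transform produces.
val evaluator = new BinaryClassificationEvaluator()
val auc = evaluator.evaluate(model.transform(d_test))
```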