yunseong opened this issue 8 years ago
@jsjason, @johnyangk, @beomyeol, @gyeongin and I had a discussion, and it'd be good to check the following:
- `ComputeTask`s' computation time per iteration.
- Accuracy per iteration:
  - 0th iteration accuracy: 0%
  - 1st iteration accuracy: 65.96%
  - 2nd iteration accuracy: 65.96%
  - 3rd iteration accuracy: 65.96%
The results were almost the same in Vortex (similar accuracy values, fixed after the "1-th iteration", which is actually the 2nd one), as I ported most of the algorithm from Dolphin's. Even when using the full data set, the result was still similar. Spark, on the other hand, behaves differently: 1) the accuracy grows as iterations proceed, and 2) the final accuracy is higher for the same number of iterations. It'd be worth taking a look at the algorithm for correctness.
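As a correctness baseline, here is a minimal batch-gradient logistic regression loop in plain Scala (a hypothetical toy sketch, not Dolphin/Vortex/Spark code; `LrSanityCheck`, the toy dataset, and the step size are all made up for illustration). With a working update rule, accuracy on even a tiny separable set should move toward 100% across iterations rather than freezing after the first one:

```scala
// Hypothetical toy sketch (plain Scala arrays, no breeze): a minimal
// batch-gradient logistic regression loop for sanity-checking the update rule.
object LrSanityCheck {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (a, b) => a * b }.sum

  // Fraction of examples whose 0.5-thresholded prediction matches the label.
  def accuracy(w: Array[Double], xs: Array[Array[Double]], ys: Array[Int]): Double =
    xs.zip(ys).count { case (x, y) =>
      (if (sigmoid(dot(w, x)) >= 0.5) 1 else 0) == y
    }.toDouble / ys.length

  def main(args: Array[String]): Unit = {
    // Tiny separable toy set (made up): label is 1 iff x0 > x1; last feature is a bias.
    val xs = Array(Array(1.0, 3.0, 1.0), Array(3.0, 1.0, 1.0),
                   Array(0.0, 2.0, 1.0), Array(2.0, 0.0, 1.0))
    val ys = Array(0, 1, 0, 1)
    val w  = Array.fill(3)(0.0)
    val stepSize = 0.5
    for (iter <- 0 until 10) {
      val grad = Array.fill(3)(0.0)
      for ((x, y) <- xs.zip(ys)) {
        val err = sigmoid(dot(w, x)) - y            // prediction error
        for (j <- w.indices) grad(j) += err * x(j)  // accumulate batch gradient
      }
      for (j <- w.indices) w(j) -= stepSize * grad(j) / xs.length
      println(s"iter $iter accuracy ${accuracy(w, xs, ys)}")
    }
  }
}
```

If the accuracy printed by a loop like this stays flat from the first iteration on, the gradient accumulation or model-update step is a likely suspect.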
If there are more things you want to check or ask, please feel free to add them. After that, I think this issue can be split into multiple items.
I pushed a branch named `gy-lr-test`, which uses the Scala library `breeze` instead of `mahout`. I made some changes to save memory and improve performance. With the entire URL reputation dataset on the R730-02 machine (48-core CPU, 128GB memory), the job took 263 seconds.
This is the command I used for the test:

```
./run_logistic.sh -dim 3231961 -maxIter 20 -stepSize 1.0 -lambda 0.01 -local false -split 8 -input /total -output output_logistic -maxNumEvalLocal 5 -isDense false -evalSize 1000 -timeout 1200000
```
The final model accuracy was 93.421%.
@gyeongin Thanks for sharing the result. This looks awesome! Could you kindly give us a very short summary of the changes you made for better performance and memory usage?
@gyeongin Great!
Changes I made:

- `breeze` instead of `mahout` (`mahout` does not support in-place update)
- `DenseVector` for `model` and `cumGradient` (dv + sv is much faster than sv + sv)
- `ComputeTask`s update the model only once in each iteration
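The dense-accumulator change can be illustrated with a rough sketch in plain Scala (hypothetical `SparseVec`/`addInPlace` names and data, made up for illustration; breeze itself is not needed here). Adding a sparse gradient into a dense array in place touches only the non-zero slots and allocates nothing, whereas sparse + sparse has to rebuild the result's index/value arrays on every addition:

```scala
// Hypothetical sketch of accumulating sparse per-example gradients into a
// dense cumGradient in place (the dv + sv case from the list above).
object DenseAccumulator {
  // A sparse gradient: parallel arrays of non-zero indices and their values.
  final case class SparseVec(idx: Array[Int], values: Array[Double])

  // In-place dense += sparse: O(nnz) per update, no intermediate allocation.
  def addInPlace(dense: Array[Double], sv: SparseVec): Unit = {
    var k = 0
    while (k < sv.idx.length) {
      dense(sv.idx(k)) += sv.values(k)
      k += 1
    }
  }

  def main(args: Array[String]): Unit = {
    val cumGradient = Array.fill(8)(0.0)  // dense accumulator, allocated once
    val grads = Seq(
      SparseVec(Array(1, 3), Array(0.5, -1.0)),
      SparseVec(Array(3, 7), Array(0.25, 2.0))
    )
    grads.foreach(addInPlace(cumGradient, _))
    println(cumGradient.mkString(", "))
    // prints: 0.0, 0.5, 0.0, -0.75, 0.0, 0.0, 0.0, 2.0
  }
}
```

With high-dimensional but sparse examples (like the 3,231,961-dimension URL reputation features), keeping only the accumulator dense gets the cheap per-update cost without densifying the inputs.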
We've run experiments using the same LR algorithm with the URL reputation dataset on multiple frameworks: Dolphin and Vortex. As @beomyeol mentioned at the meeting, we've seen some performance issues in areas such as vector computation and data loading. We can also take a look at Spark, because it can run the LR algorithm and its performance turned out to be much faster than Vortex's (not sure compared to Dolphin yet).
This issue aims to investigate the performance of these frameworks, since we can run the same algorithm on the same data set. It would be great if we can find some points where performance can be improved.
As a first step, I'll run the experiment on a Microsoft YARN cluster which consists of 20 machines (8-core CPU, 8GB RAM, YARN 2.7.1).