invkrh opened this issue 6 years ago
What is the data structure of the feature matrix X? Is it represented as sparse block matrices? Because the dimension is very high, you need L2 regularization. Unless the L2 regularization is too small, the size of a partition ($s$) does not have to be as large as $d$; in fact, $s$ depends only on a so-called "intrinsic" dimension, which is a small constant.
Sorry, I was not clear. Here is some more detailed information:
The input feature matrix `X` (~80 GB) is sparse (d = 2^26, n = 28k × 1000) and is split into 1000 partitions, so each partition holds around 28k data points.
The local `A` matrix used in the conjugate gradient step is also sparse. Its density is 14540288 / (67108865 × 28399) ≈ 7.63e-6, where 67108865 ≈ 2^26. The 14,540,288 Double values sum up to around 128 MB.
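For what it's worth, here is a back-of-the-envelope check of those numbers (a rough sketch assuming 8-byte doubles and ignoring sparse-index overhead, which presumably accounts for the gap up to the quoted ~128 MB):

```scala
// Rough memory check, assuming 8-byte doubles and ignoring sparse-index overhead.
val nnz      = 14540288L
val rows     = 67108865L                        // ≈ 2^26
val cols     = 28399L
val density  = nnz.toDouble / (rows * cols)     // ≈ 7.63e-6, as quoted above
val valuesMB = nnz * 8 / 1e6                    // ≈ 116 MB of raw Double values
println(f"density = $density%.3e, values ≈ $valuesMB%.0f MB")
```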
In the CG method, we have `val r = gradient - dir * lambda - A * (A.t * dir)`, where `gradient` is a DenseVector of dimension 2^26, which amounts to about 536 MB. In addition, several temporary vectors are also dense. This quickly causes the OOM exceptions we observed.
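To make the memory pressure concrete, here is a minimal sketch of that residual step (my own illustration, not code from the repo), assuming `A` is a local d × s Breeze `CSCMatrix` and `gradient`/`dir` are dense d-vectors; the sparse mat-vecs are written out explicitly, and the comments mark which temporaries are dense d-vectors:

```scala
import breeze.linalg.{CSCMatrix, DenseVector}

// Sketch of r = gradient - dir * lambda - A * (A.t * dir) with A sparse (d x s).
// Here d = 2^26 and s ≈ 28k per partition.
def cgResidual(gradient: DenseVector[Double],
               dir: DenseVector[Double],
               A: CSCMatrix[Double],
               lambda: Double): DenseVector[Double] = {
  val d = A.rows
  val s = A.cols

  // t = A^T * dir: a dense s-vector (s ≈ 28k doubles ≈ 0.2 MB, cheap)
  val t = DenseVector.zeros[Double](s)
  A.activeIterator.foreach { case ((i, j), v) => t(j) += v * dir(i) }

  // h = A * t: a dense d-vector (2^26 doubles ≈ 537 MB by itself)
  val h = DenseVector.zeros[Double](d)
  A.activeIterator.foreach { case ((i, j), v) => h(i) += v * t(j) }

  // r and the dir * lambda temporary are two more dense d-vectors,
  // so a single CG step materializes several ~537 MB arrays per partition.
  gradient - (dir * lambda) - h
}
```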
As you mentioned in the paper, we can have fewer data points than the dimension d, provided that we do backtracking line search. I am not sure I understand the intrinsic dimension correctly. Could you tell me more? How is it computed?
Sorry, it is actually the effective dimension.
Let $\lambda_j$ be the $j$-th eigenvalue of the Hessian matrix, for $j = 1$ to $d$, and let $\gamma$ be the regularization parameter. The effective dimension is defined as $\sum_{j=1}^{d} \lambda_j / (n \gamma + \lambda_j)$. If $\gamma = 0$, then the effective dimension is exactly $d$. If $n\gamma$ is comparable to the top eigenvalues, say the 10th largest eigenvalue, then the effective dimension is very small, just a few tens.
In the paper, the local sample size $s$ depends on $d$, which is a very loose bound. It actually depends on the effective dimension rather than $d$. Therefore, unless $\gamma$ is too small, the local sample size $s$ needn't be as large as $d$.
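For concreteness, here is a toy illustration of this definition (the eigenvalues below are made up; in practice one would only estimate the top of the spectrum rather than compute all $d$ eigenvalues):

```scala
// Effective dimension: d_eff = sum_j lambda_j / (n * gamma + lambda_j).
// Toy numbers only, for illustration.
val eigenvalues = Array(100.0, 50.0, 20.0, 5.0, 1.0, 0.1, 0.01)
val n           = 1000
val gamma       = 0.01                        // L2 regularization, so n * gamma = 10

val dEff = eigenvalues.map(l => l / (n * gamma + l)).sum
// With gamma = 0 each term is 1 and the sum equals d; the larger n * gamma is
// relative to the spectrum, the smaller the effective dimension becomes.
println(f"effective dimension ≈ $dEff%.2f")   // ≈ 2.8 here, far below d = 7
```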
As for the scalability problem, I think GIANT, as well as gradient descent methods, does suffer from memory and communication problems, because $d$ is too large. If you use only L2 regularization, things may not work out.
Since $d$ is so large, you may want to try L1+L2 (elastic net) regularization. It will make the model sparse and thus alleviate the communication and storage costs.
I would recommend first trying the built-in functions of Spark MLlib. If they are too slow, then you may want to try a variant of GIANT: https://github.com/wangshusen/SparkGiant/tree/master/src/main/scala/logisticl1l2 But that code is not well written and has not been tested on a cluster.
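For reference, a minimal sketch of the MLlib built-in route with an elastic-net penalty (a toy example; the DataFrame, column names, and parameter values are placeholders, not tuned for the CTR data above):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("elastic-net-sketch").getOrCreate()
import spark.implicits._

// Tiny toy DataFrame standing in for the real (label, sparse features) data.
val training = Seq(
  (1.0, Vectors.sparse(4, Array(0, 2), Array(1.0, 3.0))),
  (0.0, Vectors.sparse(4, Array(1, 3), Array(2.0, 1.0)))
).toDF("label", "features")

// elasticNetParam = 0.0 is pure L2, 1.0 is pure L1; values in between mix the two,
// which is what makes the fitted model sparse.
val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.01)
  .setElasticNetParam(0.5)

val model = lr.fit(training)
println(s"non-zero coefficients: ${model.coefficients.numNonzeros}")
```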
I just reimplemented GIANT in Spark and tested it on real-world data for CTR prediction, which is very high dimensional (d = 2^26). In order to have a better Hessian approximation, each RDD partition has to be very large, say at least 2^26 data points. Even when memory permits, the Spark job can never finish in my tests, due to executor heartbeat timeouts.
What is the best way to deal with this kind of data in GIANT? Thanks