Implementation bug in ColumnIterator?

wuciawe commented 8 years ago

In the line of the file, it looks like it should be val cols = Array.fill[Int](matrix.rows)(index) instead of val cols = Array.fill[Int](matrix.cols)(index) ?

rjagerman commented 8 years ago

I think you're right, that looks like a bug, it wouldn't make much sense to use matrix.cols there. Thanks for the report! The fix should be really easy, feel free to submit a pull request or I can do it as well later today.

Seems like this is missed by the unit tests. We'll also have to write some unit tests that test for this case, so it doesn't retroactively happen again in the future.

wuciawe commented 8 years ago

create a pull request #50 , only target to simply fix this bug, without adding tests. This method is protected, I have very limited experience in writing tests, so I don't know how to write tests to non-public method properly. also tests are needed for RowIterator etc.

wuciawe commented 8 years ago

oh, I think there is still a potential bug, val rows = (0L until matrix.rows).toArray and val cols = Array.fill[Int](matrix.rows.toInt)(index), there may exceed the limit length of Array.

rjagerman commented 8 years ago

Thanks, I'll look into that! :+1: I think it would be best to add a check to see if matrix.rows exceeds the size of an integer and return a failed future otherwise.

For the tests, I'll see if I can write some up tomorrow.

rjagerman commented 8 years ago

The check for matrix.rows size has been added in PR #51.

wuciawe commented 8 years ago

And I managed to add tests for fetchNextFuture in ColumnIterator and RowBlockIterator in #52 .

rjagerman commented 8 years ago

Thanks! It looks great :+1:

wuciawe commented 8 years ago

eh, after further reading the code, I find the real leak in the init implementation, in this block of code, it shortcuts the cols as the test has a smaller row. So the appropriate test should be something like in #53 . And the tests in #52 is meaninglesss, as they should be tested via public interface next.

By the way, in the near future, I may have to implement logistic regression with parameter server. Do you know whether there have somebody already been working on that based on glint?

rjagerman commented 8 years ago

Yeah, now that you say it, it's probably better to test the public interface next, it also makes the tests a bit more readable.

I think @mlnick has been working on a basic logistic regression implementation. I plan to implement logistic regression, SVMs, etc. myself but haven't gotten around to it yet due to other obligations. One of the major difficulties I ran into is regularization. L2 regularization would require scalar multiplication on the entire distributed vector, which is not yet implemented and will probably require some code refactoring at certain parts. Unregularized linear methods should be relatively easy to implement though.

wuciawe commented 8 years ago

I roughly think about it today, I find it hard to enable arbitrary user define function on the server side. If I have enough time, I will take a look at the code of parameterserver.org to see how they handle the problem.

wuciawe commented 8 years ago

L2 regularization would require scalar multiplication on the entire distributed vector

Do you mean you want something like: to update the parameter vector?

So the approach will be something like:

in each iteration:

the server issues tasks to the worker nodes to compute $(h(x^i) - y^i)x_j^i$
the server collects the results from the worker nodes
the server side do scalar multiplication on parameter vector and then minus them with corresponding parameter (looks like a temporary vector is needed to store those collected results from worker nodes)

do I guess right?

rjagerman commented 8 years ago

Yes, the idea is correct, but I should note that the tasks are not issued by the parameter servers and instead job scheduling is intended to be handled by Spark.

Spark issues workers to operate on subsets of some data set. These workers then asynchronously pull parts of the model and push updates to the parameter server. So the point of view is definitely different from other parameter server implementations, because Glint functions more akin to a key-value store. It has no concept of tasks or workers and merely provides distributed matrices and vectors that can be read and updated efficiently. So it would look something like this:

At the beginning of each iteration, the Spark driver will need to push a scalar multiplication to the parameter servers ($theta_j = theta_j(1 - \alpha \lambda / m$)
Spark issues workers to operate subsets of the data (rdd.foreachPartition { ... })
Each worker pulls (part of) the model it needs
Each worker locally computes gradient $(h(x^i) - y^i)x_j^i$
Each worker pushes its gradient to the parameter servers

Perhaps you will find this interesting as well: https://github.com/rjagerman/glintlda. It is an implementation of LightLDA using Glint. Although it is a completely different algorithm, it provides a good example of using Glint for a complex ML task. The code may be a bit hard to read though due the many optimizations that it uses.

wuciawe commented 8 years ago

I've read the code of glintlda yesterday, and currently reading the code of parameterserver.org's code optimizing with L1L2 reg. The code of parameterserver.org is c++(didn't program with c++ for years) and the project is quite large, it may cost serveral days to understand the implementation of parameterserver.org.

With your proposal on logistic regression with L2, it has strong consistent (different from glintlda, which I think it's bounded staling, as you use futures with locks to enable little inconsistency. am I right? or what is the usage of semaphore? ). If I'm right, it maybe also possible to be bounded staling with something like in glintlda. and multiply a scale is something like a mapper function.

But with L1, I think it needs to be strong consistent or we need to keep the old gradients of delayed works so as to enable bounded staling, which may increase the complexity of the system. and with L1, it will require something like mapper function to update weights.

And I also find that in yahoo's implementation, the parameter server provides map, reduce, and aggregate interface on server side. That may make the system more powerful.

MLnick commented 8 years ago

Hi there

I have been working on some basic linear models with Glint - mostly logistic regression. I've used an implementation based on the older Spark MLlib gradient descent primitives.

It's fairly easy to do LR with L2 regularization and I get the same results as MLlib. However, I haven't really run larger scale experiments yet where the async impact on consistency could be greater. L1 is more challenging though and I think will definitely require some form of computation on the param servers themselves. How to handle that in Glint is an interesting issue - I think there must be some form of UDFs supported for these types of operations - whether supporting shipping some function to the servers, or plugging in a custom Actor of some sort.

wuciawe commented 8 years ago

@rjagerman @MLnick Hi, In case of large scale, where even each of the item is sparse, the items splitted into each executor may require a large portion set of parameters, and thus require large amount of memory and cause some problems. I think we can use the block coordinate descent can be used to avoid this kind of problem.

rjagerman / glint

Implementation bug in ColumnIterator? #49