quinngroup / dr1dl-pyspark

Dictionary Learning in PySpark
Apache License 2.0

Select top R #35

Closed by magsol 8 years ago

magsol commented 8 years ago

@LindberghLi:

I'm helping @MOJTABAFA with unit testing, and in particular the selection of the top R elements seems to be critical. In checking that our function works correctly, I've been referencing the C++ implementation (line 115) and I have a question.

Specifically, if nth_element provides a partial sorting of the array wherein the larger elements are on the left of the nth element, and the smaller elements are to the right, why then is the loop over all N elements on line 122 needed? Surely you only need to loop over the first R elements (guaranteed to evaluate to true in the if statement on line 124). Or am I missing something?
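The same partial-selection idea carries over to Python. As a rough sketch (the function name `top_r` and the use of raw values rather than magnitudes are my assumptions, mirroring the pseudocode's `top(v, R)`), `numpy.argpartition` behaves like C++ `nth_element`: it only partially orders the array, placing the R largest entries in the first R slots, so only those R indices need to be examined afterward.

```python
import numpy as np

def top_r(v, R):
    """Return the indices of the R largest elements of v.

    Like C++ nth_element, np.argpartition only partially orders the
    array: the R largest entries land in the first R positions
    (unsorted among themselves), so no loop over all N elements is
    needed afterward.
    """
    # Partition on -v so the largest values come first, then keep
    # only the first R indices.
    return np.argpartition(-v, R - 1)[:R]
```

This runs in O(N) on average, versus O(N log N) for a full sort.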

XiangLi-Shaun commented 8 years ago

@magsol YES!!! You are right; that is a big issue I essentially overlooked during the coding. Since v is partially sorted, I will not need to construct idxs_n or the other bookkeeping, but can just use the first R elements of v. I'll change the C++ code and the pseudocode immediately. Thanks!

magsol commented 8 years ago

Sounds great, thanks.

To cement my understanding, I rewrote your pseudocode on the wiki for this project. Please have a look and let me know if anything is wrong or missing. We will likely have to consolidate many of these operations for them to scale on Spark, so an intimate understanding of the algorithm is crucial to make this work as efficiently as possible.

https://github.com/quinngroup/pyspark-dictlearning/wiki/Rank-1-Dictionary-Learning-Pseudocode

In particular: what exactly are the output variables D and Z? I know D is the dictionary, the basis vectors, but what is Z?

XiangLi-Shaun commented 8 years ago

@magsol I reviewed the pseudocode and it's generally good. I have updated it in several minor places. Regarding the line:

idxs_n = top(v, R)

according to my testing, doing the full sorting would be very inefficient in the non-parallel scenario. But for the parallel case it might be different.

All the u_new vectors will be stored in the D matrix (for a total of M vectors); all the v vectors will be stored in the Z matrix (for a total of M vectors).

magsol commented 8 years ago

according to my testing, doing the full sorting would be very inefficient in the non-parallel scenario. But for the parallel case it might be different.

No question; I put that comment there just to provide some intuition for users who may be curious what "top-R" selection is. It's more easily explained using sorting, but it's more efficiently implemented using other methods, which is why I referred to sorting as the "naive" approach.

The diff = 2 * epsilon statement is just to keep the code coherent. It is essentially a "while" loop conditioned on two criteria: the number of iterations and the convergence delta epsilon. Since the vectors used to compute the convergence delta are randomly initialized, I wanted to guarantee they could never "randomly" have a diff smaller than epsilon, so I set the initial diff to 2 * epsilon. That cleans up the code a bit by not requiring a break statement.
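The loop structure described above can be sketched as follows (a minimal illustration, not the project's actual code; the `update` callback and function name are placeholders):

```python
import numpy as np

def iterate_until_converged(update, u_init, epsilon, max_iters):
    """Run `update` until the change falls below epsilon or the
    iteration budget is exhausted.

    Initializing diff to 2 * epsilon makes the while condition true on
    the first pass, so no do-while construct or break statement is
    needed even though u_init is random.
    """
    u_old = u_init
    diff = 2 * epsilon  # strictly greater than epsilon: always enter loop
    iters = 0
    while diff > epsilon and iters < max_iters:
        u_new = update(u_old)
        diff = np.linalg.norm(u_new - u_old)
        u_old = u_new
        iters += 1
    return u_old
```

A do-while loop would express the same intent, but Python lacks one; seeding diff above the threshold is the idiomatic workaround.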

What is Z (other than the store of v vectors)? What is the interpretation of this matrix?

XiangLi-Shaun commented 8 years ago

@magsol I have updated the document. In real applications D and/or Z may be useful depending on the problem specification: D contains the basis signals used to code the data, and Z contains the information about how the coding is done.
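Reading between the lines of this exchange (my interpretation, not stated explicitly in the thread): if the u_new vectors form the columns of D and the v vectors form the rows of Z, then the product D Z is the sum of the M rank-1 terms u_m v_m^T, which is the standard low-rank reconstruction of the data in dictionary learning. A toy check with made-up dimensions:

```python
import numpy as np

# Toy dimensions, purely for illustration: M atoms, T x P data.
T, P, M = 4, 6, 2
rng = np.random.default_rng(0)
U = [rng.standard_normal(T) for _ in range(M)]  # the u_new vectors
V = [rng.standard_normal(P) for _ in range(M)]  # the v vectors

D = np.column_stack(U)  # T x M: each column is one dictionary atom
Z = np.vstack(V)        # M x P: each row records how that atom is used

# D @ Z equals the sum of the M rank-1 outer products u_m v_m^T.
recon = sum(np.outer(u, v) for u, v in zip(U, V))
print(np.allclose(D @ Z, recon))  # True
```

So D answers "what are the basis signals?" and Z answers "how much of each basis signal appears in each sample?".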