Discussion on C++ implementation of keyness

koheiw commented 5 years ago

@jtatria, the code looks good.

Some questions/comments:

Margins are always the same so why don't you get that in the main function? https://github.com/quanteda/quanteda/blob/873ef4b78c47ea1c3a2786f41a394671a2d47548/src/keyness_mt.cpp#L33-L35

If a DFM has only two rows, dense matrix would be more efficient. https://github.com/quanteda/quanteda/blob/873ef4b78c47ea1c3a2786f41a394671a2d47548/src/keyness_mt.cpp#L123

jtatria commented 5 years ago

Thanks for the feedback, Kohei. I'll answer inline below:

Margins are always the same so why don't you get that in the main function?

https://github.com/quanteda/quanteda/blob/873ef4b78c47ea1c3a2786f41a394671a2d47548/src/keyness_mt.cpp#L33-L35

Mostly for separation of concerns. The fact that the specific functions make use of the marginals is contingent on the different measures used, which could also use some other global values from the DFM. In that case, I think its best to let each functor figure out its own parameters, without having to pass the marginals into the lambda factory function. I was also on the fence about which params to pass to the lambdas themselves... the last time I implemented something like this, the lambda constructor had access to the full matrix, but the lambda itself received both values and indices, in case some of the lambda variants needed access to other parts of the data matrix.

In any case, having it in the lambda constructor or the main function still executes that code only once per call to textstat_keyness, even though it is repeated verbatim in each lambda constructor... I could try to refactor it to consolidate that operation in a dedicated function or find some other solution, but given the ephemeral lifetime of C++ objects in R's execution model, it may be not worth it.

In other words: I would recommend having a standard function signature for these type of lambda functions, using only ( data X, metaparameters params... ) and letting the constructor extract its intermediate args, instead of ( data X, intermediate args for lambda args..., metaparameters params... ), to ease code reuse.

What do you think?

If a DFM has only two rows, dense matrix would be more efficient.

https://github.com/quanteda/quanteda/blob/873ef4b78c47ea1c3a2786f41a394671a2d47548/src/keyness_mt.cpp#L123

Yes, I agree that the current execution model using a doc's term vector and a reference's term vector could be done the same with dense vectors, but 1) I preferred to mess as little as possible with data structures passing in and out of the function, and 2) I would think that the logical extension to these parts of the code would be to consolidate different measures taking in DFM/FCM matrices, which are generally sparse.

Since dfm objects are kept in sparse matrices (as far as i can tell, this is correct, right?), building a dense vector would imply a copy overhead, and since we are parallelizing on entries and not over the doc vectors, we are not currently taking advantage of any cache alignment that could be useful if the measures were vector operations instead of point-wise operations.

There is also the problem of semantics: armadillo is not as transparent as e.g. Eigen in its interface to sparse versus dense matrices, so using dense matrices, as far as I can tell, would require slightly different operations (though I may be wrong here).

I haven't yet looked into any optimization at all, but I could wipe up a version copying the data to dense vectors and profiling to see if copy+dense is faster than sparse.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/quanteda/quanteda/issues/1635, or mute the thread https://github.com/notifications/unsubscribe-auth/AC4WYH8c-Yzq4pHct8TclgMvJe8yuyrdks5vSlJrgaJpZM4baTyE .

-- entia non sunt multiplicanda praeter necessitatem

koheiw commented 5 years ago

Thanks @jtatria.

I think generalizing functions is very important to make them readable and reusable. It is correct that marginals are computed only when this line is executed? If so, I am totally fine.
```
VFunc vfunc = get_v_func( dfm, measure, correction );
```
I saw significant difference between sparse vs dense matrices when I was writing simil_mt.cpp, but it we should deiced by comparing different implementation. If dense is faster, we can even pass dense matrix from R to C++. How do you take benchmarks then? I have home-grown tools in dev.h to measure execution time, but you might know a better way.

jtatria commented 5 years ago

1.- Yes, the marginals are computed only once inside get_v_func and captured by reference by the lambda object.

2.- Right, in the case of linear measures like those in simil_mt, the performance bonus of dense vectors will be huge, because of locality (i.e. all vector values neatly aligned in contigous memory regions), since these functions require walking down the entire vectors. Pointwise functions like the ones in keyness do not benefit from this so much, since they access values one at a time and the access overhead of resolving sparse indices is optimized away by the compiler.

In general, I'd be more concerned about the costs of copying back and forth between dense/sparse representations, but I agree that we should profile extensively.

Now from a more general point of view, the distinction above between linear and pointwise functions is rather critical from an implementation point of view, since the results of linear functions will always be dense, while pointwise functions will be as sparse as their input. In my previous work on this, I basically implemented two different worker strategies, one for pointwise functions that I called "weights" using sparse inputs and results and one for similarity/distances that took in either sparse or dense matrices as inputs using a templated worker, and always produced dense output.

koheiw commented 4 years ago

I wonder what happen to the C++ code. Should I take over and finish the development?

kbenoit commented 4 years ago

Sure, feel free!

quanteda / quanteda.textstats

Discussion on C++ implementation of keyness #16