simple-crypto / SCALib

Side-Channel Analysis Library
GNU Affero General Public License v3.0

Parallelize belief propagation #145

Open cassiersg opened 9 months ago

cassiersg commented 9 months ago

We should exploit parallelism in belief propagation.

(Copied from a discussion on matrix.)

Trying to parallelize a single belief propagation (i.e., multiple calls on the same BPState) with ContextExecutor will not work (I think it will end up executing serially due to a mutex). BP parallelization is a tricky topic to implement well (we've had it in the past, with mixed results: it was performing very well in some cases, but badly in others - depending on parameters such as NC, nexec and the graph size), hence I didn't bother with it when I re-implemented the whole thing, but it would be nice to add it back. As a conservative starting point, we could try to parallelize factors and variables (using rayon's parallel iterators in propagate_loopy_step and propagate_all_vars).
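To make the "parallelize factors and variables" idea concrete, here is a minimal, hypothetical sketch of outer parallelism over independent per-variable updates. It uses `std::thread::scope` so it is self-contained; the actual proposal is to use rayon's parallel iterators inside `propagate_loopy_step` and `propagate_all_vars`. `Var`, `update_var`, and `propagate_all_vars_parallel` are illustrative stand-ins, not SCALib's API.

```rust
use std::thread;

#[derive(Debug)]
struct Var {
    belief: Vec<f64>, // one distribution per variable (length = nc)
}

// Toy per-variable update: normalize the distribution in place.
fn update_var(v: &mut Var) {
    let s: f64 = v.belief.iter().sum();
    for p in &mut v.belief {
        *p /= s;
    }
}

// Outer parallelism: each thread owns a disjoint chunk of variables,
// so no locking is needed. With rayon this whole function would be
// roughly `vars.par_iter_mut().for_each(update_var)`.
fn propagate_all_vars_parallel(vars: &mut [Var], n_threads: usize) {
    let chunk = (vars.len() + n_threads - 1) / n_threads;
    thread::scope(|s| {
        for chunk_vars in vars.chunks_mut(chunk) {
            s.spawn(move || {
                for v in chunk_vars {
                    update_var(v);
                }
            });
        }
    });
}

fn main() {
    let mut vars: Vec<Var> = (1..=4)
        .map(|i| Var { belief: vec![i as f64; 4] })
        .collect();
    propagate_all_vars_parallel(&mut vars, 2);
    // After normalization every distribution is uniform over 4 values.
    assert!(vars
        .iter()
        .all(|v| v.belief.iter().all(|&p| (p - 0.25).abs() < 1e-12)));
    println!("ok");
}
```

The key property that makes this safe is that each variable's update only writes to its own state, so `chunks_mut` hands out non-overlapping mutable slices and the borrow checker is satisfied without a mutex.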

F-Lehmann commented 9 months ago

~An approach that works in my preliminary tests is to use [ndarray's parallel iterators](https://docs.rs/ndarray/latest/ndarray/parallel/index.html) for the lower-level calculations. For example, in https://github.com/simple-crypto/SCALib/blob/0cd443027eb5809f3978bdedc0641f0015a41c47/src/scalib_ext/scalib/src/sasca/bp_compute.rs#L150 the azip!(...).for_each() can become .par_for_each() with no or minimal changes to satisfy the borrow checker. The same works for .mapv_inplace() -> .par_mapv_inplace() in some other places.~

~These changes are definitely not the ideal solution: they only make parts of the BP parallel and introduce more overhead compared to having parallelism on the outer iterators in propagate_loopy_step and propagate_all_vars as suggested. But it's an easy place to start while ensuring correctness.~

~Maybe someone more knowledgeable in the math can find a way to parallelize those parts as well.~

Forget all of this, the overhead is insane

cassiersg commented 9 months ago

Your proposal is useful when the distributions are very large (i.e., large NC and nexec), but has a large overhead otherwise. Further, while it works well for distribution multiplications, the bottleneck is typically in other operations (distribution multiplications are O(nexec*nc), while many factors are O(nexec*nc*log(nc)) or more). I think that within-operation parallelism can be useful (or even the best parallelization strategy in some cases), but it can lead to huge overheads for some parameter sizes, so it is quite tricky to tune (empirical thresholds everywhere...).
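To illustrate why the factors, rather than the multiplications, tend to be the bottleneck, here is a toy cost model (not SCALib code, and ignoring constant factors): a distribution multiplication costs roughly nexec*nc operations, while an FFT-style factor costs roughly nexec*nc*log2(nc), so the gap grows with nc.

```rust
// Illustrative operation counts only; constants and real kernels are ignored.
fn mult_cost(nexec: u64, nc: u64) -> u64 {
    nexec * nc // O(nexec * nc): elementwise distribution multiplication
}

fn fft_factor_cost(nexec: u64, nc: u64) -> u64 {
    nexec * nc * nc.ilog2() as u64 // O(nexec * nc * log2(nc))
}

fn main() {
    let nexec = 1000;
    for &nc in &[16u64, 256, 4096] {
        println!(
            "nc = {:4}: mult ~{:>10}, factor ~{:>11} ({}x)",
            nc,
            mult_cost(nexec, nc),
            fft_factor_cost(nexec, nc),
            fft_factor_cost(nexec, nc) / mult_cost(nexec, nc)
        );
    }
}
```

For nc = 256 the factor is already ~8x the multiplication in this model, which is why speeding up only the multiplications (as in the struck-through proposal above) leaves the dominant cost untouched.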

Outer parallelism will probably be easier, because it works with larger chunks of work (hence lower overheads) and tends not to be too bad for cache efficiency. It's still not trivial to avoid performance regressions (in particular for small nc and nexec).