Some analysis tasks can be run in parallel. This could be done in `do_all_analyses` of common-script, but it requires knowledge about which tasks depend on each other. Parallel execution is already possible by starting multiple instances of the analysis with different tasks (e.g. one process running `validate,validate_sqlite` and another running `completeness,completeness_sqlite`), but dependencies among tasks are not checked.
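One way the missing dependency check could look is to group tasks into stages, where each stage contains only tasks whose prerequisites finished in an earlier stage. This is a minimal sketch; the dependency map and the idea that `*_sqlite` tasks depend on their base task are my assumptions, not the project's actual rules.

```java
import java.util.*;

// Hypothetical sketch: group analysis tasks into stages so that each stage
// only contains tasks whose dependencies ran in an earlier stage.
// The task names and the dependency map are illustrative, not the real ones.
public class TaskStages {
    public static List<List<String>> stages(Map<String, List<String>> deps) {
        List<List<String>> result = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> remaining = new HashSet<>(deps.keySet());
        while (!remaining.isEmpty()) {
            List<String> stage = new ArrayList<>();
            for (String task : remaining) {
                if (done.containsAll(deps.get(task))) stage.add(task);
            }
            if (stage.isEmpty()) throw new IllegalStateException("dependency cycle");
            Collections.sort(stage);       // deterministic order within a stage
            done.addAll(stage);
            remaining.removeAll(stage);
            result.add(stage);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("validate", List.of());
        deps.put("completeness", List.of());
        deps.put("validate_sqlite", List.of("validate"));
        deps.put("completeness_sqlite", List.of("completeness"));
        // Tasks within the same stage could be launched as parallel processes.
        System.out.println(stages(deps));
    }
}
```

Everything inside one stage could then be started as separate processes, exactly as in the manual `validate,validate_sqlite` / `completeness,completeness_sqlite` example above.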
Most individual analysis tasks can also be sped up by parallel programming, though this depends on the type of task. If the task involves parsing the whole set of input records, one thread should do the parsing and distribute subsets of records to worker threads.
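The "one parser thread, several workers" pattern could be sketched roughly as below. This is a toy illustration under stated assumptions: the record type, batch size, pool size, and the `analyse` stand-in are all made up for the example, not taken from the actual tasks.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the pattern: the main thread "parses" the input and hands
// fixed-size batches of records to a worker pool, then merges the partial
// results. Batch size, pool size and the analyse() stand-in are illustrative.
public class ParallelRecords {
    static final int BATCH_SIZE = 2;

    public static int process(List<String> records) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Integer>> futures = new ArrayList<>();
        List<String> batch = new ArrayList<>();
        for (String record : records) {               // parsing loop, single thread
            batch.add(record);
            if (batch.size() == BATCH_SIZE) {
                List<String> work = batch;
                futures.add(pool.submit(() -> analyse(work)));  // hand off a batch
                batch = new ArrayList<>();
            }
        }
        if (!batch.isEmpty()) {
            List<String> work = batch;
            futures.add(pool.submit(() -> analyse(work)));      // last partial batch
        }
        int total = 0;
        for (Future<Integer> f : futures) total += f.get();     // merge partial results
        pool.shutdown();
        return total;
    }

    // Stand-in for a real analysis task: counts characters per batch.
    static int analyse(List<String> batch) {
        return batch.stream().mapToInt(String::length).sum();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(process(List.of("rec1", "rec02", "rec3")));
    }
}
```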
The tasks implemented in Java can be parallelized. Some of them were parallelized at one point and supported the Apache Spark API, but that support has not been updated for a while, and lots of new functionality has been added since then. Moreover, at the time of the Spark integration the process was simply a map of one input record to one output record. Now each task collects information about larger sets of records; parallelizing that is not as simple and requires a larger amount of work.
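The difference between the two situations can be illustrated without Spark, using plain Java parallel streams (the actual task code is not shown here; both methods are invented for the example). A one-record-to-one-record map parallelizes trivially, while a statistic over a set of records additionally needs partial results that can be merged.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative contrast, not the real task code: a one-record-to-one-record
// map parallelizes trivially, while collecting statistics over sets of
// records also needs an associative combine/merge step.
public class MapVsAggregate {
    // Easy case: each output value depends on exactly one input record.
    public static List<Integer> recordLengths(List<String> records) {
        return records.parallelStream()
                      .map(String::length)
                      .collect(Collectors.toList());
    }

    // Harder case: a statistic over all records; each worker produces a
    // partial result that must be merged (here: per-field value counts).
    public static Map<String, Long> fieldCounts(List<String> fields) {
        return fields.parallelStream()
                     .collect(Collectors.groupingByConcurrent(f -> f,
                                                              Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(recordLengths(List.of("aa", "bbb")));
        System.out.println(fieldCounts(List.of("245", "245", "100")));
    }
}
```

The same split exists in Spark: a plain `map` scales for free, whereas the aggregation case needs the reduce/combine side designed explicitly, which is where the larger amount of work comes from.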
Some post-processing work is not written in Java but in R, PHP or SQL. Not that these would be problematic by themselves, but they might have to be rewritten in a Spark-supported language (R is supported, but in a limited way).
All in all I think this is not a simple ticket; it should rather be a "milestone" or "epic", and should have several child tickets.