Some analysis tasks can be run in parallel. This could be done in `do_all_analyses` of common-script, but it requires knowledge about which tasks depend on each other. Parallel execution is already possible by starting multiple instances of the analysis with different tasks (e.g. one process running `validate,validate_sqlite` and another running `completeness,completeness_sqlite`), but dependencies among tasks are not checked.
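One way the missing dependency check could look is to group tasks into stages, where each stage contains only tasks whose prerequisites finished in an earlier stage. This is a minimal sketch; the dependency map and the idea that `*_sqlite` tasks depend on their base task are my assumptions, not the project's actual rules.

```java
import java.util.*;

// Hypothetical sketch: group analysis tasks into stages so that each stage
// only contains tasks whose dependencies ran in an earlier stage.
// The task names and the dependency map are illustrative, not the real ones.
public class TaskStages {
    public static List<List<String>> stages(Map<String, List<String>> deps) {
        List<List<String>> result = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> remaining = new HashSet<>(deps.keySet());
        while (!remaining.isEmpty()) {
            List<String> stage = new ArrayList<>();
            for (String task : remaining) {
                if (done.containsAll(deps.get(task))) stage.add(task);
            }
            if (stage.isEmpty()) throw new IllegalStateException("dependency cycle");
            Collections.sort(stage);       // deterministic order within a stage
            done.addAll(stage);
            remaining.removeAll(stage);
            result.add(stage);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("validate", List.of());
        deps.put("completeness", List.of());
        deps.put("validate_sqlite", List.of("validate"));
        deps.put("completeness_sqlite", List.of("completeness"));
        // Tasks within the same stage could be launched as parallel processes.
        System.out.println(stages(deps));
    }
}
```

Everything inside one stage could then be started as separate processes, exactly as in the manual `validate,validate_sqlite` / `completeness,completeness_sqlite` example above.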
Most individual analysis tasks can also be sped up by parallel programming, though this depends on the type of task. If the task involves parsing the whole set of input records, one thread should do the parsing and distribute subsets of records to worker threads.
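The "one parser thread, several workers" pattern could be sketched roughly as below. This is a toy illustration under stated assumptions: the record type, batch size, pool size, and the `analyse` stand-in are all made up for the example, not taken from the actual tasks.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the pattern: the main thread "parses" the input and hands
// fixed-size batches of records to a worker pool, then merges the partial
// results. Batch size, pool size and the analyse() stand-in are illustrative.
public class ParallelRecords {
    static final int BATCH_SIZE = 2;

    public static int process(List<String> records) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Integer>> futures = new ArrayList<>();
        List<String> batch = new ArrayList<>();
        for (String record : records) {               // parsing loop, single thread
            batch.add(record);
            if (batch.size() == BATCH_SIZE) {
                List<String> work = batch;
                futures.add(pool.submit(() -> analyse(work)));  // hand off a batch
                batch = new ArrayList<>();
            }
        }
        if (!batch.isEmpty()) {
            List<String> work = batch;
            futures.add(pool.submit(() -> analyse(work)));      // last partial batch
        }
        int total = 0;
        for (Future<Integer> f : futures) total += f.get();     // merge partial results
        pool.shutdown();
        return total;
    }

    // Stand-in for a real analysis task: counts characters per batch.
    static int analyse(List<String> batch) {
        return batch.stream().mapToInt(String::length).sum();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(process(List.of("rec1", "rec02", "rec3")));
    }
}
```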
The tasks implemented in Java can be parallelized. Some of them were parallelized at one point and supported the Apache Spark API, but that support has not been updated for a while, and lots of new functionality has been added since then. Moreover, at the time of the Spark integration the process was simply a map of one input record to one output record. Now each task collects information about larger sets of records; parallelizing that is not as simple and requires a larger amount of work.
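The difference between the two situations can be illustrated without Spark, using plain Java parallel streams (the actual task code is not shown here; both methods are invented for the example). A one-record-to-one-record map parallelizes trivially, while a statistic over a set of records additionally needs partial results that can be merged.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative contrast, not the real task code: a one-record-to-one-record
// map parallelizes trivially, while collecting statistics over sets of
// records also needs an associative combine/merge step.
public class MapVsAggregate {
    // Easy case: each output value depends on exactly one input record.
    public static List<Integer> recordLengths(List<String> records) {
        return records.parallelStream()
                      .map(String::length)
                      .collect(Collectors.toList());
    }

    // Harder case: a statistic over all records; each worker produces a
    // partial result that must be merged (here: per-field value counts).
    public static Map<String, Long> fieldCounts(List<String> fields) {
        return fields.parallelStream()
                     .collect(Collectors.groupingByConcurrent(f -> f,
                                                              Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(recordLengths(List.of("aa", "bbb")));
        System.out.println(fieldCounts(List.of("245", "245", "100")));
    }
}
```

The same split exists in Spark: a plain `map` scales for free, whereas the aggregation case needs the reduce/combine side designed explicitly, which is where the larger amount of work comes from.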
Some post-processing work is not written in Java but in R, PHP or SQL. Not that these would be problematic by themselves, but they might have to be rewritten in a Spark-supported language (R is supported, but in a limited way).
All in all I think this is not a simple ticket; it should rather be a "milestone" or "epic", and should have several child tickets.