qcri / Arabesque

Scalable Graph Mining
http://arabesque.io
Apache License 2.0
61 stars 33 forks source link

Possible deadlock with ODAGs and natural termination of execution. #4

Closed AlexJF closed 8 years ago

AlexJF commented 8 years ago

Currently, when a partition does not process any embeddings, it sets itself as halted.

Halted partitions resume on message receive. This works as expected when using embedding list communication because messages are targeted to specific partitions and thus wake them up. With ODAGs, however, on an expansion phase, messages are sent at the worker level and thus fail to wake up the halted partitions.

But the ODAG reading code assumes that all partitions are running and initialized a barrier with the total number of local partitions.

For the deadlock to activate, some partitions have to halt while others continue computing. This usually doesn't happen if we prune execution at a certain depth (all workers will terminate simultaneously) and if the number of embeddings we are handling is consistently bigger than the number of partitions (makes it very improbable that some partitions be empty while others not).