marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

Non-deterministic end-to-end test failures. #113

Open JasonMoho opened 1 year ago

JasonMoho commented 1 year ago

Describe the bug
The end-to-end tests for training and evaluation occasionally fail or time out, especially when running on GitHub Actions. It's difficult to reproduce this behavior locally. The failures seem to occur most often in tests that use async processing together with the buffer. This leads me to believe that there is a concurrency control bug (e.g. a deadlock).

The workaround for this bug is to just re-run the tests.

To Reproduce
The failure can occasionally be reproduced by running the GitHub Actions workflow, e.g. https://github.com/marius-team/marius/actions/runs/3056399004/jobs/4930521831

I have not observed async processing bugs when running on large-scale datasets, only on the tiny-scale datasets used for testing.

The main challenge will be isolating and identifying the issue. My approach will be to run a highly asynchronous configuration on a small dataset, which will hopefully recreate the conditions needed for the concurrency bug to arise.
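
As a rough sketch of how that repeated-run approach could be automated, the helper below runs one test command many times with a hard timeout and tallies passes, failures, and hangs. The function name `stress_run` and its parameters are hypothetical and not part of Marius; the command to run (for example, a single end-to-end test using a highly asynchronous configuration) is left to the caller.

```python
import subprocess
from collections import Counter


def stress_run(cmd, runs=50, timeout_s=60.0):
    """Run `cmd` (a list of argv strings) `runs` times and tally the outcomes.

    Hypothetical helper: a run that exceeds `timeout_s` is counted as a
    "timeout", which is the symptom a deadlock would produce.
    """
    outcomes = Counter()
    for _ in range(runs):
        try:
            result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
            outcomes["pass" if result.returncode == 0 else "fail"] += 1
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process before raising this.
            outcomes["timeout"] += 1
    return outcomes


# Example call (the command is a placeholder for whatever runs one test):
# print(dict(stress_run(["python", "-m", "pytest", "path/to/one_test.py"], runs=100)))
```

A tally with a nonzero "timeout" count on the tiny dataset, but not on a larger one, would support the deadlock hypothesis. It would not locate the bug by itself.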

Environment
Occurs on both Linux and macOS.