pyiron / pyiron_workflow

Graph-and-node based workflows
BSD 3-Clause "New" or "Revised" License
11 stars 1 forks source link

Running with an executor _sometimes_ hits a bug #48

Open liamhuber opened 11 months ago

liamhuber commented 11 months ago

It's quite infrequent (< 1 in 10), but sometimes this test fails in the following way:

======================================================================
FAIL: test_with_executor (test_macro.TestMacro.test_with_executor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\pyiron_workflow\pyiron_workflow\tests\unit\test_macro.py", line 522, in test_with_executor
    self.assertIs(
AssertionError: <pyiron_workflow.macro.Macro object at 0x0000021B611DB610> is not None : Returned nodes should get the macro as their parent

----------------------------------------------------------------------

I.e. when a composite is run on an executor and copies of its children are returned as the result, these new children are somehow failing to get the parent-child relationship set up reflexively (which is the expected behaviour). I vaguely suspect it has something to do with Future.result() returning new instances each call, or some sort of delayed execution that I am wrongly assuming is happening synchronously, or similar, but working with executors is pretty new to me and the error is not consistently reproducible, so I haven't managed to track it down yet.

liamhuber commented 11 months ago

Happened again today. Was on the Windows 3.11 CI (I want to start noting this here in case there is some pattern of where it comes up, e.g. on windows)

liamhuber commented 10 months ago

Happened again on the windows 3.11 CI

1/10 seems like an overestimate of how often it crops up, and so far really seems like a windows thing

liamhuber commented 10 months ago

Happened on Ubuntu 3.11 today, so I guess it's not just windows.

liamhuber commented 10 months ago

Happened on Ubuntu 3.11 today.

I can't remember the last time it happened on my local machine, so maybe it's exclusively an issue in the CI environment?

liamhuber commented 10 months ago

Seems to be happening consistency today in #70, so maybe there's a chance to nail this down

liamhuber commented 10 months ago

Ok, I added a little sleep and then suddenly the CI passed, so maybe this is fixed after all? Could it be simply that the callback function is executing to slowly on the CI (but fast enough on my local machine) that the test comes up before the state is ready for it?

returned_nodes = result.result()  # Wait for the process to finish
        from time import sleep  # The "fixing" change (so far)
        sleep(1)  # The "fixing" change (so far)
        self.assertIsNot(
            original_one,
            returned_nodes.one,
            msg="Executing in a parallel process should be returning new instances"
        )
        self.assertIs(
            macro,
            macro.nodes.one.parent,
            msg="Returned nodes should get the macro as their parent"
            # Once upon a time there was some evidence that this test was failing
            # stochastically, but I just ran the whole test suite 6 times and this test
            # 8 times and it always passed fine, so maybe the issue is resolved...
        )
liamhuber commented 9 months ago

Finally happened again today (windows), but it's been over a month, so I guess the sleep call is doing the trick.

liamhuber commented 8 months ago

Again on windows today. Pretty rare, but horrible that it exists at all. Adding the sleep when collecting results really seems to have helped, but it's already a full second of extra waiting -- more would be a big pain, but is this going to have horrible scaling when the amount of data on the nodes increases? I still find this bug overall very disconcerting.

liamhuber commented 6 months ago

It finally happened on windows again! #241. It's been a while.

liamhuber commented 6 months ago

Again today on windows on #773. Maybe the months long stretch was just some kind of good luck?

liamhuber commented 3 months ago

The interval is nice and long, but it is still happening. Windows running on merge to main. https://github.com/pyiron/pyiron_workflow/actions/runs/9420579496/job/25952940926

liamhuber commented 3 weeks ago

Again after two months. Must be a thread safety thing?