qri-io / qri

you're invited to a data party!
https://qri.io
GNU General Public License v3.0
1.1k stars 66 forks source link

feat(automation): `RunQueue` interface to sync & cancel runs #1966

Closed ramfox closed 2 years ago

ramfox commented 2 years ago

First pass at a basic RunQueue interface.

Only 1 queue for both dry runs and runs, no priority.

Once we have a better understanding of how we want to deligate in cloud & how we want the runs scoped (can we run one dry run and one run at a time? Can a user run mulitple runs at a time, or is it scoped one per user) we can expand how we see fit.

--

Also fixes flaky TestRunStoreEvents test!

Interestingly enough, there were two problems with the test both stemming from the same issue: since so much of our run process happens async, events are not guaranteed to emit in the same order at the same time.

In the test as it was, we were comparing the contents of the run store after each emitted transform event. This all happened in the correct order, because we were mocking the run so we could control when the events emitted, and we checked the run store after each event to ensure we were getting the expected run state. We mocked the event.NowFunc so we could accurately diff between the expected run states and the given run states. The issue is that this mocked runFunc was happening in an uncontrolled environment that ALSO used the event.NowFunc outside of what we were immediately testing. We tried to compensate for this by waiting until the last "expected" event occured, but again... events aren't guaranteed to happen synchronously, so this only worked some of the time, and got worse once the ETAutomationRunQueuePop event was added, since it was another event that happened async but close to when the ETTransformStart event occurs (the first event we are interested in seeing the effect off).

Ultimately, I decided to by-pass the whole event.NowFunc issue, by ignoring any time stamp or duration fields in the run state diff. I looked at our other run state tests, and these fields are tested extensively in the run package. Instead, the test now only tests what makes sense at the automationpackage integration level: that if we have a run store & we run a workflow, we receive the transform events during a run, and we are saving that information as expected to the store.

I also removed the "wait" that was in place, waiting for the ETAutomationWorkflowStart event to pass before starting to mock the time. I removed it both because it was no longer needed, but also because it seemed to be causing a lock in the event bus, our second flaky problem: We were publishing the ETAutomationWorkflowStart event very close to where we were subscribing to the same events, but also listening for the ETAutomationWorkflowStart event with our special test handler. I'm not exactly sure what was causing the lock up, but some interaction between these three things was causing a problem. Because we were no longer mocking time, we no longer needed to "wait" for anything before starting the mock, so I didn't look much further into the issue.

Arqu commented 2 years ago

👍 👍 Think this is just things shining through from us really moving into async territory. Bet this is not the last we see of these. Glad we're slowly unwrapping and allowing more free flow for the event bus and async execution.

We'll probably soon need a better mechanic for testing these events, maybe some generic event test function which acts as a collector and we can just specify if we want all/some events and in order or don't care.

Good stuff.