saurabhnanda / odd-jobs

Haskell job queue with admin UI and loads of other features.
https://www.haskelltutorials.com/odd-jobs/
BSD 3-Clause "New" or "Revised" License

Expose runner env and pollRunJob for deterministic tests #92

Closed. jappeace closed this 1 year ago

jappeace commented 1 year ago

This allows you to re-use the default HasJobRunner instance, making it easier to, for example, run the job poller function, allowing you to get deterministic tests.

The resource tests in the test suite are an example of more deterministic tests. Rather than using timeouts, we can wait for async handles to complete. This tends to be faster and more reliable. Furthermore, we can make more fine-grained assertions about timing, because we can explicitly state how many job runs should happen in between.
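To illustrate, something along these lines (a sketch only: enqueueTestJob, pollOnce, and assertExactlyOneJobRan are hypothetical stand-ins for whatever the real test suite uses, not functions added by this PR):

import Control.Concurrent.Async (Async, wait)
import Data.Foldable (traverse_)

-- Hypothetical helpers so the sketch stands alone; a real test would enqueue
-- a job and inspect its result in the database.
enqueueTestJob :: IO ()
enqueueTestJob = pure ()

pollOnce :: IO (Maybe (Async ()))   -- one iteration of the job poller
pollOnce = pure Nothing

assertExactlyOneJobRan :: IO ()
assertExactlyOneJobRan = pure ()

-- No sleeps: run the poller exactly once, wait on the handle of whatever job
-- it picked up, and only then assert. Because we control how many poll
-- iterations happen, we know exactly how many job runs occurred in between.
deterministicSpec :: IO ()
deterministicSpec = do
  enqueueTestJob
  mJob <- pollOnce
  traverse_ wait mJob
  assertExactlyOneJobRan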

saurabhnanda commented 1 year ago

allowing you to get deterministic tests.

@jappeace could you please give an example of this? Should we be changing anything in the central test-suite as well?

jappeace commented 1 year ago

Hi, this coincided with another change, which split off the pollAndRunJob function from the main forever loop: https://github.com/SupercedeTech/odd-jobs/commit/8d4b0121724a3659f8238f8bd61ed36469392f46#diff-66d90878646443a54362b468cf177416af82bc2ef298f427625af0e63fbdb2abR611

I've not upstreamed that change yet since our fork is becoming rather different (due to the resource changes not being merged yet), but I can add that here as well.

What we did in our internal test suite was get rid of odd-jobs as a background process and call pollAndRunJob directly on the same thread. This allowed us to get rid of a bunch of timeouts, since everything in the test was happening on the same thread, which in turn made the test suite more reliable (and also faster). I can make these changes to the odd-jobs test suite as well if you're interested; I think there are some timeouts there at least.

This reliability issue occurs more often if you start running all tests in parallel and if you have many of them.
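Roughly, the shape of that change looks like this (a sketch only: startJobRunner is odd-jobs' usual blocking entry point, and the import location may differ between versions; pollAndRunJobOnce, enqueueTestJob, and assertJobRan are stand-ins for our fork's single-iteration function and the test fixtures):

import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (withAsync)
import OddJobs.Job (Config, startJobRunner)

-- Hypothetical stand-ins so the sketch stands alone.
pollAndRunJobOnce :: Config -> IO ()   -- stands in for the fork's single-iteration poller
pollAndRunJobOnce _ = pure ()

enqueueTestJob :: IO ()
enqueueTestJob = pure ()

assertJobRan :: IO ()
assertJobRan = pure ()

-- Before: the runner lives on a background thread, so the test sleeps and
-- hopes the poller has picked the job up by the time it asserts.
testWithBackgroundRunner :: Config -> IO ()
testWithBackgroundRunner cfg =
  withAsync (startJobRunner cfg) $ \_runner -> do
    enqueueTestJob
    threadDelay 5000000
    assertJobRan

-- After: no background thread; the test drives one poll iteration on the
-- current thread, so the job has finished before the assertion runs.
testSameThread :: Config -> IO ()
testSameThread cfg = do
  enqueueTestJob
  pollAndRunJobOnce cfg
  assertJobRan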

ivb-supercede commented 1 year ago

For our internal changes this made a lot of sense because it was simple, but for the public Odd Jobs API I think we should prefer to keep things clean if possible, since otherwise internal API changes will mean version bumps. What do you think if, instead of exposing RunnerEnv, we expose a version of pollRunJob which just takes a Config, and constructs the RunnerEnv internally? e.g.

-- Sketch: construct the RunnerEnv internally so callers only need a Config.
publicPollRunJob :: Config -> IO (Maybe (Async ()))
publicPollRunJob config = do
  jobThreadsRef <- newIORef []
  runReaderT pollRunJob (RunnerEnv config jobThreadsRef)
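A test could then drive the queue without touching RunnerEnv at all, something like the sketch below (enqueueTestJob and assertJobRan being whatever helpers the test suite already has; wait is from Control.Concurrent.Async and traverse_ from Data.Foldable):

-- Hypothetical test built on the proposed publicPollRunJob; the helpers are
-- passed in so the sketch does not assume anything about the test suite.
deterministicTest :: Config -> IO () -> IO () -> IO ()
deterministicTest config enqueueTestJob assertJobRan = do
  enqueueTestJob
  mJob <- publicPollRunJob config   -- one poll iteration on the current thread
  traverse_ wait mJob               -- block until the picked-up job (if any) finishes
  assertJobRan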

saurabhnanda commented 1 year ago

I've not upstreamed that change yet since our fork is becoming rather different (due to the resource changes not being merged yet), but I can add that here as well.

Shall we spend some time on that PR and get it merged? It's an important feature and it would be good to get it upstreamed.

Which in turn made the test suite more reliable (and also faster). I can make these changes to the odd-jobs test suite as well if you're interested; I think there are some timeouts there at least. This reliability issue occurs more often if you start running all tests in parallel and if you have many of them.

I felt that it would be better if the test-suite were to run things in parallel because that's how odd-jobs will be used in real-life. Is there any other way to catch concurrency issues during testing?

for the public Odd Jobs API I think we should prefer to keep things clean if possible, since otherwise internal API changes will mean version bumps. What do you think if, instead of exposing RunnerEnv, we expose a version of pollRunJob which just takes a Config, and constructs the RunnerEnv internally? e.g.

In the initial release I had added HasJobRunner for unspecified purposes of "extensibility", and such-like. The more time I spend with Haskell code, the more I feel that unnecessary custom monads are an overhead which makes code composability problematic. Now that multiple folks are using odd-jobs, should we answer the question of whether HasJobRunner is even required or not (perhaps in a separate thread)?

With respect to the current question at hand, do we basically have two options to deal with this:

I feel in both cases we run the risk of being forced to bump up the version numbers if:

What does SemVer tell us with respect to Internal modules? I see those being used in other Haskell libs as well.

ivb-supercede commented 1 year ago

I agree that in general, it's good to have tests that actually check concurrent behaviours. I don't know that this new functionality would be useful for odd-jobs' test suite.

What does SemVer tell us with respect to Internal modules? I see those being used in other Haskell libs as well.

I don't think the PVP explicitly talks about Internal modules. In principle, they are part of the public API - but most packages are happy to break versioning if they only make "internal" changes. I don't think Hackage ever rejects a package under such a circumstance.
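The convention usually looks something like this sketch (generic names, just to illustrate; this is not odd-jobs' actual module layout): the module is exposed, but its haddocks say it is exempt from the version policy.

-- | __Internal.__ Exposed so that test suites and advanced users can reach
-- the runner internals, but explicitly /not/ covered by the PVP: anything in
-- this module may change without a major version bump.
module MyQueue.Internal
  ( RunnerEnv(..)
  , pollOnce
  ) where

import Control.Concurrent.Async (Async)
import Data.IORef (IORef)

data Config = Config   -- placeholder type for the sketch

data RunnerEnv = RunnerEnv
  { envConfig        :: Config
  , envJobThreadsRef :: IORef [Async ()]
  }

pollOnce :: RunnerEnv -> IO (Maybe (Async ()))
pollOnce _ = pure Nothing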

RunnerEnv ends up getting a new field that cannot be internally constructed by publicPollRunJob (basically, the signature of publicPollRunJob changes)

But note that the existence of the CLI ability to start a RunnerEnv means that this will happen anyway. Either:

jappeace commented 1 year ago

After discovering some issues with the resource system, I got re-motivated to try and upstream this. Being able to run a single blocking job allows me to write a test along the lines of:

     it "prevents multiple pack archive jobs running concurrently" $ do
       rpaPackArchiveId <- genKey PackArchiveKey
       _ <- runDB $ W.queueTask $ W.RebuildPackArchive $ W.MkRebuildPackArchiveV2{..}
       waitforever <- newEmptyMVar
       _msync <- runHandler $ runService $ Job.runSingleJobFromQueueWithHandler (const $ takeMVar waitforever)

       _ <- runDB $ W.queueTask $ W.RebuildPackArchive $ W.MkRebuildPackArchiveV2{..}

       msync2 <- runHandler $ runService $ Job.runSingleJobFromQueueWithHandler (const $ takeMVar waitforever)

       -- the second one shouldn't be running now, since the first one is blocked forever
       liftIO $ (() <$ msync2) `shouldBe` Nothing

This is far simpler than dealing with the actual concurrency system, and it is deterministic. I find this simplicity useful in my tests.

I've added pollRunJob to this MR as well, which makes it useful; now you can run it with something like:

runSingleJobFromQueueWithHandler = ...
  -- construct the RunnerEnv by hand; `config` and the job-threads ref `r`
  -- come from the elided surrounding code
  let monitorEnv = RunnerEnv
                   { envConfig = config
                   , envJobThreadsRef = r
                   }
  $(logDebug) "running poller"
  liftIO $ runReaderT (pollRunJob "runSingleJobFromQueue" $ readResourceConfig config) monitorEnv

jappeace commented 1 year ago

I've been trying to figure out what's wrong with these changes, but considering the master test suite is broken, I'm sorta grasping at straws. I think we should rewrite the resource tests at least; they're not deterministic. Running the resource tests on master 20 times gives me a bunch of random failures:

16 out of 210 tests failed (221.43s)

I used:

  testGroup "concurrency control by many" (testResourceLimitedScheduling appPool jobPool <$ [0..20])
jappeace commented 1 year ago

@saurabhnanda

Shall we spend some time on that PR and get it merged? It's an important feature and it would be good to get it upstreamed.

It's merged.

I felt that it would be better if the test-suite were to run things in parallel because that's how odd-jobs will be used in real-life. Is there any other way to catch concurrency issues during testing?

We're testing the same things; it's just that in the approach where we have a background worker, which may or may not work as expected /in time/, you get spurious failures.

Exposing an async handle and using MVars to block will ensure these tests are reliable and only fail if there are bugs.

... I've got the test suite failing right now on this final test, which works locally; is my implementation wrong, or is it just randomly failing?

Now that multiple folks are using odd-jobs, should we answer the question of whether HasJobRunner is even required or not (perhaps in a separate thread)?

It's not necessary, but I prefer to stick to the initial design rather than keep on changing it.

I feel in both cases we run the risk of being forced to bump up the version numbers... SemVer

Yes, we'd risk more major releases, but that requires us to make actual releases. odd-jobs has such a rare release cycle that I don't see any issue with major breaking changes happening in the meantime, because it takes years between releases.

(I'm once again volunteering to help with this and get a more frequent release cycle, but you have to actually make me a maintainer for that!)