Add support for running tasks concurrently

ryangribble commented 8 years ago

Fixes #62

As discussed in #62 the ScienceAsync method can now specify concurrency settings (how many tasks to run at once). Internally I am splitting the orderedBehaviors into smaller sublists (of ConcurrentTasks size), running the tasks in the sublist in parallel and collecting the observations.

I also added a test that asserts the tasks are in fact running concurrently by running 4 tasks with concurrent settings of 1,2 and 4 and ensuring the time taken to execute the ScienceAsync is as expected (with a small amount of headroom for plumbing code).

Possible enhancements could be an option to run ALL tasks concurrently rather than specifying the number. Also I could easily add support for the task concurrency on the regular Science method too, rather than just ScienceAsync?

As an aside I also found issues with running the existing tests due to some of them not specifying their experimentName in the TestHelper.Results() call. I tweaked the TestHelper to force a experimentName to be passed in rather than relying on the calling code to use a lambda, which seems better than needing to remember :grinning:

Even with that I do sometimes find tests will still fail randomly, I suspect the static `Scientist.ResultPublisher is at fault and tests executing simultaneously are sometimes tripping over eachother on this?

ryangribble commented 8 years ago

Whoops, I didnt realise a couple of other PRs got merged and I had conflicts sorry!

Ive updated to latest master and refactored my changes to suit.

Also FYI I am now running a build of my branch "in the wild", achieving the goal set out in my initial issue #62 (running tasks concurrently so users do not see a changein response time) :+1:

davezych commented 8 years ago

This looks good! Thanks! :star2: (and that's awesome you're already making use of it!)

The only thing that worries me is the test, most specifically the headroom. I don't like tests that have a chance of not being consistent - in the event we're running tests while cpu resources are being used by another process there's a chance we can fall outside of the allowed range and the test will fail. If we increase the range too high, that defeats the purpose of the test because we won't know whether or not they actually ran concurrently. The likelyhood of this happening is probably pretty low, but I still don't necessarily like it.

Can you think of another way of testing this? Perhaps keeping track of which tasks have started and ensuring they all have started before one finishes? (Off the top of my head I'm not sure that'd be any better... but it's an idea)

davezych commented 8 years ago

Even with that I do sometimes find tests will still fail randomly, I suspect the static Scientist.ResultPublisher is at fault and tests executing simultaneously are sometimes tripping over eachother on this?

You're probably right. That's something I've been meaning at looking into and fixing. (hint hint if you want to, go ahead :smile: )

ryangribble commented 8 years ago

Can you think of another way of testing this? Perhaps keeping track of which tasks have started and ensuring they all have started before one finishes? (Off the top of my head I'm not sure that'd be any better... but it's an idea)

Yeah it's pretty tough as any tests are going to be temporal in nature and I guess always susceptible to processing delays like you mention, but do we really need the tests to handle "drastic" CPU load conditions where it isnt able to execute 4 tasks of 1 second sleeps each, in less than 4 seconds plus 50ms headroom? The headroom could be bumped up, even to 500ms perhaps...

I have one idea that I'll try now, that may be more satisfactory and a bit less susceptible to overall processing delays. If I make the "long waited task" being used in the tests grab the current time stamp before it sleeps, then return that timestamp as the task result (instead of the int im returning now), then I can assert that the time difference between each "batch" of tasks starting is at least the duration of the task. eg for 4 tasks with concurrency of 2 we are expecting 2 tasks to be kicked off, then another 2 tasks and we would assert that the "startTime" of tasks 3 and 4 was at least 1 second after tasks 1 and 2 started. This means it doesnt matter how long the tasks actually take and also restricts the margin of error to the time between batches of tasks rather than overall time of (up to 4) tasks running one after another...

ryangribble commented 8 years ago

ive rewritten the test but its still not really that elegant. If you have other ideas let me know but either the 1st way or this way, are the best I can come up with :grinning:

ryangribble commented 8 years ago

Any feedback on this @davezych @Haacked ? Im running my stuff on a locally built/privately hosted nuget feed, but would be nice to move over to the "official" library... but I need my concurrent task change to be merged first obviously :grinning:

davezych commented 8 years ago

This looks better although I have not had any time (and probably won't) to pull it and test it. I'll have to defer to @Haacked for his comments, and since he's the only one that can approve it anyway. :laughing:

haacked commented 8 years ago

Sorry, been real busy. I'll take a look soon.

haacked commented 8 years ago

So this PR doesn't build for me. It just hangs. There's also some error messages even before I build.

Severity Code Description Project File Line Suppression State Error NU1008 "netstandard1.1" is an unsupported framework. Scientist C:\dev\play\Scientist.net\src\Scientist\project.json 6
Error NU1006 Dependencies in project.json were modified. Please run "dnu restore" to generate a new lock file. Scientist C:\dev\play\Scientist.net\src\Scientist\project.lock.json 1
Error '*' is not a valid version string. Scientist.Test C:\dev\play\Scientist.net\test\Scientist.Test\project.json 1

haacked commented 8 years ago

I was thinking that in order for the experiments to be run concurrently, they must be on different threads. So rather than testing the timing component, why not record the ThreadId for each task and just verify that they're all different?

One key thing you have to be sure of is that no task completes before all of them are started. Otherwise that thread could be returned to the threadpool and reused.

Right now, this PR tries to tackle that with a long delay (1000ms). Rather than do that, let's do something deterministic. You could have each task wait on a CountDownEvent set to the number of concurrent tasks we plan to run. That way, we guarantee that none of the tasks complete until they've all been started. And we won't need to rely on timing issues.

joncloud commented 8 years ago

@Haacked about building:

Are you able to build using build.cmd? If so, then it looks like you might need to install the Visual Studio official MSI Installer (direct link) to build inside of Visual Studio.

haacked commented 8 years ago

Thanks @joncloud! That fixed it.

ryangribble commented 8 years ago

Right now, this PR tries to tackle that with a long delay (1000ms). Rather than do that, let's do something deterministic. You could have each task wait on a CountDownEvent set to the number of concurrent tasks we plan to run. That way, we guarantee that none of the tasks complete until they've all been started. And we won't need to rely on timing issues.

Hey @Haacked. Are you saying to use the CountdownEvent just in the test code, or are you actually suggesting to change the internals of ExperimentInstance to do the CountdownTask and signalling etc in there?

I tried to rewrite the test using CountdownTask but I got stuck trying to figure out where to .Reset() the counter.

This is what I had now:

[Theory,
    InlineData(1),
    InlineData(2),
    InlineData(4)]
public async Task RunsTasksConcurrently(int concurrentTasks)
{
    // Control + 3 experiments
    var totalTasks = 1 + 3;

    // Use a CountdownEvent to ensure tasks don't finish before all tasks in that batch have started
    var countdown = new CountdownEvent(concurrentTasks);

    // Our long running task
    var task = new Func<Task<int>>(() => 
    {
        return Task.Run(() =>
        {
            // Decrement the counter
            countdown.Signal();

            // Wait till all tasks for this batch have started
            countdown.Wait();

            // Not safe to reset here since there could be a timing hole with another task in this batch that hasn't hit Wait() yet
            // but if we dont reset here, then where CAN we reset?
            // countdown.Reset();
            return Thread.CurrentThread.ManagedThreadId;
        });
    });

    // Run the experiment
    const string experimentName = nameof(ThrowsArgumentExceptionWhenConcurrentTasksInvalid);
    await Scientist.ScienceAsync<int>(experimentName, concurrentTasks, experiment =>
    {
        // Add our control and experiments
        experiment.Use(task);
        for (int idx = 2; idx <= totalTasks; idx++)
        {
            experiment.Try($"experiment{idx}", task);
        }
    });

    var result = TestHelper.Results<int>(experimentName).First();

    // Organise observations into their concurrent batches
    var batchThreadIds = new Dictionary<int, List<int>>();
    for (int taskNo = 1; taskNo <= totalTasks; taskNo++)
    {
        // Get control or experiment
        var t = taskNo == 1
            ? result.Control
            : result.Candidates.First(x => x.Name == $"experiment{taskNo}");

        // Calculate which batch it was in
        var batch = (int)Math.Ceiling(1D * taskNo / concurrentTasks);

        // Add threadId to batch results
        if (batchThreadIds.ContainsKey(batch))
            batchThreadIds[batch].Add(t.Value);
        else
            batchThreadIds.Add(batch, new List<int>() { t.Value });
    }

    // Now assert each batch has unique thread IDs (proving the tasks ran concurrently)
    for (int batch = 1; batch <= batchThreadIds.Count; batch++)
    {
        Assert.Equal(batchThreadIds[batch].Count, batchThreadIds[batch].Distinct().Count());
    }
}

ryangribble commented 8 years ago

All good! I managed to get it working by using two CountdownEvents and the bool return from .Signal() indicating this was the call that dropped the count to zero... so now it's safe to reset the counter.

ryangribble commented 8 years ago

Ah damn, the test sometimes fails due to the randomised task order, it means I cant assume that Control + Experiment1 are in batch 1, and Experiment 2 +3 are in batch 2 etc. Ill need to figure out a way to know what tasks were in what batch!

ryangribble commented 8 years ago

OK i think it's good now :grinning:

haacked commented 8 years ago

Really nice work! I think that unit test with the CountDownEvent could alone make for a good blog post hint hint!

Thanks!

scientistproject / Scientist.net

Add support for running tasks concurrently #66