stfc / RegentParticleDSL

A particle-method DSL based on the Regent programming language
MIT License

Still water case - performance #74

Closed: LonelyCat124 closed this issue 3 years ago

LonelyCat124 commented 3 years ago

I'd like to look at the still water example for performance - except I'll set velocities/accelerations artificially to 0, since I know these aren't correct right now.

I plan to run on SFP and see how it goes.

LonelyCat124 commented 3 years ago

Ok, so the initial runtime for 64k particles is roughly 2 seconds per step, which is fine for serial for now. (You could definitely get better performance with native C code; however, we're already lacking a lot of optimisations which we can look into later, once parallelism exists.)

The issue is that adding CPU threads doesn't provide any improvement in performance. My initial guess is that either: 1) the runtime doesn't believe the tasks can be executed in parallel, or 2) the runtime can't compute/create the task requirements quickly enough to gain anything from parallelism.

There is also potentially the issue that the main task is not an "inner" task, meaning the data has to be mapped in the main task, and this could reduce performance (though the work done in the main task should be complete by the time the compute tasks launch).
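
For reference, a rough sketch (not this repo's actual code - the part fspace, task names and sizes below are placeholders, and I'm assuming the standard __demand(__inner) annotation) of what marking the top-level task as an inner task looks like:

  import "regent"

  -- Placeholder particle type; the real DSL generates its own field space.
  fspace part {
    density : double
  }

  task self_task(parts : region(ispace(int1d), part))
  where reads writes(parts.density) do
    -- per-cell particle interactions would go here
  end

  -- An "inner" task only creates regions/partitions and launches sub-tasks;
  -- it never touches region data directly, so the runtime does not have to
  -- map physical instances for it before it can start running.
  __demand(__inner)
  task main_task()
    var parts = region(ispace(int1d, 65536), part)
    var cells = partition(equal, parts, ispace(int1d, 100))
    -- (initialisation elided) all real work happens in the launched tasks
    for c in cells.colors do
      self_task(cells[c])
    end
  end

  regentlib.start(main_task)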

I'm going to try running with legion_prof for a single output and see if I can render that profile.
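
For the record, this is roughly the invocation I mean (the binary name and log paths are just examples; -lg:prof and -lg:prof_logfile are the standard Legion profiler flags, as far as I know):

  # Run with 16 CPU and 4 utility processors, profiling 1 node and writing
  # gzipped logs (the % is replaced by the node number).
  ./still_water -ll:cpu 16 -ll:util 4 -lg:prof 1 -lg:prof_logfile prof_%.gz

  # Render the logs into a browsable profile with Legion's tool.
  legion_prof.py prof_*.gz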

LonelyCat124 commented 3 years ago

Ok, so I messed up my inputs. The runtime for 27 steps:

1 CPU thread, 4 util threads: 58s
16 CPU threads, 4 util threads: 34s
28 CPU threads, 4 util threads: 35s

Looking at the profile, most of the threads are idle, even while (for the most part) the util threads are quiet.

At best I get around 4 CPUs actively running tasks, but even then, two of those are self tasks (which never have work dependent on other self tasks). I'm going to look at Legion Spy to see why there are dependencies between tasks.

LonelyCat124 commented 3 years ago

Trying to dig into this now, I have a very small spy + profile setup to look at. I'm having some issues getting legion_prof to read the spy file in, so I'm looking at that at the moment.
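
For anyone following along, the setup I'm trying looks roughly like this (flag names as I understand them from the Legion debugging docs; the binary and file names are placeholders):

  # Detailed Legion Spy logging alongside the normal profiler logs.
  ./still_water -ll:cpu 1 -ll:util 4 \
      -lg:spy -level legion_spy=2 -logfile spy_%.log \
      -lg:prof 1 -lg:prof_logfile prof_%.gz

  # Post-process the spy logs (graph generation flags may differ by version).
  legion_spy.py -dez spy_*.log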

LonelyCat124 commented 3 years ago

So from looking at the critical path in the profile it appears as though the tasks are perhaps being serialized.

Also, the tradequeue section is heavily limited by the util tasks, as the individual tradequeue tasks are too small to be worthwhile relative to the runtime's analysis overhead, but I'm less worried about those for now.

LonelyCat124 commented 3 years ago

So I attempted to switch to an atomic coherence model; the performance and profiles don't look particularly affected, although the critical path seems to be followed less closely. I tried adding __demand(__trace) around some loops that should always be identical, but it didn't seem to work as expected. I'm going to have another look at that, or else ask the Legion people directly.
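
For context, coherence is declared per region argument in a task's where clause, alongside the privileges. A rough sketch (placeholder task/field names again, reusing the part fspace from the earlier sketch) of the difference between the default exclusive mode and atomic:

  -- With no coherence mode given, the default is exclusive: tasks with
  -- interfering privileges on overlapping data run in program order.
  task pair_task_exclusive(parts1 : region(ispace(int1d), part),
                           parts2 : region(ispace(int1d), part))
  where reads writes(parts1.density), reads(parts2.density) do
    -- interaction kernel would go here
  end

  -- Atomic coherence relaxes that: the runtime only promises the accesses
  -- to the shared data happen atomically, not that program order is kept.
  task pair_task_atomic(parts1 : region(ispace(int1d), part),
                        parts2 : region(ispace(int1d), part))
  where reads writes(parts1.density), reads(parts2.density),
        atomic(parts1), atomic(parts2) do
    -- interaction kernel would go here
  end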

LonelyCat124 commented 3 years ago

So it seems like the issue at the moment is that the loop goes from 0->x with another 0->y loop inside, and the tracing code expected each iteration of the 0->x loop to contain an identical number of task launches.

There's a rather awkward workaround for this:

  __demand(__trace)
  for a = 0, 1 do
    for x = 0, 10 do
      for y = 0, 10 do
        if (x - y < 2) and (x - y > -2) then
          var bit : int2d = int2d({x, y})
          test_task([things_partition][bit])
        end
      end
    end
  end

which creates an extra "loop" (that only ever executes a single iteration) and traces that loop around the actual code. This behaviour feels strange (as it feels more like it's tracing the inner loop), so I'll ask the Legion team about it. It's easy enough to add this bogus loop, but it feels ugly.

LonelyCat124 commented 3 years ago

Ok, so even with tracing and the atomic coherence mode the tasks basically execute serially, despite Legion Spy showing no dependencies between them, and the dependency analysis being complete in advance.

My assumption is that the tasks being on the same region somehow prevents their launch, but I'm not clear why that would be the case. For now I'm going to assume it's something to do with how atomic coherence works, and try running again with the exclusive model to see how that changes things.

LonelyCat124 commented 3 years ago

Ok, so I hadn't switched the self-particle tasks back to exclusive coherence. I'm doing that now to get a new profile to analyse, but essentially the parallelism appears similar, just with dependencies now showing between tasks.

With exclusive coherence back on, most of the tasks are limited by dependencies - the ready time for every task is very low (<10 us), which means they're executing almost as soon as they become ready.

I could try changing back to launching all the self tasks first, with exclusive and then atomic coherence, and see if that is sufficient to run at least those tasks in parallel.
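
Concretely, "launching all the self tasks first" just means splitting the launch loop so the independent self interactions (x == y here) are all issued before any pair interactions. A sketch mirroring the snippet above (so self_task, pair_task and things_partition are placeholders, not the DSL's real names):

  -- Issue every self-interaction first: these never depend on each other,
  -- so they can all run in parallel while the pair tasks queue up behind.
  for x = 0, 10 do
    var bit : int2d = int2d({x, x})
    self_task([things_partition][bit])
  end
  -- Then issue the pair interactions between neighbouring cells.
  for x = 0, 10 do
    for y = 0, 10 do
      if (x ~= y) and (x - y < 2) and (x - y > -2) then
        var bit : int2d = int2d({x, y})
        pair_task([things_partition][bit])
      end
    end
  end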

LonelyCat124 commented 3 years ago

[profile screenshot]

So launching the self tasks first works pretty well - some of the tasks are lightweight, so the parallelism isn't perfect, but it's good.

I need to investigate what happens if I use the atomic coherence model next.

LonelyCat124 commented 3 years ago

Ok, with atomic coherence (which I expected to be less strict, but apparently is not) these tasks are re-serialised:

[profile screenshot]

I'll have to have a go with a very trivial example to see how this works, but this is not the behaviour I was expecting, since I expected atomicity to apply to partition elements, not the full region.

LonelyCat124 commented 3 years ago

So it looks like atomic is locking the whole region. The solution here is twofold:

1) Use += syntax where possible.
2) Be able to analyse the +=, -=, *= etc. syntax to create reduction clauses.

This means #45 is probably critical to performance ASAP. (A sketch of the kind of reduction declaration this would generate is below.)
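
For reference, a rough sketch of the kind of task declaration point 2 would generate (placeholder names again, reusing the part fspace from the earlier sketch; the DSL's real generated tasks will differ). With a reduces privilege the runtime can fold contributions from multiple tasks into overlapping data without serialising on the whole region:

  task accumulate_density(parts1 : region(ispace(int1d), part),
                          parts2 : region(ispace(int1d), part))
  where reduces +(parts1.density), reads(parts2.density) do
    for i in parts1.ispace do
      for j in parts2.ispace do
        -- placeholder for the real kernel contribution; only += is legal
        -- on a field held with a "reduces +" privilege
        parts1[i].density += 0.5 * parts2[j].density
      end
    end
  end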

LonelyCat124 commented 3 years ago

Parallel performance is now much better now that #77 has merged.