Support writes to replicated data #23

Closed thomasrolinger closed 1 year ago

thomasrolinger commented 2 years ago

As of now, we abort the optimization if we find A[B[i]] on the LHS of an operation, or if we have ref t = A[B[i]] (it is required to be const ref). In other words, we do not allow writes to potentially replicated data.

I have an idea that would not support writes to replicated data exactly, but would still allow the optimization to be applied to the read-only data. This is specifically for cases where A stores records and some fields are read in the loop while others are written to.

The best example of this is Moldyn, where we have something like:

forall ... {
   for j in ... {
      ref data_j = data[neighbors[j]];
      ... = data_j.x;   // x, y, z are only read
      ... = data_j.y;
      ... = data_j.z;
      if ... {
         data_j.fx_a.sub(...);   // fx_a, fy_a, fz_a are written (atomic subs)
         data_j.fy_a.sub(...);
         data_j.fz_a.sub(...);
      }
   }
}

The accesses to the fields x, y, and z are read-only, and from that alone we could replicate data_j. However, we then do writes to the fields fx_a, fy_a, and fz_a.

The idea would be to recognize that some fields are read-only and some are not. For the fields that are written to, we replace data_j with the original data[neighbors[j]] expression, forcing the access to go through the original copy rather than the replicated one.

This may get a bit complicated, but it doesn't seem like it would be too difficult. It does mean we need to turn off our early check that ref t = A[B[i]] must be const ref, which adds complications to how we determine, earlier on in compilation, whether we have writes.

For now, we can get Moldyn to work within the optimization by not using data_j in the writes, and instead writing const idx = neighbors[j]; data[idx].fx_a... ourselves.
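
To make that concrete, here is a sketch of what the workaround could look like in the Moldyn loop body. The names workDomain and neighborIdxs are placeholders for illustration, not the actual Moldyn code:

forall i in workDomain {
   for j in neighborIdxs(i) {
      ref data_j = data[neighbors[j]];    // replicated: reads only
      const dx = data_j.x, dy = data_j.y, dz = data_j.z;
      const idx = neighbors[j];           // bypass data_j for the writes,
      data[idx].fx_a.sub(dx);             // so they go to the original copy
      data[idx].fy_a.sub(dy);
      data[idx].fz_a.sub(dz);
   }
}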

thomasrolinger commented 2 years ago

I don’t think I’ll do the approach above, but maybe we should do the more general replicated-write approach, where we do the replication if the writes can be done as reductions. In the end, it would be nice to have Moldyn as a use case.

thomasrolinger commented 2 years ago

Thoughts

When we say that we can apply replication in the presence of writes when those writes can be formulated as reductions, we need to be clear about what this really means. Historically, this idea is cited in works on automatic loop parallelization. Consider the example below, which computes the sum of the elements in an array:

var sum = 0;
for i in arr.domain {
   sum += arr[i];
}

If we were to automatically parallelize this loop, we would reason that the concurrent updates to sum can be done via a reduction:

var sum = 0;
forall i in arr.domain with (+ reduce sum) {
   sum += arr[i];
}

However, in our case we are explicitly looking at forall loops that are already parallelized. Consider the following example:

A = 0;
forall i in B.domain {
   A[B[i]] += something;
}

Our selective data replication optimization would see that A[B[i]] is an irregular access and a candidate for replication. Each locale may create a copy of a given A[B[i]] element, and thus have its own version that it updates with the += operator. The argument is that we can reduce these individual A[B[i]] copies to create the "original" A[B[i]] that would exist if we hadn't done replication. We would want to do this atomically, just as an actual sum reduction is done.
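
As a sketch of that argument, with copies standing in for a hypothetical per-locale replica array indexed by locale id: if A starts at the identity, the post-forall reduction just folds every locale's copy of A[j] back into A[j].

// Assumed setup: copies[loc.id] holds locale loc's replica of A.
forall j in A.domain {
   for loc in Locales do
      A[j] += copies[loc.id][j];   // each task owns a distinct j, so no race
}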

However, is it actually necessary to do the reduction atomically? The fact is, the original code above has a race condition by design (from the programmer). Multiple tasks can be incrementing the same A[B[i]] element at the same time, so there is non-determinism in the result. Our optimization's goal is not to fix/detect race conditions. So why would we ensure the reduction is done atomically if the original program is non-deterministic?

In the case of Moldyn, it is a bit different because the operation we are reducing is already an atomic add/sub. But still, when we do the reduction after the kernel, we don't need to do any extra form of synchronization.

What this all comes down to is the following:

Plan

  1. Write the Chapel code that we will call to perform the reduction after the forall. This is going to be something similar to the executorPreamble(); we can call it the executorPostamble() (see the sketch after this list). However, like aggregation, we'll need to generalize the operation being performed (better yet, we could just DO aggregation, since that is what we'll want to do anyway).
  2. Adjust our static analysis to allow for ref t = A[B[i]] and other potential writes to the candidate access. We need to bolster this with further checks that ensure the writes are supported (i.e., the written elements are not read elsewhere in the loop). We'll also need to adjust our post-resolution checks, which currently ensure that A cannot be modified in the loop.
  3. Add the code transformation task/pass that adds the call to executorPostamble() after the executor forall.
  4. Go back and add static checks to ensure, for example, that the elements being written to are set to the operator identity before the forall.
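
A minimal sketch of what the executorPostamble from step (1) might look like for the Moldyn-style atomic case. The names here are assumptions for illustration: replicas stands for the per-locale copies, localIdxs() and localCopy() are hypothetical accessors, and fx_a is an atomic field updated via .add() in the loop.

proc executorPostamble(ref data, replicas) {
   coforall loc in Locales do on loc {
      // Fold this locale's partial results back into the original array:
      // .read() the replicated atomic, then .add() it to the original.
      forall idx in replicas.localIdxs() {
         data[idx].fx_a.add(replicas.localCopy(idx).fx_a.read());
      }
   }
}
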
thomasrolinger commented 2 years ago

Progress

With respect to the plan above, this is what we have done thus far:

  1. I wrote a prototype executorPostamble that is specific to our use case in Moldyn with atomics. I also needed to adjust executorPreambleRecords (or rather, create a copy of it) because when we gather the original values from A, we need to do atomic reads and atomic writes to the replicated copy (see the sketch after this list). In our older work, we actually made an atomic destination aggregator for this purpose. So to merge this in more smoothly, I don't think we need a separate function for it; we just add an "if this is atomic, do this" case to the original. To generalize the executorPostamble, we'll want to do something similar: one version for POD types and another for records, which will handle atomics correctly (we need to do .read() within the aggregation call).
  2. I commented out the checks we do during normalization that would not allow for things to be on the LHS, etc. This is just so we can get an example to run. We'll need to actually do the proper checks.
  3. I manually inserted calls to the correct preamble and postamble for moldyn. Again, temporary for now.
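
For reference, the atomic-aware gather described in (1) amounts to something like the following, where replica and myReplicatedIdxs are assumed names:

// Unlike the POD case, both sides of the copy go through atomic ops:
// an atomic .read() from the original, an atomic .write() into the replica.
forall idx in myReplicatedIdxs {
   replica[idx].fx_a.write(data[idx].fx_a.read());
}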

With the above, I got moldyn to run, produce the right answers and give about a 2x speed-up on 4BOG on 8 locales.

The next steps will be to make things a bit smoother and less hacky, and to start adding the proper static checks. For example, the postamble should only be passed the field positions for the fields that are written to in the loop (in a valid way). Right now, our approach finds all reads/writes to the fields and adds them to the global vector, so we'll need a way to differentiate those. Also, if we do require that the elements are initialized with the identity for the operator, we can improve the preamble by just writing that value (rather than atomically reading the original one, in the case of Moldyn).

As an aside, maybe we can just restrict this use of the optimization to cases where we are dealing with atomic adds. We know we can do it, and we have a use case. It will simplify some of the checks and code transformations we need to do. I am thinking that is "good enough" to demonstrate the idea. That way, we don't have to generate an aggregation record type and things like that. We'll still want to ensure that the elements are not read within the loop, etc. But to identify the fields that are written to, it suffices to find the calls to .add() on the appropriate fields. We can then do a simple search that looks at each SymExpr involving the fields and makes sure each is either the .add() within the loop or something completely outside of the loop. We then just need to determine that the fields are set to 0 prior to the forall.
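
To make those checks concrete, this is roughly the source pattern the restricted optimization would accept (workDomain is a placeholder name):

// The atomic fields hold the identity for + before the kernel...
forall d in data do d.fx_a.write(0);

forall i in workDomain {
   ref data_j = data[neighbors[i]];
   // ...and the only use of fx_a inside the loop is .add():
   data_j.fx_a.add(1.0);
}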

thomasrolinger commented 1 year ago

Going off of the above: I think we're in a good place with the Chapel code doing what it needs to do.

The next step is to add the correct compiler checks:

  1. During analyzeCandidateIE, we originally ensured that if the candidate is something like ref t = A[B[i]] then it was stored as a const. We need to relax this, and allow the candidate to go forward assuming we can show that t is only read within the loop, or any write to t is in the form of .add(). I think this will be easier to check during prefolding, since we can actually show that we're dealing with atomics at that point. For now, leave this alone until the end as it doesn't matter to code transformations.
  2. During isIEOptimizationValid we check whether we are accessing only POD fields within a record, and we log those POD fields as we see them. We need to augment this so we identify the fields that are atomic and involved within .add() calls. We store these field positions in a separate vector and use that to build up our call to executorPostambleRecords. -- DONE
  3. While doing (2), we will then know whether at least one field going into executorPreambleRecords is atomic; if so, it gets a special initialization where we set its value to the add identity, which is 0. We would want to be sure at this point that all accesses to such atomic fields are through .add(). -- DONE
  4. So from the above, we need a check somewhere that ensures all uses of these fields are through .add(). We can do this within the same check that determines they were involved in .add(): when we do this check, we are explicitly looking at uses of a call_tmp which holds the field, so our checks are limited to the scope of the forall. If we see any use of the field that is not .add() or the expected PRIM_MOVE which puts the field within a call_tmp, then it is invalid. This could be made more robust by ensuring that the call_tmp isn't passed into a function and then accessed, but that is a production concern. -- DONE
  5. We can address (1) at this point: if t is not const ref and we didn't find any atomic record field accesses, then we abort. -- DONE
  6. We also need a way to ensure that the other, non-atomic fields in the record are read-only in the loop. That sounds like a job for post-resolution. Let's save this for another time, as it doesn't invalidate our use cases right now. There's nothing difficult about doing it, just tedious. -- Do later
  7. The only other piece that is missing is ensuring that the fields used in the atomic adds are initialized to 0 prior to the forall loop. We can fill this in as time goes on, as it doesn't alter the code transformations. It would be great to do this statically, but what would the overhead be of doing it at runtime? In other words, insert a loop before the forall that checks whether the fields are 0; if it finds something that isn't 0, it sets a flag to abort the optimization. It is worth looking at the overhead of that, and so far I am seeing little to no overhead here. So we could create a Chapel routine that iterates over the array, looks at the atomic fields of the record, and does the check, returning true if we're good and false otherwise (sketched below). The tricky part is setting up the if-then-else, since we don't want to do this check for all optimization use cases, but only when we have atomic fields used in the ways described above. But we don't know whether we have such a case until at least prefolding. So the question is how easily we can create an if-then-else during prefolding. I am thinking the issue will come down to the fact that we need to clone the original forall so that we run it when we don't pass our check, and I don't know whether that will work nicely during prefolding. So the better option may be to create the if-then-else early and somehow blast it away if we don't actually need it.
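
A sketch of the runtime check from (7), under the assumptions above (field names from the Moldyn example; runOptimizedForall and runOriginalForall stand in for the two loop versions):

// Returns true iff every atomic field still holds the identity (0),
// i.e., the optimized path is safe to take.
proc atomicFieldsAreIdentity(data) : bool {
   var foundNonZero : atomic bool;
   forall d in data do
      if d.fx_a.read() != 0 || d.fy_a.read() != 0 || d.fz_a.read() != 0 then
         foundNonZero.write(true);
   return !foundNonZero.read();
}

// The transformation would then guard the optimized kernel:
if atomicFieldsAreIdentity(data) then
   runOptimizedForall();   // replicated version + executorPostamble
else
   runOriginalForall();    // clone of the unoptimized forall
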
thomasrolinger commented 1 year ago

Separate post for the last item above, as it'll require some work.

We want to add a call to something that checks that the record fields are the identity (0), and we need to do it during normalization. We can create a Chapel function that does this check.

I did the above by inserting a dummy call at normalization and then replacing it at prefolding with the call we want. We insert this within the staticCheckCall conditional.

Running all the small tests to make sure we didn't break anything. Also need to run moldyn and make sure the overhead is still more or less 0.

If all goes well, I'll close this. We may want to go back later and do proper checks to ensure that things that shouldn't be written in the loop are not written to.