Closed thomasrolinger closed 1 year ago
I don't think I'll do the approach above, but maybe we should do the more general replicated-write approach, where we do the replication if the writes can be done as reductions. In the end, it would be nice to have a use case for Moldyn.
When we say that we can apply replication in the presence of writes when those writes can be formulated as reductions, we need to be clear what this really means. Historically, this is cited in works that were doing automatic parallelization of a loop. Consider the example below that computes the sum of the elements in an array:
```chapel
var sum = 0;
for i in arr.domain {
  sum += arr[i];
}
```
If we were to automatically parallelize this loop, we would reason that the concurrent updates to `sum` can be done via a reduction:
```chapel
var sum = 0;
forall i in arr.domain with (+ reduce sum) {
  sum += arr[i];
}
```
However, in our case we are explicitly looking at `forall` loops that are already parallelized. Consider the following example:
```chapel
A = 0;
forall i in B.domain {
  A[B[i]] += something;
}
```
Our selective data replication optimization would see that `A[B[i]]` is an irregular access and a candidate for replication. Each locale may create a copy of a given `A[B[i]]` element, and thus have its own version that it updates with the `+=` operator. The argument is that we can reduce these individual `A[B[i]]` copies to create the "original" `A[B[i]]` that would exist if we didn't do replication. We would want to do this atomically, like how an actual sum reduction is done.

However, is it actually necessary to do the reduction atomically? The fact is, the original code above has a race condition by design (from the programmer). Multiple tasks can be incrementing the same `A[B[i]]` element at the same time, so there is non-determinism in the result. Our optimization's goal is not to fix/detect race conditions. So why would we ensure the reduction is done atomically if the original program is non-deterministic?
In the case of Moldyn, it is a bit different because the operation we are reducing is already an atomic add/sub. But still, when we do the reduction after the kernel, we don't need to do any extra form of synchronization.
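To make the point concrete, here is a minimal single-locale sketch of the replicate-then-reduce idea (the per-task split and names like `copies` are illustrative, not the optimization's actual generated code): each task accumulates into a private copy, and the copies are merged back afterwards. Each element of each copy is combined exactly once in the merge loop, so no extra synchronization is needed there.

```chapel
config const n = 100;
var A: [0..#n] int;
var B: [0..#n] int = [i in 0..#n] i % 10;   // irregular indices

// One private copy of A per task (a stand-in for per-locale replication).
const numTasks = here.maxTaskPar;
var copies: [0..#numTasks] [0..#n] int;

coforall tid in 0..#numTasks {
  // Each task updates only its own copy, so this += does not race.
  for i in tid..(n-1) by numTasks do
    copies[tid][B[i]] += 1;                 // stand-in for "A[B[i]] += something"
}

// "Postamble": merge the copies back into A. Each element of A is
// combined from each copy exactly once, so no atomics are needed here.
for tid in 0..#numTasks do
  A += copies[tid];
```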
What this all comes down to is the following: if `A[B[i]]` is write-only in the loop (i.e., the writes we do are not read in the same iteration), and we assume that the user is well aware of the nature/consequences of concurrent writes, then we can move forward with possible replication. The plan:

- Perform the reduction after the executor `forall`. This is going to be something similar to the `executorPreamble()`. We can call it the `executorPostamble()`. However, like aggregation, we'll need to generalize the operation that is being performed (better yet, we could just DO aggregation, since that is what we'll want to do anyways).
- Relax the check that currently aborts on `ref t = A[B[i]]` and other potential writes to the candidate access. We need to bolster this with further checks that ensure the writes are supported (i.e., they are not read elsewhere in the loop). We'll also need to adjust our post-resolution checks which ensure that `A` cannot be modified in the loop.
- Insert the `executorPostamble()` after the executor `forall`.
With respect to the plan above, this is what we have done thus far: we created an `executorPostamble` that is specific to our use-case in Moldyn with atomics. I also needed to adjust the `executorPreambleRecords` (or rather create a copy of it) because when we gather the original values from `A`, we need to do atomic reads and atomic writes to the replicated copy. In our older work, we actually made an atomic destination aggregator for this purpose. So to merge this in more smoothly, I don't think we need a separate function for it; we just add the "if this is atomic, do this" logic to the original. To generalize the `executorPostamble`, we'll want to do something similar: one version for POD types and another version for records, which will handle atomics correctly (we need to do `.read()` within the aggregation call).

With the above, I got Moldyn to run, produce the right answers, and give about a 2x speed-up on 4BOG on 8 locales.
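As a sketch of what the records version of such a postamble might look like (the record, field names, and procedure name here are made up for illustration; the real transformation is generated by the compiler):

```chapel
record Particle {
  var x: real;           // read-only in the kernel
  var fx: atomic real;   // written via .add() in the kernel
}

// Hypothetical executorPostamble for records with atomic fields: fold a
// replicated copy back into the original array. The replicated copy's
// atomic fields are combined via .read() on the source and .add() on
// the destination.
proc executorPostambleSketch(ref orig: [?D] Particle, ref repl: [D] Particle) {
  forall i in D {
    // Only the written-to (atomic) fields participate in the reduction;
    // read-only fields like x are left untouched.
    orig[i].fx.add(repl[i].fx.read());
  }
}
```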
The next steps will be to make things a bit smoother and less hacky, and to start adding in the proper static checks. For example, the postamble should only pass in the field positions for the fields that are written to in the loop (in a valid way). Right now, our approach finds all reads/writes to the fields and adds them to the global vector. So we'll need a way to differentiate those. Also, if we do require that the elements are initialized with the identity for the operator, we can improve the preamble by just writing that value (rather than atomically reading the original one, in the case of Moldyn).
As an aside, maybe we can just restrict this use of the optimization to when we are dealing with atomic adds. We know we can do it, and we have a use-case. It will simplify some of the checks and code transformations we need to do. I am thinking that is "good enough" to demonstrate the idea. In that way, we don't have to generate an aggregation record type and things like that. We'll still want to ensure that the elements are not read within the loop, etc. But to identify the fields that are written to, it suffices to just find the calls to `.add()` on the appropriate fields. We can then do a simple search that looks at each `SymExpr` involving the fields and makes sure they are either the `.add()` within the loop or something completely outside of the loop. We then just need to determine that the fields are set to 0 prior to the `forall`.
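A hypothetical example of the restricted pattern these checks would accept: the atomic field is set to the identity before the loop, and inside the `forall` it is only ever touched through `.add()`:

```chapel
record Forces {
  var fx: atomic real;
}

config const n = 100;
var data: [0..#n] Forces;
var neighbors: [0..#n] int = [j in 0..#n] (j * 3) % n;

// The atomic fields hold the add-identity (0) before the loop...
forall d in data do d.fx.write(0.0);

// ...and inside the forall, the candidate's atomic field is only
// ever accessed through .add(): no reads, no plain assignments.
forall j in 0..#n {
  data[neighbors[j]].fx.add(1.0);   // stand-in for the real update
}
```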
Going off of the above: I think we're in a good place with the Chapel code doing what it needs to do.
The next step is to add the correct compiler checks:
- In `analyzeCandidateIE`, we originally ensured that if the candidate is something like `ref t = A[B[i]]`, then it was stored as a `const`. We need to relax this, and allow the candidate to go forward assuming we can show that `t` is only read within the loop, or that any write to `t` is in the form of `.add()`. I think this will be easier to check during prefolding, since we can actually show that we're dealing with atomics at that point. For now, leave this alone until the end, as it doesn't matter to code transformations.
- In `isIEOptimizationValid` we check whether we are accessing only POD fields within a record, and we log those POD fields as we see them. We need to augment this so we identify the fields that are atomic and involved in `.add()` calls. We store these field positions in a separate vector and use that to build up our call to `executorPostambleRecords`. -- DONE
- `executorPreambleRecords` needs to know when a field is atomic, since such a field has a special initialization where we set its value to the add-identity, which is 0. We would want to be sure at this point that all accesses to such atomic fields are through `.add()`. -- DONE
- Ensure that the atomic fields are only used via `.add()`. We can do this within the same check that determines they were involved in `.add()`; when we do this check, we are explicitly looking at uses of a `call_tmp` which holds the field, so our checks are limited to the scope of the `forall`. If we see any use of the field that is not the `.add()` or the expected `PRIM_MOVE` which puts the field within a `call_tmp`, then it is invalid. This could be more robust by ensuring that the `call_tmp` isn't passed into a function and then accessed. But that is a production concern. -- DONE
- If `t` is not `const ref` and we didn't find any atomic record field accesses, then we abort. -- DONE
- Check that the atomic fields are set to 0 prior to the `forall` loop. We can fill this in as time goes on, as it doesn't alter code transformations. It would be great to do this statically, but what would the overhead be to do it at runtime? In other words, insert a loop before the `forall` that checks whether the fields are 0. If it finds something that isn't 0, it sets a flag to abort the optimization. It is worth looking at the overhead of that. So far, I am seeing little to no overhead here. So we could create a Chapel routine that iterates over the array, looks at the atomic fields of the record, and does the check. It'll return true if we're good, false otherwise. The tricky part is setting up the if-then-else, since we don't want to do this check for all optimization use cases, but only when we have atomic fields used in the ways described above. But we don't know if we have such a case until at least prefolding. So the question is: how easily can we create an if-then-else during prefolding? I am thinking the issue will come down to the fact that we need to clone the original `forall` so that we run that when we don't pass our check. I don't know if that will work nicely during prefolding. So the better option may be to create the if-then-else and somehow blast it away if we don't actually need to do it.

Separate post for the last comment above, as it'll require some work.
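A sketch of what that runtime check could look like (the routine name and record are made up; the real check would be generated by the compiler for the fields it identified):

```chapel
record Forces {
  var fx: atomic real;
}

// Hypothetical runtime pre-check: returns true iff every atomic field
// we plan to reduce over currently holds the add-identity (0).
proc atomicFieldsAreIdentity(data: [] Forces): bool {
  var ok: atomic bool;
  ok.write(true);
  forall d in data do
    if d.fx.read() != 0.0 then
      ok.write(false);
  return ok.read();
}

var data: [0..#100] Forces;
if atomicFieldsAreIdentity(data) {
  // safe to run the optimized (replicated) clone of the forall
} else {
  // fall back to the original forall
}
```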
We want to add a call to something that checks that the record fields are the identity (0), and we need to do it during normalization. We can create a Chapel function that does this check.
I did the above, but by inserting a dummy call at normalization and then replacing it at prefolding with the call we want. We insert this within the `staticCheckCall` conditional.
Running all the small tests to make sure we didn't break anything. Also need to run Moldyn and make sure the overhead is still more or less 0.
If all goes well, I'll close this. We may want to go back later and do proper checks to ensure things that shouldn't be written in the loop are not written to.
As of now, we abort the optimization if we find `A[B[i]]` on the LHS of an operation, or if we have `ref t = A[B[i]]` (it is required to be `const ref`). In other words, we do not allow for writes to potentially replicated data.

I have an idea that wouldn't exactly support writes to replicated data, but would allow the optimization to still be applied for the read-only data. This is specifically for cases where `A` stores records and some fields are read in the loop while others are written to.

The best example of this is Moldyn. We have something like
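(Sketching the kind of loop being described here, using the field names mentioned below; this is an illustration, not the actual Moldyn code:)

```chapel
record Atom {
  var x, y, z: real;                   // only read in the loop
  var fx_a, fy_a, fz_a: atomic real;   // only written, via .add()
}

config const n = 16;
var data: [0..#n] Atom;
var neighbors: [0..#n] int = [j in 0..#n] (j + 1) % n;

forall j in 0..#n {
  ref data_j = data[neighbors[j]];     // the candidate irregular access
  // reads of x/y/z could be served from a replicated copy of data_j...
  const d = data_j.x + data_j.y + data_j.z;
  // ...but these writes would need to reach the original element:
  data_j.fx_a.add(d);
  data_j.fy_a.add(d);
  data_j.fz_a.add(d);
}
```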
The fields `x`, `y` and `z` are read-only, and from just that, we can replicate `data_j`. However, we then do writes to the fields `fx_a`, `fy_a` and `fz_a`.

The idea would be to recognize that some fields are read-only and some are not. For the fields that are written to, we replace `data_j` with the original `data[neighbors[j]]` expression, forcing the access to go through the original copy rather than the replicated one.

This may get a bit complicated, but it doesn't seem like it would be too difficult. However, it does mean we need to turn off our early check that `ref t = A[B[i]]` must be `const ref`. So that adds complications to how we determine whether we have writes or not earlier on in compilation.

For now, we can get Moldyn to work within the optimization by not using `data_j` in the writes, but by writing `const idx = neighbors[j]; data[idx].fx_a...` ourselves.