Closed majosm closed 2 months ago
In build_global_storage_to_sweep_map
(called from ArrayToBufferMap
, and ultimately from precompute_for_single_kernel
), I'm seeing some strange(?) behavior in the call to intersect_range
(code).
The inputs/outputs look like this:
# Before call to intersect_range
global_stor2sweep=Map("{ [_loopy_storage_0', _loopy_storage_1'] -> [idof_ensm5 = _loopy_storage_1', iel_ensm5 = _loopy_storage_0'] }")
domain_dup_sweep=BasicSet("{ [idof_ensm5, iel_ensm5] : iel_ensm5 = 0 and 0 <= idof_ensm5 <= 3 }")
# After call to intersect_range
global_stor2sweep=Map("{ [_loopy_storage_0' = 0, _loopy_storage_1'] -> [idof_ensm5 = _loopy_storage_1', iel_ensm5 = 0] : 0 <= _loopy_storage_1' <= 3 }")
Both _loopy_storage_0'
and iel_ensm5
are length-1 loops here. It looks like it's losing the equality between _loopy_storage_0'
and iel_ensm5
and replacing it with 0?
From what I can tell what happens subsequently is that ArrayToBufferMap
computes the storage bounds and gets:
# iel axis; should be Variable('iel_ensm5')?
base_index=0
# idof axis
base_index=Variable('idof_ensm5')
and from there iel_ensm5
gets lost.
If I comment out the call to intersect_range
, I no longer get the loop nest error and the simulation completes successfully. That leads me to believe that this is in fact the culprit. I don't know how to fix it though (I'm assuming just commenting it out isn't sufficient). @inducer Thoughts?
@inducer I tried generalizing the FusionContractorArrayContext
to handle loops without an element axis. Dealing with the loop splitting was straightforward; however, the topo sorting, barrier inserting, and temporary aliasing steps also depend on having an element loop, and generalizing these has proved to be more annoying. So, I've gone back to looking at this on the loopy side to see if there's anything we could do here instead that would be simpler.
I was thinking, if we could somehow get away with telling ArrayToBufferMap
less about the bounds of the non-swept inames, then it might not be able to optimize them out? I'm still struggling somewhat to understand the precompute_for_single_kernel
code, but I took a stab at doing this (code). Surprisingly, it works for the prediction case. It fails on some of the test_apps
tests, though:
# compute bounds for each storage axis
storage_domain = bounds_footprint_map.domain().coalesce()
if not storage_domain.is_bounded():
> raise RuntimeError("sweep did not result in a bounded storage domain")
E RuntimeError: sweep did not result in a bounded storage domain
../loopy/transform/array_buffer_map.py:183: RuntimeError
I suspect the difference in behavior has something to do with sweep_inames
being empty for the transformations in the prediction case vs. not empty here. I just wanted to get your thoughts on whether doing something along these lines could potentially be viable, or if going through and reworking the array context transformations is the only way?
illinois-ceesd/mirgecom#944 seems to be caused by a loss of certain length-1 loops during loopy processing.
(Creating an issue for discussion. More details to follow.)