inducer / loopy

A code generator for array-based code on CPUs and GPUs
http://mathema.tician.de/software/loopy
MIT License

Loss of length-1 inames #809

Closed: majosm closed this issue 2 months ago

majosm commented 9 months ago

illinois-ceesd/mirgecom#944 seems to be caused by a loss of certain length-1 loops during loopy processing.

(Creating an issue for discussion. More details to follow.)

majosm commented 9 months ago

In build_global_storage_to_sweep_map (called from ArrayToBufferMap, and ultimately from precompute_for_single_kernel), I'm seeing some strange(?) behavior in the call to intersect_range (code).

The inputs/outputs look like this:

# Before call to intersect_range
global_stor2sweep=Map("{ [_loopy_storage_0', _loopy_storage_1'] -> [idof_ensm5 = _loopy_storage_1', iel_ensm5 = _loopy_storage_0'] }")
domain_dup_sweep=BasicSet("{ [idof_ensm5, iel_ensm5] : iel_ensm5 = 0 and 0 <= idof_ensm5 <= 3 }")

# After call to intersect_range
global_stor2sweep=Map("{ [_loopy_storage_0' = 0, _loopy_storage_1'] -> [idof_ensm5 = _loopy_storage_1', iel_ensm5 = 0] : 0 <= _loopy_storage_1' <= 3 }")

Both _loopy_storage_0' and iel_ensm5 correspond to length-1 loops here. It looks like intersect_range is dropping the equality between _loopy_storage_0' and iel_ensm5 and replacing it with the constant 0?
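
For reference, here is a minimal standalone islpy sketch that reproduces the simplification (primes dropped from the storage dim names for readability; this is not loopy's actual call site):

import islpy as isl

stor2sweep = isl.Map(
    "{ [_loopy_storage_0, _loopy_storage_1] "
    "-> [idof_ensm5 = _loopy_storage_1, iel_ensm5 = _loopy_storage_0] }")
dom_dup_sweep = isl.Set(
    "{ [idof_ensm5, iel_ensm5] : iel_ensm5 = 0 and 0 <= idof_ensm5 <= 3 }")

# intersect_range folds iel_ensm5 = 0 back into the map, so the symbolic
# equality iel_ensm5 = _loopy_storage_0 becomes the constant 0 on both sides.
print(stor2sweep.intersect_range(dom_dup_sweep))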

From what I can tell, ArrayToBufferMap then computes the storage bounds and gets:

# iel axis; should be Variable('iel_ensm5')?
base_index=0

# idof axis
base_index=Variable('idof_ensm5')

and from there iel_ensm5 gets lost.
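
For what it's worth, here's a rough islpy illustration of why the base index for the iel axis comes out as the constant 0 once the equality has been folded into the domain. This is just a standalone sketch using isl's dim_min, not loopy's actual code path:

import islpy as isl

# Once iel_ensm5 is pinned to 0 in the domain, minimizing over that axis
# can only produce the constant 0, never the iname itself.
dom = isl.Set(
    "{ [idof_ensm5, iel_ensm5] : iel_ensm5 = 0 and 0 <= idof_ensm5 <= 3 }")
print(dom.dim_min(1))  # a constant-0 piecewise quasi-affine expression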

If I comment out the call to intersect_range, I no longer get the loop nest error and the simulation completes successfully. That leads me to believe this call is in fact the culprit. I don't know how to fix it properly, though (I'm assuming just commenting it out isn't sufficient). @inducer Thoughts?

majosm commented 9 months ago

@inducer I tried generalizing the FusionContractorArrayContext to handle loops without an element axis. Dealing with the loop splitting was straightforward; however, the topological sorting, barrier insertion, and temporary aliasing steps also depend on having an element loop, and generalizing those has proved more annoying. So I've gone back to looking at this on the loopy side to see if there's something simpler we could do here instead.

I was thinking: if we could somehow get away with telling ArrayToBufferMap less about the bounds of the non-swept inames, then maybe it wouldn't be able to optimize them out? I'm still struggling somewhat to understand the precompute_for_single_kernel code, but I took a stab at doing this (code). Surprisingly, it works for the prediction case. It fails on some of the test_apps tests, though:

        # compute bounds for each storage axis
        storage_domain = bounds_footprint_map.domain().coalesce()

        if not storage_domain.is_bounded():
>           raise RuntimeError("sweep did not result in a bounded storage domain")
E           RuntimeError: sweep did not result in a bounded storage domain

../loopy/transform/array_buffer_map.py:183: RuntimeError
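
For context, is_bounded is isl's boundedness check on the storage domain. A minimal made-up example of the kind of domain that trips it (not the actual domain from the failing test):

import islpy as isl

# A domain with an axis left unconstrained above is unbounded, which is
# what the check in array_buffer_map.py guards against.
print(isl.Set("{ [i] : i >= 0 }").is_bounded())       # False
print(isl.Set("{ [i] : 0 <= i <= 3 }").is_bounded())  # True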

I suspect the difference in behavior has something to do with sweep_inames being empty for the transformations in the prediction case vs. non-empty here. I just wanted to get your thoughts on whether something along these lines could be viable, or whether going through and reworking the array context transformations is the only way?