inducer / loopy

A code generator for array-based code on CPUs and GPUs
http://mathema.tician.de/software/loopy
MIT License
588 stars 73 forks source link

Vectorization error with predicates depending on vectorized loop #615

Open kaushikcfd opened 2 years ago

kaushikcfd commented 2 years ago

The following kernel --

knl = lp.make_kernel(
    "{[i, j]: 0<=i<100 and 0<=j<4}",
    """
    for i
        for j
            <> tmp1[j] = i+j
            <> tmp2[j] = 0
            if j
                tmp2[j] = 2 * tmp1[j]
            end
            out[i, j] = 2*tmp2[j]
        end
    end
    """, seq_dependencies=True)

knl = lp.tag_array_axes(knl, "tmp1,tmp2", "vec")
knl = lp.tag_inames(knl, "j:vec")

print(lp.generate_code_v2(knl).device_code())

generates:

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global int *__restrict__ out)
{
  int4 tmp1;
  int4 tmp2;

  for (int i = 0; i <= 99; ++i)
  {
    tmp1.s0 = i;
    tmp1.s1 = i + 1;
    tmp1.s2 = i + 2;
    tmp1.s3 = i + 3;
    tmp2 = 0;
    if (j)
      tmp2 = 2 * tmp1;
    out[4 * i] = 2 * tmp2.s0;
    out[1 + 4 * i] = 2 * tmp2.s1;
    out[2 + 4 * i] = 2 * tmp2.s2;
    out[3 + 4 * i] = 2 * tmp2.s3;
  }
}

Notice the stray (j) in the conditional. A short term solution which is not vectorizing such instructions will be included as a part of #557.

/cc @sv2518

kaushikcfd commented 2 years ago

@inducer: I think this is unavoidable for the kernels that @sv2518 is working with. Since changing the logic for emitting predicates in loopy.codegen.instruction would be too intrusive, I propose that loop distribution would give us an easy fallback as follows:

tunit = lp.make_kernel(
    "{[i, j]: 0<=i<100 and 0<=j<4}",
    """
    for i
        for j
            <> tmp1[j] = i+j
            <> tmp2[j] = 0
            if j>2
                tmp2[j] = 2 * tmp1[j]  {id=w_tmp2}
            end
            out[i, j] = 2*tmp2[j]
        end
    end
    """, seq_dependencies=True)

tunit = lp.tag_array_axes(tunit, "tmp1,tmp2", "vec")
tunit = lp.tag_inames(tunit, "j:vec")

# {{{ fallback -->

knl = lp.distribute_loops(tunit.default_entrypoint,
                          "id:w_tmp2",
                          outer_inames=frozenset("i"))
renamed_j, = knl.id_to_insn["w_tmp2"].within_inames - {"i"}
knl = lp.untag_inames(knl, renamed_j, VectorizeTag)
knl = lp.tag_inames(knl, {renamed_j: "unr"})

# }}}

tunit = tunit.with_kernel(knl)
print(lp.generate_code_v2(tunit).device_code())

With the potential use-case for loop distribution, any opinions on (me) moving forward with it?

kaushikcfd commented 2 years ago

@sv2518: If you want to access this functionality, please checkout the branch cvec_x_distribute_loops.

sv2518 commented 2 years ago

Thanks!

sv2518 commented 2 years ago

cvec_x_distribute_loops is not updated to the new version of the c_vecextensions_target, is it?

kaushikcfd commented 2 years ago

cvec_x_distribute_loops is not updated to the new version of the c_vecextensions_target, is it?

Oops, yep. Messed up while cherry-picking the commits. Pushed a fix.

sv2518 commented 2 years ago

Thanks!

Is the fallback code meant to go in our codebase or can this be done in loopy?

I added this snippet to PyOP2

        all_insn_cinsn = list(insn for insn in wrapper.default_entrypoint.instructions  if isinstance(insn, lp.CInstruction))
        # {{{ fallback -->
        for insn in all_insn_cinsn:
            wrapper = lp.distribute_loops(wrapper.default_entrypoint,
                                            insn.id,
                                            outer_inames=shifted_iname)
            renamed_j, = wrapper.id_to_insn[insn.id].within_inames - shifted_iname
            wrapper = lp.untag_inames(wrapper, renamed_j, VectorizeTag)
            wrapper = lp.tag_inames(wrapper, {renamed_j: "unr"})

        # }}}

but I run into the following error. Did I drive this wrong?

../PyOP2/pyop2/global_kernel.py:501: in vectorise
    wrapper = lp.distribute_loops(wrapper.default_entrypoint,
../loopy/loopy/transform/iname.py:1179: in wrapper
    transformed_kernel = transformation_func(kernel, *args, **kwargs)
../loopy/loopy/transform/loop_distribution.py:337: in distribute_loops
    within = match.parse_match(insn_match)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

expr = 'mtf__cond'

    def parse_match(expr):
        """Syntax examples::

        * ``id:yoink and writes:a_temp``
        * ``id:yoink and (not writes:a_temp or tag:input)``
        """
        if not expr:
            return All()

        def parse_terminal(pstate):
            next_tag = pstate.next_tag()
            if next_tag is _id:
                result = Id(pstate.next_match_obj().group(1))
                pstate.advance()
                return result
            elif next_tag is _tag:
                result = Tagged(pstate.next_match_obj().group(1))
                pstate.advance()
                return result
            elif next_tag is _writes:
                result = Writes(pstate.next_match_obj().group(1))
                pstate.advance()
                return result
            elif next_tag is _reads:
                result = Reads(pstate.next_match_obj().group(1))
                pstate.advance()
                return result
            elif next_tag is _in_kernel:
                result = InKernel(pstate.next_match_obj().group(1))
                pstate.advance()
                return result
            elif next_tag is _iname:
                result = Iname(pstate.next_match_obj().group(1))
                pstate.advance()
                return result
            else:
                pstate.expected("terminal")

        def inner_parse(pstate, min_precedence=0):
            pstate.expect_not_end()

            if pstate.is_next(_not):
                pstate.advance()
                left_query = Not(inner_parse(pstate, _PREC_NOT))
            elif pstate.is_next(_openpar):
                pstate.advance()
                left_query = inner_parse(pstate)
                pstate.expect(_closepar)
                pstate.advance()
            else:
                left_query = parse_terminal(pstate)

            did_something = True
            while did_something:
                did_something = False
                if pstate.is_at_end():
                    return left_query

                next_tag = pstate.next_tag()

                if next_tag is _and and _PREC_AND > min_precedence:
                    pstate.advance()
                    left_query = And(
                            (left_query, inner_parse(pstate, _PREC_AND)))
                    did_something = True
                elif next_tag is _or and _PREC_OR > min_precedence:
                    pstate.advance()
                    left_query = Or(
                            (left_query, inner_parse(pstate, _PREC_OR)))
                    did_something = True

            return left_query

        if isinstance(expr, MatchExpressionBase):
            return expr

        from pytools.lex import LexIterator, lex, InvalidTokenError
        try:
            pstate = LexIterator(
                [(tag, s, idx, matchobj)
                 for (tag, s, idx, matchobj) in lex(_LEX_TABLE, expr,
                     match_objects=True)
                 if tag is not _whitespace], expr)
        except InvalidTokenError as e:
            from loopy.diagnostic import LoopyError
>           raise LoopyError(
                    "invalid match expression: '{match_expr}' ({err_type}: {err_str})"
                    .format(
                        match_expr=expr,
                        err_type=type(e).__name__,
                        err_str=str(e)))
E           loopy.diagnostic.LoopyError: invalid match expression: 'mtf__cond' (InvalidTokenError: at index 0: ...mtf__cond...)

../loopy/loopy/match.py:403: LoopyError
inducer commented 2 years ago

I think this is unavoidable

I'm not sure I agree. I think we can expect the predicate to be an expression, and thus we can determine that there's a dependency on that iname in the predicates. Based on that, we just need to raise Unvectorizable somewhere in the target.

kaushikcfd commented 2 years ago

we just need to raise Unvectorizable somewhere in the target.

We did that in https://github.com/inducer/loopy/pull/617. The issue has to do more with the way predicates are emitted in the codegen pipeline. It needs to be taught about UnvectorizableErrors.

Did I drive this wrong?

I think there is a minor error there. I think it should be like:

from loopy.math import Id, Or
cinsn_ids = [cinsn.id
             for cinsn in kernel.instructions
             if (isinstance(cinsn, lp.CInstruction) and cinsn.predicates)]
cinsn_match = Or(tuple(Id(cinsn_id) for cinsn_id in cinsns_ids))
outer_inames = frozenset([shifted_iname+"_outer"])
kernel = lp.distribute_loops(kernel,
                             cinsn_match,
                             outer_inames=outer_inames)
inames_to_untag = [kernel.id_to_insn[cinsn_id].within_inames - outer_inames
                   for cinsn_id in cinsn_ids]
kernel = lp.untag_inames(kernel, inames_to_untag, VectorizeTag)
kernel = lp.tag_inames(kernel, {iname_to_untag: "unr"
                                for iname_to_untag in inames_to_untag})
sv2518 commented 2 years ago

Ah yes I got further now, but now I error with

../PyOP2/pyop2/global_kernel.py:502: in vectorise
    kernel = lp.distribute_loops(kernel,
../loopy/loopy/transform/iname.py:1179: in wrapper
    transformed_kernel = transformation_func(kernel, *args, **kwargs)
../loopy/loopy/transform/loop_distribution.py:400: in distribute_loops
    if not _is_loop_distribution_sound(kernel,
../loopy/loopy/transform/loop_distribution.py:133: in _is_loop_distribution_sound
    amaps_pred = [
../loopy/loopy/transform/loop_distribution.py:134: in <listcomp>
    process_amap(get_insn_access_map(kernel, insn_id, var))
../loopy/loopy/kernel/tools.py:2166: in get_insn_access_map
    indices = list(_IndexCollector(var)((insn.expression,
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:244: in __call__
    result = super().rec(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:151: in __call__
    return self.map_foreign(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:206: in map_foreign
    return self.map_tuple(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:359: in map_list
    return self.combine(self.rec(child, *args, **kwargs) for child in expr)
../loopy/loopy/kernel/tools.py:2127: in combine
    return reduce(operator.or_, values, frozenset())
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:359: in <genexpr>
    return self.combine(self.rec(child, *args, **kwargs) for child in expr)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:244: in __call__
    result = super().rec(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:153: in __call__
    return method(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:314: in map_quotient
    self.rec(expr.numerator, *args, **kwargs),
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:244: in __call__
    result = super().rec(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:153: in __call__
    return method(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:281: in map_call
    (self.rec(expr.function, *args, **kwargs),)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:244: in __call__
    result = super().rec(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:148: in __call__
    return self.handle_unsupported_expression(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <loopy.kernel.tools._IndexCollector object at 0x15075e700>
expr = ResolvedFunction(Variable('sqrt')), args = (), kwargs = {}

    def handle_unsupported_expression(self, expr, *args, **kwargs):
        """Mapper method that is invoked for
        :class:`pymbolic.primitives.Expression` subclasses for which a mapper
        method does not exist in this mapper.
        """

>       raise UnsupportedExpressionError(
                "{} cannot handle expressions of type {}".format(
                    type(self), type(expr)))
E       pymbolic.mapper.UnsupportedExpressionError: <class 'loopy.kernel.tools._IndexCollector'> cannot handle expressions of type <class 'loopy.symbolic.ResolvedFunction'>

../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:122: UnsupportedExpressionError

The CInstruction does not contain an sqrt function, but the instruction it depends on does.

kaushikcfd commented 2 years ago

The CInstruction does not contain an sqrt function, but the instruction it depends on does.

It needed a map_resolved_function. Pushed a fix to the branch.

sv2518 commented 2 years ago

Okay, cool, thanks! Now one step further I run into

../PyOP2/pyop2/global_kernel.py:502: in vectorise
    kernel = lp.distribute_loops(kernel,
../loopy/loopy/transform/iname.py:1179: in wrapper
    transformed_kernel = transformation_func(kernel, *args, **kwargs)
../loopy/loopy/transform/loop_distribution.py:400: in distribute_loops
    if not _is_loop_distribution_sound(kernel,
../loopy/loopy/transform/loop_distribution.py:138: in _is_loop_distribution_sound
    amaps_succ = [
../loopy/loopy/transform/loop_distribution.py:139: in <listcomp>
    process_amap(get_insn_access_map(kernel, insn_id, var))
../loopy/loopy/kernel/tools.py:2167: in get_insn_access_map
    indices = list(_IndexCollector(var)((insn.expression,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = CInstruction(groups=frozenset(), code='break;', read_variables=frozenset(), tags=frozenset(), within_inames_is_final=T...tf__atol_crit_id'}), no_sync_with=frozenset(), within_inames=frozenset({'mtf__i_c', 'n_shift_outer', 'n_shift_batch'}))
name = 'expression'

    def __getattr__(self, name):
        # This method is implemented to avoid pylint 'no-member' errors for
        # attribute access.
>       raise AttributeError(
                "'{}' object has no attribute '{}'".format(
                    self.__class__.__name__, name))
E       AttributeError: 'CInstruction' object has no attribute 'expression'

../../lib/python3.9/site-packages/pytools/__init__.py:449: AttributeError
kaushikcfd commented 2 years ago

Sorry, hadn't accounted for CInstruction. Pushed a fix that runs the following snippet as expected:

import loopy as lp
from loopy.symbolic import parse

tunit = lp.make_kernel(
    "{[i]: 0<=i<4}",
    ["<> tmp[i] = 0 {id=w_tmp}",
     lp.CInstruction(iname_exprs=("i", "i"),
                     code="break;",
                     predicates={parse("tmp[i] > n")},
                     read_variables={"i", "n"},
                     depends_on=frozenset({"w_tmp"}),
                     id="break",),
     lp.Assignment("out_callee",
                   "i",
                   depends_on=frozenset(["break"]))
     ],
    [lp.ValueArg("n", dtype="int32"), ...],
    name="circuit_breaker")
knl = tunit.default_entrypoint

knl = lp.tag_inames(knl, "i:vec")
knl = lp.distribute_loops(knl, "id:break", frozenset())
knl = lp.untag_inames(knl, "i_1", lp.VectorizeTag)
knl = lp.tag_inames(knl, "i_1:unr")

tunit = tunit.with_kernel(knl)
print(lp.generate_code_v2(knl).device_code())
sv2518 commented 2 years ago

Fixed as part of https://github.com/inducer/loopy/pull/557