Optimize keyword argument methods using call-target-specific arguments

wildmaples commented 3 years ago

Background

Keyword arguments are being used more often and we'd like to optimize it. There was a previous implementation that was reverted but we're trying to re-implement as according to the paper, Call-target-specific Method Arguments, but possibly a little more simply.

The specific method arguments we are targeting are the methods with keyword arguments without any rest arguments, which is 0.03% of dynamic method calls. However, that only includes internal/core method calls. So we ran a benchmark on storefront-renderer to see the percentage of method calls that can be optimized per request and it seems like it comes to a low percentage -- 0.56% of total dynamic method calls can be optimized in this case.

However, we agreed that this change is worth doing because it doesn't require too much effort, it's forward thinking, and it should be relatively easy to do. What do you think?

Implementation

The optimization boils down to saving on the allocation of a hash for the keyword arguments case. Typically, a hash would be created to store the key-value pairs and then passed to the callee to unpack, meaning the hash always escapes if the call is not inlined.

Our goal is to optimize that process by evaluating, flattening, and then packaging the arguments into the argument array that'll be passed to the callee. A marker will be added to inform the callee if the packed argument array is expandable/optimized. With that, we'll also need a map of the index for the location of each of the arguments in the array. When the callee receives the arguments, it'll unpack the arguments based on the index map.

So an argument hash won't have to be allocated per call. Additionally, we can polymorphic inline cache the method lookup based on the types of arguments passed to the callee.

eregon commented 3 years ago

Additionally, we can polymorphic inline cache the method lookup based on the types of arguments passed to the callee.

I think method lookup should remain independent of arguments. But the actual calling of a method can be different if the target method accepts keyword arguments.

There are multiple designs possible here. I don't really like the marker approach (extra checks typically) or approaches which need to handle the arguments being either packed or not (makes the AST & compiled code even larger), or approaches that only work if both the caller and callee have explicit keyword arguments (it should work if the caller uses foo(**h) too).

I think we should rather change the calling convention for methods accepting keyword arguments. That can be changed in CallInternalMethodNode/CallBlockNode (where we know which CallTarget we call), and the ReadKeyword*Argument*Node on the callee side. For instance if one has:

def foo(a, b, c, d:, e: 42, f: nil, **g)
end

foo(1, 2, 3, e: 10, d: 4)

We could pass in the Object[] frame argument array something like:

1, 2, 3, 4, 10, undefined, undefined/null

where the order is simply decided by the order of keyword arguments in the method (which we can record while translating the AST).

If an optional kwarg is not passed, we would pass a sentinel value, and compute in the callee (required for semantics).

For a missing required kwarg (d:), we would probably use a sentinel too, and raise the error in the callee to get the expected backtrace (which should include the callee).

For **kwrest, if no extra kwargs are passed it seems nice to use a sentinel too (e.g., undefined or null), so that the empty Hash is allocated in the callee, and there is no Hash allocated (in compiled code) if it doesn't escape from the callee.

This approach has the slight disadvantage of potentially emitting multiple times code to extract keywargs if the call site is polymorphic to multiple method accepting kwargs. I don't think it's a big deal.

This approach also works if the passed keyword arguments come from foo(1, 2, 3, **h), where it will then unpack that h in the caller, which seems ideal.

eregon commented 3 years ago

Note that because of Ruby semantics, if we see foo(a: 1, b: 2) it doesn't mean foo accepts keyword arguments, it could be def foo(h) or def foo(**h) or def foo(a:, b:) (or even def foo(a:, **kw)). If we do this by looking at the callee CallTarget, it's no trouble and we can do the right thing (create a Hash for the first 2 cases, pass only the values for the 3rd case, a mix for the 4th case).

In Ruby 3 it becomes possible to differentiate the first 2 cases, so we'll need to think about some way to do that. Probably if we already have a special calling convention for keyword arguments we can use that to know/mark if a given Hash was positional or keywords.

eregon commented 3 years ago

I think the approach I suggest is quite similar to the paper (but a lot of details were different back then, most notably the dispatch chain was done manually without the DSL and did lookup+call together, while now it's lookup, then call which is more flexible), with the difference of avoiding the Pitfalls and Implementation Details which is about handling the arguments being either packed or not and something that seems best to avoid, so the callee always receives arguments packed as planned and does not need to handle receiving them as a single Hash.

BTW implementing this feature should also help the current problem of having multiple ReadUserKeywordsHashNode in the callee, which duplicates logic and can result in multiple to_hash call if kwargs are not already a Hash (we'd handle that in the caller, and call to_hash at most once).

chrisseaton commented 3 years ago

99% of dynamic calls with keyword arguments do not have any kind of splatting - we just want something simpler to catch that don't we?

And yes ask the caller what it accepts.

Yes it would add one extra check on the proposed token, but that would be in most cases (we could count) a guard-deoptimise.

eregon commented 3 years ago

99% of dynamic calls with keyword arguments do not have any kind of splatting

I wouldn't be sure, **kwrest is fairly frequent for delegation. And in that case, for the bar() call in def foo(a, **kw); bar(a, **kw); end it seems quite messy (complexity and maintainability-wise) if we need to fallback to some logic that reads from the Hash in the callee.

chrisseaton commented 3 years ago

I wouldn't be sure

It was a fact not a guess :)

eregon commented 3 years ago

In any case I think there needs to be a single way to handle keyword arguments on the callee side. Otherwise we'll have 2 pieces of code out of sync, with a lot of complexity and that's going to be error-prone. IIRC, that was a big reason why we had to revert the original optimization, because it was too hard to keep it working + fix the semantics and have both be compatible semantically.

Kind of a lessons learned from the recent strftime optimization, if there is an optimized path it needs to always trigger for a given kind of input, not whether the caller used **kw or a: b.

chrisseaton commented 3 years ago

Either way, a first step is to pass keywords as positionals, based on an agreed order. We could do that step and then expand, I think?

wildmaples commented 3 years ago

Last time we spoke, we agreed to come up with a prototype for this issue and iterate over it, since MRI also has a similar optimization done. I'm currently working on it.

eregon commented 3 years ago

FYI @norswap and I are working on a refactoring of Hash strategies. That might bring some conflicts. I suggest to avoid any new code touching Hash internals if possible, if it's only existing nodes then it should be relatively easy to migrate.

eregon commented 3 years ago

FYI, that refactoring was merged in https://github.com/oracle/truffleruby/commit/6e7a9b038134309ba3b1aab2550ab6e8828ce3e7

chrisseaton commented 3 years ago

Here's a new sketch idea I've discussed with @eregon and @wildmaples.

caller always sends keyword argument values as flat array of values in the arguments array - [a0, a1, ..., v0, v1, ...], where a are the positional arguments, and v are just the values of the keyword arguments
caller also passes a description of the keywords [k0, k1, ...], where k is just the keywords of the keyword arguments - this instance is created statically and pooled globally so it can be compared by reference
keyword arguments are packed simply in the order they appear in the code
caller passes splat as normal
this isn't done per-call-target - this is the calling convention for all calls with keyword arguments, cached or uncached, monomorphic or megamorphic

So the caller always gets

foo(a0, a1, ..., v0, v1, ..., [k0, k1, ...])

callee always has responsibility to unpack this
if it wants a real hash, it can construct it - benefit in this case is this hash is now not always escaped
if it wants just the values, it can pick them out of the array
how does it pick them out when they're not in a defined order?
well we get that descriptor, which we can specialise for!

So the callee looks like

def foo(*args, descriptor)
  deopt if descriptor != expected
  k0 = args[x]
  k1 = args[y]
  ...
end

Or maybe for different call sites it's a bit polymorphic - not a big problem

def foo(*args, descriptor)
  if descriptor == d0
    k0 = args[x0]
    k1 = args[y0]
  elsif descriptor == d1
    k0 = args[x1]
    k1 = args[y1]
  else
    deopt
  end
  ...
end

Or if the callee wanted a hash

def foo(*args, descriptor)
  kwargs = {descriptor[0] => args[x], descriptor[1] => args[y], **splat} # hash need not escape!
  ...
end

Why is this so good?

much simpler on caller side
caller doesn't need any info on callee
if a hash is really needed it doesn't need to escape the caller
we're good at inline caching for things like the descriptor already
should be better even for inlined cases as we don't have to PE through the hash
doesn't rely on PE so should be better for interpreter as well

Examples

Positional and keywords

def foo(x, y, a:, b:)
  ...
end

foo(1, 2, a: 3, b: 4)

translated to

def foo(*args, descriptor)
  x = args[0]
  y = args[1]
  deopt unless descriptor == [:a, :b]
  a = args[2] # descriptor.indexof(:a) but as a constant
  b = args[3] # descriptor.indexof(:b) but as a constant
end

foo(*[1, 2, 3, 4], [:a, :b])

Just keywords

def foo(a:, b:)
  ...
end

foo(a: 3, b: 4)

translated to

def foo(*args, descriptor)
  deopt unless descriptor == [:a, :b]
  a = args[2] # descriptor.indexof(:a) but as a constant
  b = args[3] # descriptor.indexof(:b) but as a constant
end

foo(*[3, 4], [:a, :b])

None

def foo()
  ...
end

foo()

translated to

def foo(*args, descriptor)
  deopt unless descriptor == nil
end

foo(*[], nil)

Just positional

def foo(x, y)
  ...
end

foo(1, 2)

translated to

def foo(*args, descriptor)
  deopt unless descriptor == nil
  x = args[0]
  y = args[1]
end

foo(*[1, 2], nil)

eregon commented 3 years ago

That sounds great!

What happens for foo(**h)? Also use the descriptor in that case since there are keywords arguments on the caller side?

I think there is an issue if we want to do this in 2.7 and not in 3.0, because in 2.7 it's possible (although it warns with -w) to call a method accepting keyword arguments without using the keyword arguments syntax at the call site. In 3.0 it's an error so we don't need to care about that case.

def foo(kw: nil)
  kw
end

h = {kw: 42}
p foo(h)

2.7:
-:7: warning: Using the last argument as keyword parameters is deprecated; maybe ** should be added to the call
-:2: warning: The called method `foo' is defined here
42

3.0:
-:2:in `foo': wrong number of arguments (given 1, expected 0) (ArgumentError)
    from -:7:in `<main>'

eregon commented 3 years ago

There is also the inverse case of:

def foo(h)
  h
end
p foo(a: 1)

It's unclear in such a case if it's beneficial to use the descriptor. I guess at least it's not incorrect to use it as it's a new Hash created, and for foo(**h) it would copy anyway.

We could easily check if the callee does not accept keyword arguments and in that case pass no descriptor and everything as positional.

eregon commented 3 years ago

Another complication is since Ruby 2.7 any key is allowed in a keyword hash, not just Symbols. So foo(**h) might have non-Symbols keys, and the descriptor would need to handle that. It makes comparison less easy, and is an issue about leaking keys inside descriptors globally.

https://github.com/oracle/truffleruby/issues/2388#issuecomment-867901660 avoids this issue because it only passes kwargs positionally if they are explicit kwargs (k:, not **kw) in the callee, and explicit kwargs are always Symbol keys.

chrisseaton commented 3 years ago

What happens for foo(**h)?

def foo(x, y, a:, b:)
  ...
end

foo(1, 2, a: 3, **etc)

translated to

def foo(*args, descriptor)
  x = args[0]
  y = args[1]
  deopt unless descriptor == [:a]
  a = args[2] # descriptor.indexof(:a) but as a constant
  b = args.kwhash[:b] # we know b isn't in the arguments because it's not in this descriptor
  # just need to be careful about which takes precendece - a b passed in arguments or a b in the splat
  ...
end

foo(*[1, 2, etc, 3], [:a])

args.kwhash is usual logic for finding the keyword argument hash.

a method accepting keyword arguments without using the keyword arguments syntax at the call site

This is fine - because the descriptor will say the keyword wasn't passed explicitly, so look it up as normal.

def foo(kw: nil)
  kw
end

h = {kw: 42}
p foo(h)

translated to

def foo(*args, descriptor)
  deopt unless descriptor == nil
  kw = args.kwhash[:kw]
end

h = {kw: 42}
p foo([h], nil)

There is also the inverse case of

def foo(h)
  h
end
p foo(a: 1)

translated to

def foo(*args, descriptor)
  deopt unless descriptor == [:a]
  h = {descriptor[0] => args[0]}
end

foo(*[], [:a])

Another complication is since Ruby 2.7 any key is allowed in a keyword hash

If it's already in a hash, we don't take it out.

def foo(**kwargs)
  kwargs
end
foo({'foo' => 14})

translated to

def foo(*args, descriptor)
  deopt unless descriptor == nil
  args.kwhash
end

foo(*[{'foo' => 14}], niil)

eregon commented 3 years ago

args.kwhash is usual logic for finding the keyword argument hash.

So that means we would need to have to maintain both paths then when reading kwargs, reading from packed arguments or from a Hash. In other words, only literal kwargs use the optimization. Seems not completely ideal, but it does simplify things quite a bit in that design.

To emit code like the it's a bit polymorphic case above we basically need to rework the AST a bit so ReadKeywordArgumentNode are not just in sequence each with a ReadUserKeywordsHashNode (like currently), but there is some overarching node say HandleKeywordArgumentsNode or so, which then has multiple ReadKeywordArgumentNode as children,and passes them the keyword hash. That way there is only one ReadUserKeywordsHashNode per method (under the HandleKeywordArgumentsNode), which is a nice gain. That improvement can be done as a first step independent of what we do later, so I would suggest to start with that.

Visually:

* ReadKeywordArgumentNode(:a)
  * ReadUserKeywordsHashNode
* ReadKeywordArgumentNode(:b)
  * ReadUserKeywordsHashNode

becomes

* HandleKeywordArgumentsNode
  * ReadUserKeywordsHashNode
  * ReadKeywordArgumentNode(:a)
  * ReadKeywordArgumentNode(:b)

tomstuart commented 3 years ago

This sounds great! My mental model of the tradeoffs isn’t as clear as it should be, so I could use some help developing my intuition:

Am I correct that the main advantage of the call-target-agnostic approach is that it’s simpler (potentially much simpler) to implement on the caller side, rather than necessarily faster than the call-target-specific approach? Naively it feels like the former must do some of the work twice — the polymorphic inline cache has to decide which callee we’re using, and then the callee has to check which descriptor it’s been given by the caller — whereas the idea of the call-target-specific solution is to do the work once as part of the PIC’s unavoidable callee decision so that the callee can avoid dealing with different argument formats entirely.

Is my understanding correct? Would we expect any overhead of the agnostic approach to effectively disappear as part of PE or be ameliorated by some other consideration, or are we saying that its simplicity is in itself the reason to prefer it?

eregon commented 3 years ago

It's very good points. I think I don't know of all trade offs yet either.

Yes, the call-target-agnostic approach means the callee needs to handle either the packed kwargs, or extract from a Hash, and the packed kwargs might be in different order for different callers. The logic for extracting from a Hash feels pretty messy (+ duplicate logic to unpack from packed kwargs), so from that point of view it is not ideal. If that strategy is used regardless of the callee (e.g., even if the callee accepts no keyword arguments), then it can be done only once per call site.

The call-target-specific approach would indeed likely be more efficient for the callee, as the callee could always assume explicit keyword arguments (def m(kw:)) are basically the same as positional arguments (reading kw is reading at a constant index, that's it). A trade-off is for polymorphic call sites (multiple callee CallTargets) we would have multiple times the logic to pack keyword arguments (but polymorphic call sites are rather rare, seems minor). And another is for megamorphic call sites we can't know what keywords the callee accepts at PE time, and we will have to prepare the arguments in an uncached way for that case. Might be relatively easy with uncached nodes and now having a HashStoreLibrary, but would need to see in practice.

chrisseaton commented 3 years ago

This is what MRI is doing at the moment.

https://github.com/ruby/ruby/commit/79ddbe9deebc5af0f08e52cd056f58b49e486ea6

Note that I think rb_iseq_only_kwparam_p is misleading - it means only keyword arguments and zero or more leading arguments.

chrisseaton commented 3 years ago

MRI does pass literal keyword arguments flatly, which is what we're proposing, but then it looks them up by a linear search each time, so it's like our packed hash strategy.

args_setup_kw_parameters_lookup(const ID key, VALUE *ptr, const VALUE *const passed_keywords, VALUE *passed_values, const int passed_keyword_len)
{
    int i;
    const VALUE keyname = ID2SYM(key);

    for (i=0; i<passed_keyword_len; i++) {
    if (keyname == passed_keywords[i]) {
        *ptr = passed_values[i];
        passed_values[i] = Qundef;
        return TRUE;
    }
    }

    return FALSE;
}

But there's no descriptor of the arguments like we're proposing to be able to compile this loop away, and no specialisation. The callee does get the caller site info, and uses this to interpret the flat array of keyword argument values.

I'll keep looking into what MRI does in more depth and start to document it more.

chrisseaton commented 3 years ago

JRuby always passes a full hash, and extracts keyword argument values like this:

    private static IRubyObject toHash(ThreadContext context, IRubyObject lastArg) {
        if (lastArg instanceof RubyHash) return (RubyHash) lastArg;
        if (lastArg.respondsTo("to_hash")) {
            lastArg = lastArg.callMethod(context, "to_hash");
            if (lastArg == context.nil) return lastArg;
            TypeConverter.checkType(context, lastArg, context.runtime.getHash());
            return (RubyHash) lastArg;
        }
        return null;
    }

    public static RubyHash extractKwargsHash(ThreadContext context, Object[] args, int requiredArgsCount, boolean receivesKwargs) {
        if (!receivesKwargs) return null;
        if (args.length <= requiredArgsCount) return null; // No kwarg because required args slurp them up.

        Object lastArg = args[args.length - 1];

        if (lastArg instanceof IRubyObject) {
            IRubyObject returnValue = toHash(context, (IRubyObject) lastArg);
            if (returnValue instanceof RubyHash) return (RubyHash) returnValue;
        }

        return null;
    }

    public static IRubyObject receiveKeywordArg(ThreadContext context, IRubyObject[] args, int required, RubySymbol key, boolean acceptsKeywordArgument) {
        RubyHash keywordArguments = extractKwargsHash(context, args, required, acceptsKeywordArgument);

        if (keywordArguments == null) return UndefinedValue.UNDEFINED;

        if (keywordArguments.fastARef(key) == null) return UndefinedValue.UNDEFINED;

        // SSS FIXME: Can we use an internal delete here?
        // Enebo FIXME: Delete seems wrong if we are doing this for duplication purposes.
        return keywordArguments.delete(context, key, Block.NULL_BLOCK);
    }

Notice the mutation of lastArg on the conversion, to fast-path it the next time.

eregon commented 3 years ago

Here are some more discussion points (discussed with Chris).

CallTarget-specific might be more complex in terms of logic, because mix-and-matching caller kwargs with callee kwargs
But CallTarget-agnostic needs to handle unpacking either from the Shape-like kwargs or a full Hash in the callee.
CallTarget-specific leads to more generated code, due to having more call sites than methods
CallTarget-specific is slower for megamorphic calls, because the CallTarget is a dynamic value then, but megamorphic calls accepting kwargs are very rare.
Not-passed kwargs would need to be checked individually in the callee for CallTarget-specific, but can be checked with one shape check with CallTarget-agnostic
CallTarget-agnostic can lead to having duplicated logic for e.g. [:a, :b] and [:b, :a] when callers pass the same kwargs but in different order, which is not so unlikely. CallTarget-specific avoids that in the callee (the callee is very simple), but always has duplicate generated code in callers (which can optimize differently based on the call site)
For delegation with def m(*a, **kw); target(*a, **kw); end (in Ruby 3+), CallTarget-agnostic would build a full Hash in m, and pass it to target. CallTarget-specific would build the Hash in the caller of m, and then unpack to pass it to target, which is not great if target is quite polymorphic (e.g., via send and when not split).
In terms of warmup it might be suboptimal that the callee needs to be recompiled with CallTarget-agnostic if a new caller passes kwargs in a different order, but that's probably rare. CallTarget-specific would only recompile the callee if an kwarg was always passed before and is not, or vice-versa.
def m(a: nil, b: nil, **kwrest); end can be complicated withm(a: 1, **h) and CallTarget-agnostic because :b needs to be looked up in kwrest and not just in the descriptor. Also if h is {a: 2} it needs to override the value of a in the callee; probably best to simplify by ensuring either there is no overlap between packed kwargs and the Hash in the caller, or in the callee which then checks both but the Hash first; or always merge to the Hash and pass an empty descriptor for that case.

eregon commented 3 years ago

@wildmaples @chrisseaton Since it seems you prefer the CallTarget-agnostic approach and you're doing the work I say go for it, I think both approaches are good and have their own merits.

Representation-wise, it might be interesting to use a DynamicObject + Shape, that would avoid boxing primitive values. That means extra nodes (using DynamicObjectLibrary) for writing/reading though. Could also be flat in the Object[] frame arguments and having the descriptor as just a mapping of names to indices, which can probably use something similar to SharedIndicesMap, except it would only be mutated when creating the map/shape and never later, and globally pooled. Leaking those (per RubyLanguage instance) is probably not a big concern since it's bound by literal kwargs in call sites in the source code.

oracle / truffleruby