p4lang / p4c

P4_16 reference compiler
https://p4.org/
Apache License 2.0
641 stars 428 forks source link

Do not recalculate refmap inside inliner so often #4723

Open asl opened 2 weeks ago

asl commented 2 weeks ago

We are seeing ResolveReferences taking 25% out of 90 second compile time of large p4 app. The issue here is inliner as it clears and recalculates refmap of the whole program after each inlining step and here we are having lots of inlining happening.

Certainly, properly updating refmap might be a bit tricky, but still something should be done :)

fruffy commented 2 weeks ago

That seems like a crazy statistic in general. Either the inliner calls ResolveReferences hundreds of times or the pass is just slow.

asl commented 2 weeks ago

That seems like a crazy statistic in general. Either the inliner calls ResolveReferences hundreds of times or the pass is just slow.

It former. Here are some stats: overall ResolveReferences was called 175 times. Each time is approx 120-150 ms. This sums to 22 seconds total time, yes.

asl commented 2 weeks ago

One of factors that contribute is quite severe, I'd say. Essentially, portion of all the time is spent in ISimpleNamespace::getDeclByName(). Which in turn is just IndexedVector<T>::getDeclaration. IndexedVector<T> stores declaration name in ordered_map<cstring, const IDeclaration *>. The latter is essentially the pair of:

Comparator here essentially dereferences cstring* and calls std::less<cstring>(a, b). The latter is:

    bool operator<(cstring a) const { return *this < a.str; }
    bool operator<(const char *a) const { return str ? a && strcmp(str, a) < 0 : !!a; }

Do you see this strcmp call? So, yes, you are reading it correctly: in all ordered_map<cstring, foo>corresponding keys are ordered lexicographically.

This corresponds to maybe 10% of ResolveReferences runtime. Interesting enough, everything else is just overhead from topdown IR visiting :)

asl commented 2 weeks ago

It seems that ActionsInliner does some updates of ReferenceMap (using SubstituteParameters), but it seems it is not enough.

@mihaibudiu @ChrisDodd Do you remember any details here?

mihaibudiu commented 2 weeks ago

I am not sure what "is not enough" in the above statement means.

@ChrisDodd has expressed interest to remove the ResolveReferences pass for a long time, and he did some work in this direction, but it was never completed. But I am not sure that his proposed changes will improve the runtime.

This is just how the compiler was architected: the types and references are recomputed every time the IR changes. We never imagined that this would become a bottleneck.

asl commented 2 weeks ago

I am not sure what "is not enough" in the above statement means.

Well, here is the comment in substituteParameters.h:

    // When a PathExpression is cloned, it is added to the RefMap.
    // It is set to point to the same declaration as the original path.
    // But running this pass may change some declaration nodes - so
    // in general the refMap won't be up-to-date at the end.

And indeed, not recalculating refMap leads to some declaration not found assertions. I have not checked yet what they are and what they correspond to.

ChrisDodd commented 1 week ago

You can now look up references in the current context of any visitor, by having tbe visitor inherit from P4::ResolutionContext and then calling any of the various resolve methods in that class to look up the declaration for a name. This makes the refMap unnecessary.

The memoization that now happens in these resolve methods should make repeatedly resolving the same names quite fast,

fruffy commented 1 week ago

You can now look up references in the current context of any visitor, by having tbe visitor inherit from P4::ResolutionContext and then calling any of the various resolve methods in that class to look up the declaration for a name. This makes the refMap unnecessary.

The memoization that now happens in these resolve methods should make repeatedly resolving the same names quite fast,

That's very useful. Any particular requirements for the P4 program that P4::ResolutionContext has?

asl commented 1 week ago

You can now look up references in the current context of any visitor, by having tbe visitor inherit from P4::ResolutionContext and then calling any of the various resolve methods in that class to look up the declaration for a name. This makes the refMap unnecessary.

Sadly, everything is entangled in quite messy mass. refMap is also used for unique name generation. So, for example, MethodInstance::resolve does not require full refMap, but it may resort to TypeInference which requires "full" refMap (so, both DeclarationLookup and NameGenerator). It just happens that mid-end use-def only calls MethodInstance::resolve with null typeMap, so things are working fine for it :) TypeInference could certainly be made a ResolutionContext, but NameGenerator would still be required.

So, looks like we'd need in some cases to still use refMap for name generation. At least for the time being.

ChrisDodd commented 1 week ago

So, looks like we'd need in some cases to still use refMap for name generation.

For name generation, you can use the MinimalNameGenerator pass/object instead of the refmap -- tracking unique names is really independent of contextual name lookup, so a separate class seems better. That would requirie refactoring TypeInference to take both a DeclarationLookup and a NameGenerator (when using the RefMap they could be the same). I have an old refactoring of type inferencing sitting a a branch I had been working on to remove the refmap from type inferencing, but I never got it completely working. (https://github.com/ChrisDodd/p4c/tree/typecheck-remove-refmap)

asl commented 1 week ago

So, looks like we'd need in some cases to still use refMap for name generation.

For name generation, you can use the MinimalNameGenerator pass/object instead of the refmap -- tracking unique names is really independent of contextual name lookup, so a separate class seems better. That would requirie refactoring TypeInference to take both a DeclarationLookup and a NameGenerator (when using the RefMap they could be the same).

Yes, this is exactly what I'm playing with now :)

asl commented 1 week ago

but I never got it completely working.

One issue that I already found is that ResolveReferences has implicit and very subtle assumption to be mixed into Inspector. TypeInference is a Transform that even in read-only mode does useless cloning of the nodes (this certainly needs to be fixed eventually). As a result, we are calling getDeclaration on cloned nodes and shallow node comparisons inside lookup fail. The solution is to check in the original context, not the cloned one.

Currently looking over remaining issues here and there.

asl commented 1 week ago

Some very preliminary results from downstream test. Before:

________________________________________________________
Executed in   87.54 secs    fish           external
   usr time  134.23 secs    0.26 millis  134.23 secs
   sys time    2.08 secs    3.87 millis    2.08 secs

After:

________________________________________________________
Executed in   69.84 secs    fish           external
   usr time  101.63 secs    0.06 millis  101.63 secs
   sys time    2.64 secs    2.17 millis    2.63 secs
fruffy commented 1 week ago

Some very preliminary results from downstream test. Before:

________________________________________________________
Executed in   87.54 secs    fish           external
   usr time  134.23 secs    0.26 millis  134.23 secs
   sys time    2.08 secs    3.87 millis    2.08 secs

After:

________________________________________________________
Executed in   69.84 secs    fish           external
   usr time  101.63 secs    0.06 millis  101.63 secs
   sys time    2.64 secs    2.17 millis    2.63 secs

Are these gains specifically because of inliner optimizations or changes to the refMap/getDeclByName?

asl commented 1 week ago

Are these gains specifically because of inliner optimizations or changes to the refMap/getDeclByName?

The whole idea is to sunset refMap entirely. The speedup above is obtained by switching to ResolutionContext inside inliner, so declarations are looked up when necessary (without recalculation of path => declaration map for all paths in the program which is certainly huge overkill even for small programs). This removed ~175 recalculation of refMap via ResolveReferences.

For the record, this particular app used to take ~450 seconds to compile back in December 2023.

asl commented 6 days ago

https://github.com/p4lang/p4c/pull/4757 is a working PR accumulating required changes