timholy / SnoopCompile.jl

Provide insights about latency (TTFX) for Julia packages
https://timholy.github.io/SnoopCompile.jl/dev/

[question] Multiple calls to `@snoopi_deep` #222

Closed tlienart closed 3 years ago

tlienart commented 3 years ago

Hello,

I was trying out a few things in a project to reduce the number of inference triggers when I realised that if I ran `@snoopi_deep` right after restarting Julia, I got a huge number of triggers, whereas on subsequent runs I only got 4, all from Base. Which result should I consider?

Also is it possible to filter the inference triggers by module?

Thanks!

timholy commented 3 years ago

Sorry to be slow to respond, #225 has been taking a lot of time.

You generally want to run @snoopi_deep in a fresh session. Once you infer a method, you generally don't need to reinfer it. (Redefining the method of course forces it to be reinferred.)
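In practice that looks something like this minimal sketch (`MyPkg` and its workload are placeholders for whatever you want to measure):

```julia
# Run in a *fresh* Julia session, so nothing relevant has been inferred yet
using SnoopCompile

tinf = @snoopi_deep begin
    using MyPkg          # hypothetical package
    MyPkg.workload()     # hypothetical representative workload
end

itrigs = inference_triggers(tinf)   # the inference triggers discussed below
```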

As far as filtering goes, for @snoopr-measured invalidations, you can use filtermod. That only filters on the root; we might want something that "hits" anything in a given module?
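For reference, that workflow looks roughly like this (a sketch; `MyPkg` is a placeholder):

```julia
using SnoopCompile

# Snoop the method invalidations caused by loading the package
invalidations = @snoopr using MyPkg          # hypothetical package
trees = invalidation_trees(invalidations)    # organize raw entries into trees
ftrees = filtermod(MyPkg, trees)             # keep only trees whose root is in MyPkg
```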

For `@snoopi_deep` inference triggers, you should be able to do something like

```julia
filter(itrig -> itrig.callerframes[end].linfo.def.module === Franklin, itrigs)
```

That should probably become `filtermod` on a list of itrigs.

tlienart commented 3 years ago

Ah crap, then I have 160 of those :/ (this is for another, much smaller module, which will eventually replace Franklin's parser: FranklinParser).

I think once I get the hang of reducing those, I might try to make a PR to extend the docs :)

Thanks again Tim!

timholy commented 3 years ago

It's pretty in flux right now. `suggest` can triage them to help eliminate ones you don't need to address, or that might need to be addressed elsewhere (e.g., in Julia itself). For example, my investigations led me to tackle the concatenation pipeline in a few Julia PRs, and the net effect is an improvement in both runtime performance and precompilability.

I will play with FranklinParser at some point and see if I can help provide some context and/or additional tools. I recognize the complexity of these tools and am working to make them accessible to a wide audience, and you're very much in that target audience. So complain if it's unclear or too much work and I'll see what I can do to decrease the effort.

For now, the invalidations should be a much narrower set than all inference failures. In the long run it may be easier to identify problems from the inference failures, but as you say, right now the list of inference failures can be disturbingly long.

timholy commented 3 years ago

The main thing I'd recommend is to try the profile-guided despecialization GUI, if you haven't yet: https://timholy.github.io/SnoopCompile.jl/dev/pgdsgui/. I find it's a nice way to get a sense for the latency cost of design decisions.
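Roughly, the workflow is as follows (a sketch; the workload is a placeholder, and you need both inference and profile data for the same workload):

```julia
using SnoopCompile

tinf = @snoopi_deep MyPkg.workload()   # hypothetical workload: collect inference data

using Profile
@profile MyPkg.workload()              # collect runtime profile data

import PyPlot                          # pgdsgui plots via PyPlot
mref, ax = pgdsgui(tinf)               # interactive inference-time vs runtime scatter
```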

tlienart commented 3 years ago

Thanks a lot Tim! One very clear benefit of all this is that it made me think "wait, does the compiler know what that thing is, or is it only known at runtime?" While this is painful in some cases, I think overall it might push me and others to just do things better (e.g. I stumbled upon the section in the Julia docs warning against abusing the dispatch system, something I was certainly doing).

I'll keep trying and will provide specific feedback once I'm a bit more comfortable with these tools!

timholy commented 3 years ago

Very good. You don't have to optimize everything; if "abusing dispatch" makes your code cleaner, and you don't really care about the runtime or latency hit, there's no problem with it. But if you are trying to make lean, low-latency, performant code, then yes, these things matter.

tlienart commented 3 years ago

> But if you are trying to make lean, low-latency, performant code, then yes, these things matter.

Yes, well, ultimately I'd like the time to first serve in Franklin to be as low as possible, so if there's low-hanging fruit to pick (and if it makes me a better Julia programmer along the way), I'll do it!

timholy commented 3 years ago

:+1: I see you still have a lot of invalidations. Fixing those is probably the highest payoff. To make your life easier I just pushed #226. I'll post a demo in https://github.com/tlienart/Franklin.jl/issues/752.

tlienart commented 3 years ago

Thanks so much!

tlienart commented 3 years ago

I went from 160 to 9 inference triggers in FranklinParser 😅. The remaining ones are a bit hard to track down, but the situation is likely significantly better than it was before! Thanks again, this is an amazing package.

timholy commented 3 years ago

Woot! That's awesome news. Great work!

SnoopCompile 2.5.0 just came out a couple of moments ago. It has quite a few changes to the inference triggers analysis code. In particular, for https://github.com/timholy/SnoopCompile.jl/issues/222#issuecomment-762891137 the new aggregation of triggers by method, plus the ability to filter, may help reduce the daunting number.
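Roughly, the new aggregation and filtering look like this (a sketch; `MyPkg` and its workload are placeholders, and it assumes `filtermod` now accepts a list of itrigs as mentioned above):

```julia
using SnoopCompile

tinf   = @snoopi_deep MyPkg.workload()               # hypothetical workload
itrigs = inference_triggers(tinf)
itrigs_mypkg = filtermod(MyPkg, itrigs)              # only triggers involving MyPkg
mtrigs = accumulate_by_source(Method, itrigs_mypkg)  # aggregate triggers by method
```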

tlienart commented 3 years ago

It's now down to zero 😅 and I think the code is all the better for it.

One small note: to "remove" the final one, I had to inline a function that I use to create a `const` dictionary used elsewhere, something like:

```julia
const RULES = LittleDict{...}(
  key1 => method(params1),
  key2 => method(params2),
  ...
)
```

The suggestion from SnoopCompile was to try precompiling the function. It's not immediately clear to me why the inlining worked, but maybe that could be an extra example in the docs?

timholy commented 3 years ago

That sounds good. Can you point me somewhere so I can understand the changes in greater detail?

tlienart commented 3 years ago

https://github.com/tlienart/FranklinParser.jl/tree/validator

Some of the changes are pretty major (the parser previously ran on "oracle functions", each encapsulating a condition that a section of string had to meet; now this is done with concrete objects).

This is the inlined function: https://github.com/tlienart/FranklinParser.jl/blob/07d2ecd8de9ebc0b135d586fbf8d2da03c5695a7/src/tokens/utils.jl#L102-L117

The function `forward_match` is only used as a way to define rules, like here: https://github.com/tlienart/FranklinParser.jl/blob/07d2ecd8de9ebc0b135d586fbf8d2da03c5695a7/src/tokens/markdown_tokens.jl#L8-L18

This is the function I ended up significantly refactoring in view of the output from SnoopCompile: https://github.com/tlienart/FranklinParser.jl/blob/validator/src/tokens/find_tokens.jl

If you compare it with https://github.com/tlienart/FranklinParser.jl/blob/main/src/tokens/find_tokens.jl you'll see some, probably obvious, differences.

I also got rid of a few `NTuple{N, Char} where N`, replacing them with `Vector{Char}`; it looks like this is better inference-wise?

timholy commented 3 years ago

Thanks!

So I presume you're referring to this:

```julia
julia> itrig = ftrigs[1]
Inference triggered to call MethodInstance for forward_match(::String, ::Vector{Char}, ::Bool) from forward_match (/home/tim/.julia/dev/FranklinParser/src/tokens/utils.jl:108) inlined into MethodInstance for forward_match(::String) (/home/tim/.julia/dev/FranklinParser/src/tokens/utils.jl:108)

julia> suggest(itrig)
/home/tim/.julia/dev/FranklinParser/src/tokens/utils.jl:108: invoked callee (forward_match(refstring::String) at utils.jl:108 may fail to precompile)
```

(I removed the `@inline`.) The bottom line is that I think invoked callee triggers probably reflect a bug in Julia: I've noticed that methods that supply default arguments sometimes fail to precompile. xref https://github.com/JuliaLang/julia/issues/38983.

It's quite interesting that inlining circumvents this problem, but I'm not sure I want to recommend that in general: forced inlining can also increase latency, since it prevents you from just exploiting already-compiled code. Some methods that don't precompile are so huge that forced inlining would be a bad idea.
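Schematically, the method in question looks something like this (a sketch with a placeholder body; the signature is taken from the trigger above):

```julia
# With default arguments, Julia generates a one-argument "wrapper" method that
# calls this full three-argument method; it's the wrapper that may fail to
# precompile. Forcing inlining of the full method into the wrapper (as tlienart
# did) removed the trigger, with the caveats discussed above.
@inline function forward_match(refstring::String,
                               extra::Vector{Char} = Char[],
                               strict::Bool = false)
    return startswith(refstring, "x")   # placeholder body
end
```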

On balance, I think we just need to fix this bug. Fortunately, in your case the cost is cheap:

```julia
julia> itrig.node
InferenceTimingNode: 0.001454/0.002313 on InferenceFrameInfo for FranklinParser.forward_match(::String, ::Vector{Char}, ::Bool) with 5 direct children
```

You probably won't really notice that 2.3ms. There are circumstances where this is more problematic (e.g., https://github.com/JuliaLang/julia/issues/38983#issuecomment-757143042, as well as that JuliaInterpreter.step_expr! method).

> the function I ended up significantly refactoring

Yes, that looks like a good move from an inferrability/latency perspective. There are cases where it's valid to do it the way you were doing it previously, but if that λ was getting called many times on short jobs, I suspect your change may also improve runtime performance. If you find that's not true (regex parsing can be slow), and if runtime performance matters more than latency, then of course you could go back to the λ approach. If FranklinParser defines all the functions that will ever be called from find_tokens, you can add precompile directives to at least make sure they're all ready to go. (You only get an inference trigger if it's not already inferred.) You'd still have to pay the cost of runtime dispatch, but that sometimes may be smaller than running a regex parser. (And sometimes, not.)
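Concretely, such directives might look like this (a sketch; the callee names and bodies are hypothetical, and real signatures would come from the trigger list):

```julia
# Hypothetical callees dispatched to at runtime from find_tokens:
handle_emphasis(s::AbstractString) = length(s)   # placeholder body
handle_link(s::AbstractString)     = length(s)   # placeholder body

# At module top level: infer and cache these signatures at precompile time,
# so the runtime dispatch doesn't also pay an inference cost on first use.
precompile(handle_emphasis, (String,))
precompile(handle_link, (String,))
```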

> I also got rid of a few NTuple{N, Char} where N to replace by Vector{Char}, looks like inference wise this is better?

If N is in a small list, the former would be OK; e.g., making it `Union{Tuple{Char}, Tuple{Char,Char}}` rather than `NTuple{N,Char}` would probably give you better runtime performance than `Vector{Char}`, at no extra cost in latency. But if N is not bounded, or even if it ranges from, say, 1-10, you're probably better off with the `Vector{Char}`. If you really have a performance bottleneck and you want to use the tuple, a "manual union splitting" block like

```julia
tf = if set isa NTuple{1,Char}
    c ∈ set
elseif set isa NTuple{2,Char}
    c ∈ set
elseif set isa NTuple{3,Char}
    c ∈ set
elseif set isa NTuple{4,Char}
    c ∈ set
else
    c ∈ set   # fallback: runtime dispatch for any other length
end
```

can circumvent the inference problem from using `NTuple{N,Char}`. As long as the types listed after the `isa` are concrete, the type-check is just a pointer comparison (i.e., very fast), and then `c ∈ set` gets specialized for each tuple length. That block is the same thing Julia's automatic union-splitting would create, but that only happens for unions of 3 or fewer types, whereas with the manual approach you can choose how many cases to write out.

Again, these are ugly and should be reserved for something that is very "hot" performance-wise. But it's nice that Julia makes this kind of code possible when it's really important.

tlienart commented 3 years ago

Thanks a lot Tim! This all makes sense to me. In terms of FranklinParser, I probably won't try to optimise it much more, as there's a bunch of other low-hanging fruit to fix in Franklin itself first, and runtime performance is already high enough (standard pages build in ~5-10ms). But I've enjoyed learning about all this a lot and will definitely keep an eye on SnoopCompile and its evolutions, as it really helps me stay aware of some peculiarities that are good to keep in mind 😄

Again, thanks a lot for all your advice and recommendations, I learned a lot!

timholy commented 3 years ago

Yeah, from looking at FranklinParser briefly I concluded it was good to go. Hopefully this adventure will bring you many long-term benefits!