pfitzseb / ProfileCanvas.jl

MIT License
87 stars 6 forks source link

`string(alloc.type)` can be expensive, resulting in a slow `view_allocs` #34

Open Sbozzolo opened 8 months ago

Sbozzolo commented 8 months ago

Something changed in Julia 1.10 (Julia is probably tracking more granual allocations), and ProfileCanvas.jl is now struggling producing views for allocation. To give a sense of the incrased number of allocations, profiling the same run on 1.9 produces an HTML file that is 600 kB, on 1.10 it is 18 MB.

The increased number of allocations tracked manifested the limits of view_allocs: some runs became virtually impossible to plot (cost of plotting > hours). I saw this behvior with julia 1.9 too when working with large allocation trees. I can prune the allocation tree (I probably don't need the very fine leaves anyway), but I think it would be still good to see if there is low-haning fruit to improve the algorithm/performance for view_alloc.

Most time (97%) is spent in string

image

I tracked this down to https://github.com/pfitzseb/ProfileCanvas.jl/blob/08619eb1f37464b7fe1af91ffc544d8a4ad238dd/src/ProfileCanvas.jl#L353

Indeed, I verified that string(alloc.type) can take up to 0.05 s on a very fast machine when the type is complex. With > 10^5-10^6 allocations, this can be very expensive.

It seems that string(alloc.type) is primarily used to compared with c.func, so I wonder if there's a way to avoid converting to string?

Sbozzolo commented 8 months ago

Something changed in Julia 1.10 (Julia is probably tracking more granual allocations), and ProfileCanvas.jl is now struggling producing views for allocation. To give a sense of the incrased number of allocations, profiling the same run on 1.9 produces an HTML file that is 600 kB, on 1.10 it is 18 MB.

The increased number of allocations tracked manifested the limits of view_allocs: some runs became virtually impossible to plot (cost of plotting > hours). I saw this behvior with julia 1.9 too when working with large allocation trees. I can prune the allocation tree (I probably don't need the very fine leaves anyway), but I think it would be still good to see if there is low-haning fruit to improve the algorithm/performance for view_alloc.

EDIT: Most time (97%) is spent in string

image

~I just realized from the above flamegraph that the bottleneck is our show method.~

Sbozzolo commented 8 months ago

Reopened with additional information.

(Sorry again for the noise).

charleskawczynski commented 8 months ago

@pfitzseb, this may be due to us removing a Base.show method for one of our heavily typed parameters.

Can we replace this string call with type_depth_limit? That might be generally better anyway.

Sbozzolo commented 8 months ago

@pfitzseb, this may be due to us removing a Base.show method for one of our heavily typed parameters.

My tests seem to indicate that removing Base.show does make things worse (twice as slow), but it is not the root of the problem (our CI is still using ClimaCore 0.11.7, that has the show method, and is suffering from this).

pfitzseb commented 8 months ago

Yeah, that seems reasonable. I'm not sure how type printing actually works, but the low-tech solution I had in mind is using a LimitIO (there's a package I think?) for this that ends up printing 50 chars at most. We're already using that in Debugger.jl and the VS Code extension in other places and historically it works decently well.

charleskawczynski commented 8 months ago

xref: https://github.com/JuliaPerf/PProf.jl/pull/86 (thanks to @Sbozzolo)