Closed williballenthin closed 11 months ago
example improvement (re-using concatenated section data):
before:
❯ time ./GoReSym 11d7cb5750c44c40e767c7c4fa0f388ed64f636cf6b97ceee7bf9ae77683a3c0 > /dev/null
2023/07/21 13:18:58 profile: cpu profiling enabled, cpu.pprof
GoReSym: profile: cpu profiling disabled, cpu.pprof
________________________________________________________
Executed in 15.20 secs fish external
usr time 15.01 secs 151.00 micros 15.01 secs
sys time 0.87 secs 136.00 micros 0.87 secs
❯ pprof -cum -top cpu.pprof
File: GoReSym
Type: cpu
Time: Jul 21, 2023 at 1:18pm (CEST)
Duration: 15.19s, Total samples = 15.42s (101.50%)
Showing nodes accounting for 14.99s, 97.21% of 15.42s total
Dropped 88 nodes (cum <= 0.08s)
flat flat% sum% cum cum%
0 0% 0% 14.67s 95.14% main.main
0 0% 0% 14.67s 95.14% runtime.main
0.01s 0.065% 0.065% 14.40s 93.39% main.main_impl
0 0% 0.065% 12.78s 82.88% github.com/mandiant/GoReSym/objfile.(*Entry).PCLineTable
0 0% 0.065% 12.78s 82.88% github.com/mandiant/GoReSym/objfile.(*File).PCLineTable (inline)
0 0% 0.065% 12.65s 82.04% github.com/mandiant/GoReSym/objfile.(*peFile).pcln
0.29s 1.88% 1.95% 12.65s 82.04% github.com/mandiant/GoReSym/objfile.(*peFile).pcln_scan
4.57s 29.64% 31.58% 12.14s 78.73% github.com/mandiant/GoReSym/objfile.findAllOccurrences (inline)
1.03s 6.68% 38.26% 7.57s 49.09% bytes.Equal (inline)
5.48s 35.54% 73.80% 5.48s 35.54% memeqbody
0 0% 73.80% 1.47s 9.53% github.com/mandiant/GoReSym/objfile.(*Entry).ModuleDataTable
0 0% 73.80% 1.47s 9.53% github.com/mandiant/GoReSym/objfile.(*File).ModuleDataTable
0 0% 73.80% 1.47s 9.53% github.com/mandiant/GoReSym/objfile.(*peFile).moduledata_scan
0 0% 73.80% 1.39s 9.01% github.com/mandiant/GoReSym/debug/pe.(*File).DataAfterSection
1.06s 6.87% 80.67% 1.06s 6.87% runtime.memequal
0.79s 5.12% 85.80% 0.79s 5.12% runtime.memmove
0 0% 85.80% 0.79s 5.12% runtime.systemstack
0 0% 85.80% 0.72s 4.67% runtime.gcBgMarkWorker
0 0% 85.80% 0.72s 4.67% runtime.gcBgMarkWorker.func2
0 0% 85.80% 0.72s 4.67% runtime.gcDrain
0 0% 85.80% 0.56s 3.63% github.com/mandiant/GoReSym/debug/pe.(*Section).Data
0 0% 85.80% 0.53s 3.44% runtime.growslice
after:
❯ time ./GoReSym 11d7cb5750c44c40e767c7c4fa0f388ed64f636cf6b97ceee7bf9ae77683a3c0 > /dev/null
2023/07/21 13:21:06 profile: cpu profiling enabled, cpu.pprof
GoReSym: profile: cpu profiling disabled, cpu.pprof
________________________________________________________
Executed in 6.74 secs fish external
usr time 6.46 secs 175.00 micros 6.46 secs
sys time 0.39 secs 150.00 micros 0.39 secs
❯ pprof -cum -top cpu.pprof
File: GoReSym
Type: cpu
Time: Jul 21, 2023 at 1:21pm (CEST)
Duration: 6.73s, Total samples = 6.64s (98.67%)
Showing nodes accounting for 6.51s, 98.04% of 6.64s total
Dropped 46 nodes (cum <= 0.03s)
flat flat% sum% cum cum%
0 0% 0% 6.35s 95.63% main.main
0 0% 0% 6.35s 95.63% runtime.main
0 0% 0% 6.12s 92.17% main.main_impl
1.98s 29.82% 29.82% 5.34s 80.42% github.com/mandiant/GoReSym/objfile.findAllOccurrences (inline)
0 0% 29.82% 3.67s 55.27% github.com/mandiant/GoReSym/objfile.(*Entry).PCLineTable
0 0% 29.82% 3.67s 55.27% github.com/mandiant/GoReSym/objfile.(*File).PCLineTable (inline)
0 0% 29.82% 3.61s 54.37% github.com/mandiant/GoReSym/objfile.(*peFile).pcln
0.03s 0.45% 30.27% 3.61s 54.37% github.com/mandiant/GoReSym/objfile.(*peFile).pcln_scan
0.44s 6.63% 36.90% 3.36s 50.60% bytes.Equal (inline)
2.44s 36.75% 73.64% 2.44s 36.75% memeqbody
0 0% 73.64% 2.34s 35.24% github.com/mandiant/GoReSym/objfile.(*Entry).ModuleDataTable
0 0% 73.64% 2.34s 35.24% github.com/mandiant/GoReSym/objfile.(*File).ModuleDataTable
0.03s 0.45% 74.10% 2.34s 35.24% github.com/mandiant/GoReSym/objfile.(*peFile).moduledata_scan
0 0% 74.10% 0.55s 8.28% github.com/mandiant/GoReSym/objfile.(*peFile).sectionData
and when paired with #26 the total runtime was down to 1.06 🔥
closed in #28
GoReSym spends a lot of time copying data around during the scans for pcln and moduledata. Each scan is invoked multiple time, and within each scan, there are multiple searches against section data. In other words, there's a pretty hot path here.
For example, when analyzing a kubectl binary, GoReSym spends more than 1.3s (!) in the
DataAfterSection
routine. This is surprising because the routine simply copies big data regions from sections into a single, contiguous array. Ideally, we should only spend a split-second copying section data around; spending a full second here means there's way too much extra work being done.It would be nice to invoke the routine fewer times; however, GoReSym has to juggle a fairly complex data recovery algorithm, so its not a quick job.
Some things that worked for me:
DataAfterSection
trading memory for CPUUnfortunately, due to #23 I don't have GoReSym working in WSL, so I was doing my optimization against v2.3 and there were merge conflicts when cherry picking (1) into master. (2) still works for me an I may open a PR to propose this optimization.