tud-zih-energy / lo2s

Linux OTF2 Sampling - A Lightweight Node-Level Performance Monitoring Tool
https://tu-dresden.de/zih/forschung/projekte/lo2s?set_language=en
GNU General Public License v3.0
45 stars 13 forks source link

fix: move calling context merge until the very end, #241

Closed tilsche closed 1 year ago

tilsche commented 1 year ago

while still destructing the monitoring threads early.

This fixes an issue with unknown functions due to missing mmap entries when deconstructing monitors of threads early.

Fixes #240

tilsche commented 1 year ago

Of course this doesn't work with the switch writer, as the CI nicely points out.

The switch writer seems to have some strange fake calling contexts with Process::invalid() - I have no idea what for. It doesn't even seem to be used, as in having any addresses in the children. Maybe we could just remove it.

I would really love to cut the old switch writer out. This is a maintenance hell. Especially since we need to test the changes that affect them separately.

But sure enough, we have to run 9 year old kernel version (rather than the 7 year Linux 4.3 that introduced PERF_RECORD_SWITCH). Even on hardware that is little over a year old.

cvonelm commented 1 year ago

I have fixed the SwitchWriter in fix-broken-ip-lookup-v2.

Because the Calling Context writing stuff is very similar between the SwitchWriter and SampleWriter, I have moved it into a separate CallingContextManager class. With this change there is almost no actual code left in the SwitchWriter, hopefully making it less of a maintenance nightmare.

RE: Process::invalid() in SwitchWriter

Yeah this is a shortcoming of the SwitchWriter. Via sched:sched_switch we only get the TID of the switched thread, so we have to make something up for the PID.

This is actually pretty much harmless: The PID field of the ThreadCctxRefMap entry only comes in to play when the Thread was never encountered before merging the calling contexts and then only controls the name of the thread and which region_group it belongs to.

We never encounter the thread before in system monitoring mode only when all of the following things happen at the same time: We dont get the thread in the initial get_comms_for_running_threads() scan. We dont get a PERF_RECORD_COMM event for that thread during measurement. We dont get the thread in the get_comms_for_running_threads scan after measurement has completed.

Which all things considered, should never happen. And according to the traces I have recorded with the SwitchWriter it also has not happened to me.

The question remains whether PID=TID might be less wrong than PID=-1

cvonelm commented 1 year ago

I have thought a little bit about your issue with the CallingContextManager doing actual writing stuff and while I can see where you come from, I can not condone that change. This would lead to bit-for-bit duplicated code between the SwitchWriter and Sample based Writer and would result in some awkward hand-overs between the Manager code and the Writer Code.