Patch: Multi-threading support for "real-time" / wall-clock CPU profiling

myzhan / gperftools

Automatically exported from code.google.com/p/gperftools

BSD 3-Clause "New" or "Revised" License

0 stars 0 forks source link

Patch: Multi-threading support for "real-time" / wall-clock CPU profiling #362

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

Patch here:
https://github.com/justinsb/google-perftools/commit/7a14a67d123514358edf6d55b31d
499abd86cded

This is a (rough-draft) patch to address the problems with wall-clock 
("real-time") profiling of multi-threaded apps.  Setting CPUPROFILE_REALTIME 
enables the wall-clock timer, but it doesn't really work properly for 
multi-threaded apps.  A great overview of the problems can be found here:
http://code.google.com/p/google-perftools/issues/detail?id=106#c2

There are lots of formatting/style issues, but I think there will likely be 
discussion of the approach also.  I propose doing a cleanup once an acceptable 
approach is found.

Here's an overview of the current approach and some of the open questions:

* A new environment flag is introduced (CPUPROFILE_THREAD).  If you don't 
specify the thread you should get the default behaviour.

* With the flag, a high-priority "timer" thread is spawned, whose job it is to 
fire SIGPROF events on every thread, with a usleep in between.  It would be 
nice if the single "timer" thread could directly get the stack traces of each 
thread, but I don't think this is practical.  It may be possible to have this 
thread be a signal handler instead, but then I don't think this would work if 
every "normal" thread was in a syscall.

* We need to keep a thread list therefore.  Currently each thread must register 
itself by calling ProfileHandlerRegisterThread.  Neater approaches are 
possible, but this "works for me" i.e. it's not too onerous and it is reliable.

* I am not very knowledgable on the interaction of signals and syscalls, but I 
believe that during a syscall (1) the signal won't fire until the syscalls 
returns and (2) it will only fire once even if multiple signals are sent during 
a single syscall.  Therefore, we need to count the signals "missed" during a 
syscall.  To do this efficiently, the patch keeps a global timestamp and each 
thread has a per-thread variable which is the last timestamp applied.  The 
difference is the number of signals we missed in that syscall.  We then count 
all those "ticks" as if they were repeats of the current stack trace (assuming 
that we get called immediately once the syscall returns).  I think the 
one-signal per-syscall issue affects the single threaded case also (causing 
undercounting of long syscalls)

* This doesn't fix sleep - it still gets interrupted with every signal.  But I 
believe that this patch does not change anything in this regard.

Original issue reported on code.google.com by jus...@fathomdb.com on 25 Aug 2011 at 2:25

GoogleCodeExporter commented 9 years ago

Oops - there's an assert in the current patch that can be hit, because there's 
no real synchronization around the thread clock and signals.

I did also find this page, which documents Linux signals much more deeply:
http://www.kernel.org/doc/man-pages/online/pages/man7/signal.7.html

It looks like my assumptions around multiple signals being aggregated during 
syscalls is not correct (?)  It also suggests that "real-time signals" are 
queued and can also carry data (e.g. the global timestamp), which could 
eliminate the problem:

3 options I think:
Use real-time signals and rely on multiple delivery and hope that we don't 
overflow our structures.
Pass the current clock in the siginfo_t structure, thereby making the signal 
delivery and the timestamp "synchronized" (using pthread_sigqueue)
Just accept the fact that it's not perfect and trust that statistics will save 
us, perhaps with some tweaks.

Craig - any chance your profiling expert can dig me out of my confusion here?

Original comment by jus...@fathomdb.com on 25 Aug 2011 at 3:23

GoogleCodeExporter commented 9 years ago

First off, thanks for drawing up this patch!  It looks very impressive.

I wonder, though, if it was a mistake to add real-time (wall clock) CPU 
profiling to the profiler in the first place.  It wasn't written for that 
purpose, and I think that is showing now.  It would be better to just write an 
entirely separate profiler.  Hopefully common infrastructure like the 
profiledata class could be used by both profilers.

It's very possible that the problems you've been having are intrinsic to this 
kind of profiling, which may be one reason the profiling authors didn't try to 
support this in the first place.  It also sounds like any solution would be 
very OS-specific (and maybe even specific to a particular OS/CPU combo).

Combine all that with the fact I'm not sure what real-time profiling is really 
helpful for (wouldn't it be more useful to use perfmon or some other tool that 
told you why a particular bit of code was taking as long as it is?), and I'm 
not super-excited by an intrusive patch to add real-time profiling to perftools.

That said, if there's a compelling use-case, I'm amenable to adding it in.  But 
it would be better to structure the code so the real-time profiling doesn't 
interact with the 'normal' CPU profiling.

} Craig - any chance your profiling expert can dig me out of my confusion here?

I'll ask, but I don't know that their expertise extends to this area.

Original comment by csilv...@gmail.com on 25 Aug 2011 at 9:56

Added labels: Priority-Medium, Type-Patch

GoogleCodeExporter commented 9 years ago

Perhaps a bit of explanation would be useful here - I wrote this patch only 
after exhausting a frustratingly large number of alternatives!  It is working 
wonderfully for me in terms of identifying the true bottlenecks in my code - 
showing I/O rather than just CPU.  I'm not sure it's entirely accurate, but it 
is definitely the best I've been able to get out of any tool out there, and I'm 
getting good results by using it to guide my optimizations.

I checked out the other open-source profilers - I believe only valgrind 
supports wall-clock profiling, but because it uses emulation everything slows 
down by ~10x.  That's too distortative to be useful to me (because then my disk 
looks 10x faster than it is, I think)  Facebook has published "poor man's 
profiler", which is a script that calls gdb periodically to dump backtraces on 
all threads (and it has 1350 "likes", believe it or not): 
http://poormansprofiler.org/   I think of this patch as a combination of that 
approach with the (vastly superior) perftools infrastructure.

There are two commercial profilers I tried: Intel's VTune and RotateRight's 
Zoom.  Neither support Linux 3 at the moment, though I was able to hack Zoom's 
build scripts.  Zoom was nice, but (I think) only supports wall-clock profiling 
when run against a single-process, and then I'd have to attach it to every 
process, and it got complicated fast.

I looked next at the Linux kernel profiling tools: oprofile and perf.  They're 
OK, but they can't get stack traces into userspace on x86_64 unless everything 
is compiled with -fno-omit-frame-pointers, so essentially I'd have to compile 
everything myself.  Not really a great option.  (And you have to get stack 
pointers so that you can backtrace from the kernel into usercode to know where 
you're paused; the alternative is to try to hook every syscall and record the 
caller's address, but that's relatively high-overhead, at least when I tried 
it)  I think you also have to track the scheduler events, and then try to 
figure out the state of every thread at every moment in time.  The perf 
infrastructure on Linux is still fairly young - I was getting into some pretty 
serious patching to make progress.

I looked at perfmon2 but characterized it as being like oprofile & perf - the 
documentation isn't the easiest to decipher!

So I ended up full-circle - the tool has to run in userspace to get a stack 
trace, and so why not integrate with the one tool that really works 
(google-perftools).  I would say that after a long journey, the traces I'm 
getting now are wonderful & really useful (but perhaps that's just because 
after so much frustration I'm happy to see any real results).

I'm happy to be told "just use X instead", but I don't know what X is.

---

So in summary, this output is really useful for me when profiling non-CPU bound 
code - the hotspots are pretty different.  I do definitely recognize that you 
want to keep the code clean, so I'm happy to refactor with this in mind.  I 
guess if we could figure out what the best technical approach was (e.g. if we 
choose to queue signals, then we don't need the counters and don't need to pass 
the count into the stack trace aggregation functions)

Perhaps a good approach would be a "first order" patch - one that isn't 
necessarily entirely accurate (which should be tolerable given that we're 
sampling anyway), minimizes the changes to existing code, and is strictly 
additive in the same way that CPUPROFILE_REALTIME is.

I suggest that a good first-order approach would be quite similar to the 
current one.  I will investigate whether signals are reliably delivered in less 
than a typical sampling interval even during syscalls - if they are, then maybe 
the counters aren't needed, which would make everything much simpler.

The big disadvantage of separating out wall-clock profiling is that then I 
couldn't reuse all the surrounding infrastructure (http and programmatic 
interface etc).  Given that I don't think people will be doing both wall-clock 
and cpu-clock profiling at the same time, I think it's a net loss in terms of 
code complexity.  Let me try to build a cleaner patch now that I know this is a 
primary concern.

Original comment by jus...@fathomdb.com on 25 Aug 2011 at 11:23

GoogleCodeExporter commented 9 years ago

Yes, I was thinking of perfmon2 (oprofile can be good too, but as you say it 
can be hard to use).  That said, I haven't used it, so I don't know how well it 
will work.

I'm glad to see that perftools (with this patch) is working great for you!  I 
certainly don't take that lightly.  I like the idea of trying a simpler patch 
first, even at the cost of some accuracy.  Let's see how that looks; hopefully 
it will be clear what the best path forward is.

Original comment by csilv...@gmail.com on 25 Aug 2011 at 11:47

GoogleCodeExporter commented 9 years ago

To my knowledge, the tool to use for you is SUN/Orcacle's collect/analyze which 
is part of their free compiler suite. It has some quirks with C++, and with 
Linux (you'll sometimes have to restart the analyzer ~5 times until it digests 
all data), but then you get a decent overview on where your applications sends 
its time. The recently even added a thread view wheher you see exactly what 
your threads are doing.

Original comment by dominik....@gmail.com on 26 Aug 2011 at 11:25

GoogleCodeExporter commented 9 years ago

Thanks dominik - have you been able to get _wall-clock_ profiling with Sun 
Studio's profiler on Linux?  I can't install it to check for myself at the 
moment because the registration system seems to be broken, but I found this in 
the documentation, which seems to suggest that only CPU-clock profiling is 
available on Linux, not wall-clock profiling:

"Clock-based Profiling Under the Linux OS
Under the Linux OS, the only metric available is User CPU time. Although the 
total CPU utilization time reported is accurate, it may not be possible for the 
Analyzer to determine the proportion of the time that is actually System CPU 
time as accurately as for the Solaris OS. Although the Analyzer displays the 
information as if the data were for a lightweight process (LWP), in reality 
there are no LWP’s on a Linux OS; the displayed LWP ID is actually the thread 
ID."

http://download.oracle.com/docs/cd/E19205-01/819-5264/afabn/index.html

Original comment by jus...@fathomdb.com on 26 Aug 2011 at 4:30

GoogleCodeExporter commented 9 years ago

I created a (highly contrived) test program to try to show the differences, as 
part of my experiments.  I found that signals are indeed delayed (at least 
during an fsync), so that it is important to keep track of "ticks" (I haven't 
yet tested real-time signals, but I'm wary that they'll be very OS dependent):
https://github.com/justinsb/google-perftools/blob/db5260f52ca52ef136e13390e4df86
a4a47ba8c7/src/tests/profiler/test.cc

I am attaching sample outputs that shows the differences between the various 
mode.  In the test program, the main thread spawns two worker threads and 
waits.  Each worker thread does a mixture of I/O bound work (writing to a 
tempfile with multiple fsync calls), and some CPU bound work (sorting a big 
array without compiler optimization).

profile_normal.pdf: is with standard profiling (no extra flags).  The sort 
looks like the bottleneck (93%).  Calls to fsync and write do show up, but are 
<3%.

profile_rt.pdf: with CPUPROFILE_REALTIME=1.  100% is spent in pthread_join, 
because only the main thread is being profiled.  This isn't really fair on this 
mode, because it's only intended for single threaded code, but it does show the 
failure in multi-threaded code.  I had expected the background threads to show 
up here (even if they're undercounted), but I guess this is dependent on the 
implementation of signal delivery.

profile_timer.pdf: the proposed patch.  33% is spent in pthread_join, 45% is 
fsync, 20% in the sort.  This is very useful IMHO ... the fsync is a big 
bottleneck , but the sort could also use some attention (e.g. -O2).  The 33% / 
66% is because there are 3 threads; focusing in by method name is useful and 
then gives 100% as the total e.g. with "gv do_test".

I think that demonstrates nicely the benefits of the patch.  Additionally, if 
anyone is feeling charitable they could save me from the oracle registration 
system, and try profiling the test code to see how the results compare :-)

I'm going to look at cleaning up the patch now...

Original comment by jus...@fathomdb.com on 26 Aug 2011 at 6:59

Attachments:

GoogleCodeExporter commented 9 years ago

I had a go at cleaning up the patch, but I think I've found a slightly better 
approach than simply aiming to minimize the delta - abstracting the setitimer 
code using a strategy pattern.  I've attached a patch that just shows the 
changes to the current profile-handler.cc.  What I like is that the patch is 
pretty minimal, but the code is more flexible (I can plug in my thread timer 
approach; hardware events and usercode events may be possible).  But the real 
win, IMHO, is that it's a bit nicer in that it separates out the (complicated) 
timer logic from the profiling logic - the timer_sharing_ complexity is now 
safely hidden inside the TimerProfilerEventSource strategy class.

The thread interval timer is therefore now in its own file; it's here on github 
along with the rest of the latest code:
https://github.com/justinsb/google-perftools/blob/d95de4d640ae266843a7af9f5c431a
3ff55433d9/src/profile-eventsource-timerthread.cc

I probably should move some comments around, but then the patch would be much 
bigger and the code changes would be a bit obscured.  The timer event strategy 
could even be moved into its own file, though again the patch would just become 
much bigger.

Original comment by jus...@fathomdb.com on 26 Aug 2011 at 9:24

Attachments:

create-strategy.diff

GoogleCodeExporter commented 9 years ago

Thanks for sharing this!  Your new approach looks promising.  I ran out of time 
to look at it this week, but will try to get to it early next.

Original comment by csilv...@gmail.com on 27 Aug 2011 at 12:54

GoogleCodeExporter commented 9 years ago

Thanks!  I'm implementing hardware performance tracing as well, so might be 
worth holding off for a day or two - it looks like that'll need some further 
refactoring because (1) hardware events don't use signals and (2) I'm guessing 
you won't want a big perf-events chunk of code in the core library so I'm 
looking at dynamic loading.

Original comment by jus...@fathomdb.com on 29 Aug 2011 at 5:48

GoogleCodeExporter commented 9 years ago

OK, I'll hold off for now.  Let me know when you're ready for a look!

Original comment by csilv...@gmail.com on 29 Aug 2011 at 9:09

GoogleCodeExporter commented 9 years ago

OK, I have a nicer set of patches, I think:

The "big one" is that event sources are implemented using a strategy pattern 
(as in the above diff).  There's a bit more cleanup in that branch: 
https://github.com/justinsb/google-perftools/commits/realtime_threads

Then, I let the strategy implementation be in a dynamically loaded library, so 
that we can have complicated strategies without bloating perftools. We could 
also pull out wall-clock profiling e.g. "incubate" it outside of "core" 
perftools if you'd like.
https://github.com/justinsb/google-perftools/commit/b19a538c40d0e0d1e1cdc8dc307b
28a701309f45

I then changed the callbacks so that they take a backtrace, rather than being 
tightly bound to signals:
https://github.com/justinsb/google-perftools/commit/18ca848284d3e6a30ec97e741a48
38a4ac329d4d

---

In addition to supporting wall-clock profiling, I've just open-sourced a 
library which makes it easy add the perftools HTTP interface (including symbol 
resolution via pprof).  It also supports hardware tracing using the Linux "perf 
events" system, which is now shipping in the mainline kernel.  (If you would 
like to pull any of this into google-perftools I'm happy to make whatever 
changes are needed, but I'm guessing you want to keep perftools lean, at least 
until this code has matured.)

https://github.com/justinsb/fathomdb-google-perftools-extensions/tree/master/src
/main/cpp/fathomdb/perftools

Original comment by jus...@fathomdb.com on 29 Aug 2011 at 9:19

GoogleCodeExporter commented 9 years ago

Thanks, this sounds really interesting!

I don't really know how to use github; is there a way you can provide these as 
patches against perftools 1.8.3?  (Or any 1.8.x; they're all probably the same 
as far as this part of the codebase is concerned.)

Original comment by csilv...@gmail.com on 30 Aug 2011 at 12:42

GoogleCodeExporter commented 9 years ago

Sure - while I like github for viewing code changes, I'm happy to provide 
whatever works for you.  I've attached an "all in one" patch.

I realize I haven't exactly achieved the goal of a minimal patch, though I do 
think these are each individually positive changes.  One option I was 
considering was to maintain a temporary fork of the profiler, let it mature 
etc, and then ideally merge it back in to google-perftools.  I do think that 
hardware profiling is a very valuable addition now that it's in the mainline 
kernel; however I can well imagine that we might want other features in future 
(e.g. profiling multiple hardware events and looking at ratios), which might 
well require extensions to the profile format.  This patch already has a 
breaking change to the callback function signature for profiler event handlers.

Let me know what you think is best; I can also split this into separate patches 
if need be.

Original comment by jus...@fathomdb.com on 30 Aug 2011 at 10:32

Attachments:

all_in_one.patch

GoogleCodeExporter commented 9 years ago

Thanks for the patch!  I like the idea of adding an abstraction layer here (I 
think :-) ).  Do you mind splitting up the patch into two parts? -- one of 
which adds the new abstraction layer and rewrites the existing cpu-based timer 
on top of it, and one of which does everything else.

I may need that second patch to be broken up further in the future, but first 
thing first: I will need to get buy-in on the idea of refactoring the profiler 
in this way, from the main authors of that part of the code (none of which are 
me...)  Keeping it a simple refactoring, with no new functionality, will 
probably make it easier to see.  But keeping the second patch to show as a 
proof-of-concept as to why the refactoring is helpful, would be good as well.

Also, to the extent it's easy, it would be great if you could reformat the 
patch to match existing style by doing (x == 0) comparisons rather than (0 == 
x).  (There may be other small style differences, but that's the one I noticed 
off-hand).

Original comment by csilv...@gmail.com on 30 Aug 2011 at 10:58

GoogleCodeExporter commented 9 years ago

I broke it down into 4 patches:

Patch #1 is the simple refactoring patch (as you requested); the timer event 
source strategy is separated into its own class
Patch #2 changes the callback format to take a stacktrace rather than being a 
signal handler.  This is a breaking change, and makes sense from my 
perspective, but I don't know the uses to which this callback has been put 
within Google.
Patch #3 adds the timer-thread callback strategy.  Technically this doesn't 
depend on #2, but it does involve changing the callback anyway (to include the 
'tick count'), so it makes sense to break the callback once, not twice!
Patch #4 allows event sources to live in external libraries, which is used for 
my hardware profiler interface.

Original comment by jus...@fathomdb.com on 6 Sep 2011 at 9:30

Attachments:

GoogleCodeExporter commented 9 years ago

Thanks!  I've passed the first patch along to the profiler authors for review 
(with some minor tweaks to get rid of things like default arguments, which we 
try not to use).

One thing: can you sign the CLA at
   http://code.google.com/legal/individual-cla-v1.0.html
?

We'll need that before we can incorporate the code.

Original comment by csilv...@gmail.com on 10 Sep 2011 at 4:39

GoogleCodeExporter commented 9 years ago

I'm sorry for the delay.  I got two people to give some good feedback on the 
first patch; I was hoping for a third but have officially timed out. :-)

I'm putting their comments (unedited) below.  The line numbers may not match up 
exactly because of small changes I made to the license text/etc.  Hopefully 
there's enough context to figure out what's going on.  Feel free to draw up a 
new version of the patch based on their feedback.  You can also make responses 
here if there's stuff you don't want to change.

REVIEWER 1

========================================================================
https://mondrian.corp.google.com/file/23841364///depot/google3/base/profiler_eve
ntsource.h?a=1
File //depot/google3/base/profiler_eventsource.h (snapshot 1)
------------------------------------
Line 34: // source. Currently this can only be done from C++.
This looks like it can be safely removed or merged with the class comment.
------------------------------------
Line 38: // The ProfileEventSource is a strategy pattern for producing events
I'd prefer to have more info for an uninitiated reader. So that it's clearer how
to use this class. E.g. something like this:

To provide a custom .... for CPU profiler, one can implement the
ProfileEventSource interface and then give an object of the derived class to ...
We currently use this method to ...
In the future this can also be used to ...
------------------------------------
Line 63: // no callbacks are currently registered.
This is a strage description for a pure-virtual function.
Does the text describe requirements on any implementation?
If yes, we need to make that clearer and provide some help about how one should
go about "register ... with the profile handler", checking if "timers are shared
by all threads", etc.
If the text descibes a specific implementation in a specific derived class, then
it's not clear why the text is here with the pure-virtual API declaration.
------------------------------------
Line 68: // (but might not be, e.g. if we're don't yet know if timers are 
shared)
What callback? what is "underlying event source"? how do we use/check the fact
that it "could be stopped, but might not be"?

I think the comments in this .h must be dramatically improved so that one can
understand what's going on pretty much by looking at this file in isolation.
------------------------------------
Line 75: // Gets the signal that should be monitored, if one should be.
Is this talking about unix signal numbers like SIG_FOO?
Is 0 the special value "no signal"? can we use some macro constant instead or is
it known for sure that all SIG_FOO are non-0?

What impact does it have when this function is overriden to request monitoring
of a signal?
------------------------------------
Line 79: // stops simply sending events; the timer technique doesn't do
I have no idea what "sending of events" is and how a derived class would
start/stop it.

REVIEWER2
========================================================================
https://mondrian.corp.google.com/file/23841364///depot/google3/base/profiler_eve
ntsource.h?a=1
File //depot/google3/base/profiler_eventsource.h (snapshot 1)
------------------------------------
Line 63: // no callbacks are currently registered.
Agreed; need to clean up this description. The requirement is that each thread
in the process must be registered with the profiler exactly once (including the
main thread).
------------------------------------
Line 63: // no callbacks are currently registered.
What is callback count? Why do you need it? IIUC, you are using callback_count_
to enable/disable timers. Why not use Enable()/Disable() APIs below.
------------------------------------
Line 72: // Resets state to the default.
What state? what is a use case?
------------------------------------
Line 75: // Gets the signal that should be monitored, if one should be.
What if it is a user defined signal? This is used in profile-handler.cc to set
appropriate handler (sigaction(...). Shouldn't implementation of this interface
set both the timer and the corresponding signal handler?
------------------------------------
Line 78: // Allow best-effort low-cost suppression of events. The thread
What do you mean by 'best-effort'. What are the exact semantics here. Is it
guaranteed that no events will be generated after a successful return from this
call? Can this be called in the context of 'event'?

Original comment by csilv...@gmail.com on 14 Sep 2011 at 1:12

GoogleCodeExporter commented 9 years ago

Here's a revised patch with comments that are much clearer I hope.  I was 
trying to minimize the diff size, but I've now moved comments around so that I 
think they make a lot more sense.

I do agree with the reviewers that it is a bit odd that the code 
enables/disables the timer as well as enabling/disabling the signal handler.  
I'm not entirely sure why we need to do both (and also why we don't just use a 
flag e.g. suppress_events_ rather than making syscalls), but I didn't want to 
change the existing timer code.

I signed the CLA the other day also.

Hope this is closer!

Original comment by jus...@fathomdb.com on 14 Sep 2011 at 10:07

Attachments:

just-refactor-v2.patch

GoogleCodeExporter commented 9 years ago

Great, thanks much!  I'm heading on vacation in a few hours, but I'll let them 
know about the new patch.

Original comment by csilv...@gmail.com on 14 Sep 2011 at 10:36

GoogleCodeExporter commented 9 years ago

btw, here are some style-type cleanups I made myself:

1) We don't like default parameters that much, so I took out the one that was 
defaulting to string() and just pass in "" manually.

2) Many comments were longer than 80 columns.  In general, we try to wrap 
comments at 72 columns or so, to look nice.

3) We also prefer to end full sentences with periods.

There was also one code-correctness issue that affects us inside google but not 
in opensource: you need to get rid of the GUARDED_BY(control_lock_)s inside the 
TimerProfilerEventSource object, and instead add GUARDED_BY(control_lock_) at 
the definition of event_source_.

Nothing big, but wanted to let you know.  Will ask the others for further 
feedback.

Original comment by csilv...@gmail.com on 14 Sep 2011 at 10:52

GoogleCodeExporter commented 9 years ago

Fair enough - I was targeting 80 columns, but I do have a habit of sneaking 
over the line;  I'll switch to 72.  I'll also mirror the two code changes you 
made.

Have a great vacation!

Original comment by jus...@fathomdb.com on 14 Sep 2011 at 10:55

GoogleCodeExporter commented 9 years ago

OK, I've heard back from the third reviewer, who has this to say about your v2 
patch (I haven't heard back from the others yet).  He didn't look at the whole 
patch series, just this one patch, so some of these comments may reflect that.
---
What's the model here?  I am a little lost.  Given a name like
"ProfileEventSource", I would expect a mechanism by which
an instance of this class delivers events to some concrete data
gathering code.  I don't see anything like that in the interface.
Is this really only suitable for profiling events that are delivered
via signal handlers?  If so, it seems like the class is misnamed.
How about something like:

   class ProfileSignalSource

I don't quite understand why the interface is the way it is.
E.g., why is there internal state that needs to be cleared via Reset?
I would have expected just the following things as the most important
methods:

    get signal#
    enable signals
    disable signals

If this is signal based, why doesn't the caller know how to enable
and disable signals since it already has to deal with signal handler calls?
I.e., why are there virtual EnableEvents/DisableEvents in this class.

Also, did you consider making something a lot more declarative.  For example,
just a struct that contains the properties of the source like the signal number\
.
This code will be invoked from deep with async-signal-safe code, and I suspect
it would be better if all logic was isolated in the caller instead of spread ou\
t
across the caller and multiple implementations of an abstract class.

If there is a RegisterThread, why us there no UnregisterThread?

Overall, I am not so sure about this approach.  Why shouldn't other
profilers be built out of lower-level reusable non-abstract interfaces
(like GetStackTrace, ProfileData, etc.)?  If there are other parts
of profiler.cc that would be common, maybe they could be pulled
out.  The guts of a real-time profiler which needs very different
logic (like per-thread registration) would be separate from the
simpler top-level flow of the cpu-profiler.

I am finding it harder and harder to keep track of the flow through the
existing code.  Unless there is some big simplification of the existing
code, I don't want to make the flow even harder to follow.
---

Original comment by csilv...@gmail.com on 20 Sep 2011 at 10:48

GoogleCodeExporter commented 9 years ago

I agree with a lot of that; I think most of the comments stem from the fact 
that this is just the first of a series of patches.  I think if the series were 
taken as a whole then many of the comments would be addressed (particularly if 
they checked out the hardware event monitoring)

However, I think the refactoring is a vast improvement in terms of 
understanding the logic of what's going on in the profiler.  The comments about 
Reset and UnregisterThread are fair questions to ask of the original code!  
Where there are oddities in the code with the patch, I think generally this is 
exposing opportunities to clean up the profiler (e.g. why does the itimer get 
disabled _and_ the signal handler removed?)

Removing the signal handling from the ProfileHandler entirely might be a step 
forward, by moving it into the timer event source.

Can we try asking for their feedback on the end-product, rather than 
step-by-step?  There was fair feedback about the comments, but otherwise it 
seems that many of the comments are focusing on oddities that are either 
present in the original code (and therefore preserved by the straight 
refactoring), or are questioning where the patch is going.

I could create a single patch which moved all the signal handling pain into a 
new file which isolated out the timer event source, and would generally make 
the code much easier to understand (even as compared to the current code IMHO.) 
 The code would be better organized, but because big chunks of code would be 
moving around it wouldn't be as clear where everything came from.

Original comment by jus...@fathomdb.com on 21 Sep 2011 at 12:33

GoogleCodeExporter commented 9 years ago

I'm not sure exactly what to tell you -- I don't want to make you do lots of 
busywork, but I'm also not the expert on this code and can't say what would be 
enough to get the original authors to sign off on it and what wouldn't.  I 
think they're afraid of getting too pulled into this (they're all on other 
projects now), so I'm doing the best I can to make life easy for both sides.

If it's not too much trouble, I think it would be helpful to have a single 
patch that ends up with the code looking as clear as possible, given the 
comments the third reviewer had.  I suspect he'll just read the new file afresh 
(rather than looking at diffs; I'll provide both), which will hopefully help in 
understanding.

I appreciate your patience with this.  I think we're on the right course!

Original comment by csilv...@gmail.com on 21 Sep 2011 at 1:20

GoogleCodeExporter commented 9 years ago

Ok, I've had a lot of discussion with the reviewers about this patch, and the 
general consensus has been that it's too invasive to the design of the profiler 
without giving much benefit inside Google.  The suggestion is to keep the patch 
available out-of-tree (or in a forked copy of the profiler).

I'm definitely glad to see the patch exists; I only feel bad for making you do 
the work to split up the patch and refactor when it still wasn't enough to get 
the code in.  I'll close this bug WillNotImplement, but still keep it around so 
if anyone else has need for the same functionality, they can find it here.

Original comment by csilv...@gmail.com on 11 Oct 2011 at 12:52

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

Fair enough - sorry I haven't had the chance to do the revised patch yet.  I 
hope to get around to doing that 'real soon now', but it probably does make 
sense to keep this separate anyway while it matures as the code is plainly at a 
different stage in the code maturity cycle.  Maybe sometime in the future, when 
both the perf infrastructure and the patch are more mature, it'll be more 
compelling.

Original comment by jus...@fathomdb.com on 11 Oct 2011 at 2:56