theRockLiu / gperftools

Automatically exported from code.google.com/p/gperftools
BSD 3-Clause "New" or "Revised" License
1 stars 0 forks source link

pprof runs out of memory while reading cpuprof data #165

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. a very long running trace from cpuprof data (300 MB or more)
2. use pprof to generate a report from it
3. pprof runs out of memory in ReadCPUProfile() in 32-bit Perl

What is the expected output? What do you see instead?

pprof should not try to read the entire file at once and unpack it in
memory, this is very inefficient and capacity-limiting

What version of the product are you using? On what operating system?

0.93 on Linux (Red Hat Enterprise 3)

Please provide any additional information below.

I modified pprof's ReadCPUProfile subroutine to use a "sliding window"
approach to reading the file. Here is the patch (against pprof 0.93 since I
still haven't upgraded to 1.3 due to tcmalloc performance degradation) I
would like to see this bugfix integrated into a future release of pprof.
Thank you

Index: pprof
===================================================================
RCS file: scripts/pprof,v
retrieving revision 1.3
retrieving revision 1.4
diff -b -p -u -r1.3 -r1.4
--- pprof       10 Jun 2009 22:35:14 -0000      1.3
+++ pprof       11 Aug 2009 20:16:54 -0000      1.4
@@ -2054,15 +2054,70 @@ sub ReadProfile {
   }
 }

+
+# Don't read everything at once, it'll OOM easily.
+# Stream in chunks along a sliding window with small lookback overlap.
+BEGIN {
+  package CPStream;
+
+  sub new {
+    my ($class, $file) = @_;
+    my $self = { file    => $file,
+                 base    => 0,
+                 stride  => 524288,
+                 overlap => 1024,
+                 slots   => [] };
+    bless $self, $class;
+    $self->overflow();
+    return $self;
+  }
+
+  sub overflow {
+    my ($self) = @_;
+    my $slots = $self->{slots};
+    my $slotsize = $#$slots + 1;
+    my $jumpsize = $slotsize - $self->{overlap};
+    if ($jumpsize > 0) {
+      splice(@$slots, 0, $jumpsize);
+      $self->{base} += $jumpsize;
+    }
+    my $str;
+    my $nbytes = read($self->{file}, $str, $self->{stride});
+    if ($nbytes > 0) {
+      push(@$slots, unpack("L*", $str));
+    }
+  }
+
+  sub get {
+    my ($self, $idx) = @_;
+    my $slots = $self->{slots};
+    while ($#$slots >= 0) {
+      my $cur_min = $self->{base};
+      last if ($idx < $cur_min);
+
+      my $cur_max = $cur_min + $#$slots;
+      if ($idx <= $cur_max) {
+        my $offset = $idx - $cur_min;
+        return $slots->[$offset];
+      }
+      $self->overflow();
+    }
+    return 0;
+  }
+}
+
+
 # CPU profile reader
 sub ReadCPUProfile {
   my $prog = shift;
   my $fname = shift;

-  # Read entire profile into a string
-  my $str;
-  my $nbytes = read(PROFILE, $str, (stat PROFILE)[7]);   # read entire file
-  close(PROFILE);
+  # Read entire profile into a string --> NO, BAD IDEA!!
+  #my $str;
+  #my $nbytes = read(PROFILE, $str, (stat PROFILE)[7]);   # read entire file
+  #close(PROFILE);

   my $version;
   my $period;
@@ -2075,7 +2130,9 @@ sub ReadCPUProfile {
   #1) a valid 64-bit profile to use the 32-bit codepath, and
   #2) a valid 32-bit profile to be unrecognized.

-  my @slots = unpack("L*", $str);
+  #my @slots = unpack("L*", $str);
+
+  my $slots = CPStream->new(*PROFILE);

   # Read header.  The current header version is a 5-element structure
   # containing:
@@ -2093,21 +2150,21 @@ sub ReadCPUProfile {
   #   32-bit:  0 3 0 100 0
   #   64-bit:  0 0  3 0  0 0  100 0  0 0
   #
-  if ($#slots < 4 || $slots[0] != 0 ) {
+  if ($slots->get(0) != 0 ) {
     error("$fname: not a profile file, or old format profile file\n");
   }
-  if ($slots[1] >= 3) {
+  if ($slots->get(1) >= 3) {
     # Normal 32-bit header:
-    $version = $slots[2];
-    $period = $slots[3];
-    $i = 2 + $slots[1];
+    $version = $slots->get(2);
+    $period = $slots->get(3);
+    $i = 2 + $slots->get(1);
     $address_length = 8;

     # Parse profile
-    while ($i <= $#slots) {
-      my $n = $slots[$i++];
-      my $d = $slots[$i++];
-      if ($slots[$i] == 0) {
+    while ($slots->get($i) != 0) {
+      my $n = $slots->get($i++);
+      my $d = $slots->get($i++);
+      if ($slots->get($i) == 0) {
         # End of profile data marker
         $i += $d;
         last;
@@ -2116,7 +2173,7 @@ sub ReadCPUProfile {
       # Make key out of the stack entries
       my @k = ();
       for (my $j = 0; $j < $d; $j++) {
-        my $pc = sprintf("%08x", $slots[$i+$j]);
+        my $pc = sprintf("%08x", $slots->get($i+$j));
         $pcs->{$pc} = 1;
         push @k, $pc;
       }
@@ -2128,28 +2185,31 @@ sub ReadCPUProfile {
   # Normal 64-bit header:  All entries are doubled in size.  The first
   # word (little-endian) should contain the real value, the second should
   # be zero.
-  } elsif ($#slots < 9 || $slots[1] != 0 || $slots[2] < 3 || $slots[3] != 0
-        || $slots[5] != 0 || $slots[7] != 0) {
+  } elsif ($slots->get(1) != 0 ||
+           $slots->get(2)  < 3 ||
+           $slots->get(3) != 0 ||
+           $slots->get(5) != 0 ||
+           $slots->get(7) != 0) {
       error("$fname: not a profile file, or old format profile file\n");
   } else {
-    $version = $slots[4];
-    $period = $slots[6];
-    $i = 4 + 2 * $slots[2];
+    $version = $slots->get(4);
+    $period = $slots->get(6);
+    $i = 4 + 2 * $slots->get(2);
     $address_length = 16;

     # Parse profile
-    while ($i <= $#slots) {
-      my $n = $slots[$i++];
-      my $nhi = $slots[$i++];
+    while ($slots->get($i) != 0) {
+      my $n = $slots->get($i++);
+      my $nhi = $slots->get($i++);
       # Huge counts may coerce to floating point, keeping scale, not precision
       if ($nhi != 0) { $n += $nhi*(2**32); }
-      my $d = $slots[$i++];
-      if ($slots[$i++] != 0) {
+      my $d = $slots->get($i++);
+      if ($slots->get($i++) != 0) {
         my $addr = sprintf("%o", 4 * $i);
         print STDERR "At index $i ($addr):\n";
         error("$fname: stack trace depth >= 2**32\n");
       }
-      if ($slots[$i] == 0 && $slots[$i+1] == 0) {
+      if ($slots->get($i) == 0 && $slots->get($i+1) == 0) {
         # End of profile data marker
         $i += 2 * $d;
         last;
@@ -2158,8 +2218,8 @@ sub ReadCPUProfile {
       # Make key out of the stack entries
       my @k = ();
       for (my $j = $d; $j--; ) {
-        my $pclo = $slots[$i++];
-        my $pchi = $slots[$i++];
+        my $pclo = $slots->get($i++);
+        my $pchi = $slots->get($i++);
         my $pc = sprintf("%08x%08x", $pchi, $pclo);
         $pcs->{$pc} = 1;
         push @k, $pc;
@@ -2169,7 +2229,12 @@ sub ReadCPUProfile {
   }

   # Parse map
-  my $map = substr($str, $i * 4);
+  #my $map = substr($str, $i * 4);
+
+  my $map = '';
+  seek(PROFILE, $i * 4, 0);
+  read(PROFILE, $map, (stat PROFILE)[7]);
+  close(PROFILE);

   my $r = {};
   $r->{version} = $version;

Original issue reported on code.google.com by vvv...@gmail.com on 28 Aug 2009 at 7:02

GoogleCodeExporter commented 9 years ago
0.93 is quite old.  Can you try again with perftools 1.3 -- I think the file 
format
hasn't changed since 0.93, so just using the newer pprof should probably 
suffice --
and see if you have the same problem?  I must say this report surprises me.  We 
use
this functionality a lot within google, on very large binaries with lots of CPU
activity, and have never had a problem running out of space.  How big is your
.profile?  Do you have a simple use-case -- binary plus command-line to run it 
--
that you could share that illustrates the problem?

The fix looks pretty straightforward, and I'm not opposed to applying it, but I 
want
to understand first whether this is just a symptom of some other, underlying 
problem.

Original comment by csilv...@gmail.com on 31 Aug 2009 at 5:35

GoogleCodeExporter commented 9 years ago
Also, in case this patch does end up being the right solution, can you
(electronically) sign the CLA at
   http://code.google.com/legal/individual-cla-v1.0.html

Thanks!
craig

Original comment by csilv...@gmail.com on 31 Aug 2009 at 5:36

GoogleCodeExporter commented 9 years ago
Yes I had the same problem as of 1.2. I took a glance at 1.3 and the code is the
same. I use pprof with just the binary and the profile, no fancy options. The
fundamental problem is this:

  my $nbytes = read(PROFILE, $str, (stat PROFILE)[7]);   # read entire file
  ...
  my @slots = unpack("L*", $str);

OOM happened when the file size went beyond 300-400 MB. For me that came from a 
trace
of 4-5 hours of a particular software. 32-bit Perl went OOM after 3.5+ GB. 
64-bit
Perl did better but it needed 30-40 GB total and nearly 30 minutes just to get 
pack
the unpack() step.

So my patch is to address both the memory & runtime bloat.

Thanks for considering it.

Original comment by vvv...@gmail.com on 3 Sep 2009 at 12:25

GoogleCodeExporter commented 9 years ago
OK, I'm glad to know it's still in the current versions of pprof.

I understad that the problems you are seeing are because the profile file is 
really
big.  But what I don't understand is why it's so big.  While I'm not the most 
expert
at this part of the system, it seems to me these profiles should typically be 
small,
even for executables that run for a long time.  That's our experience, at least.

Do you have a simple example you can show, that generates such large profile 
files? 
That would help us understand if there's some other problem causing too-large 
profile
files or not.  Then we can know if this patch is the best fix, or something 
else.

Original comment by csilv...@gmail.com on 3 Sep 2009 at 1:50

GoogleCodeExporter commented 9 years ago
In the meantime, I've tried applying the patch, and are asking other folks here 
what
they think about it, so if it is the best solution, we'll be ready to go with 
it.

Original comment by csilv...@gmail.com on 3 Sep 2009 at 3:51

GoogleCodeExporter commented 9 years ago
I think the reason for my very large profile file is because of the deep 
stack-trace
in my software (at least 20-30 functions deep, up to 100-120). My application 
has a
built-in Tcl interpreter and many commands are extended with C++ underneath. 
The C++
code can be very deep sometimes. Each Tcl command nesting can introduce several
function layers. So that all adds up to quite a bit of stacking. Huge credit 
goes to
tcmalloc and pprof for handling such a large software without hiccup! except 
this one :-)

Original comment by vvv...@gmail.com on 4 Sep 2009 at 3:12

GoogleCodeExporter commented 9 years ago
We ended up getting rid of the look-behind, which isn't really needed, and made 
a few
other tweaks, but this basic functionality is in perftools 1.4, just released. 
Thanks for the patch!

Original comment by csilv...@gmail.com on 11 Sep 2009 at 6:59