Closed: dmontag closed this issue 6 years ago.
I take that back – after leaving it running, Python 3.6 seems to exhibit the same CPU-spiking behavior, though the number of threads hasn't increased like it did on 3.5.
Don't know if this helps, but here's the gdb thread printout:
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7f0d6e59d700 (LWP 3781) "python" 0x00007f0d6d34a593 in select () at ../sysdeps/unix/syscall-template.S:84
2 Thread 0x7f0d67ee7700 (LWP 3807) "python" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
3 Thread 0x7f0d686e8700 (LWP 3808) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
4 Thread 0x7f0d65c22700 (LWP 3850) "python" 0x00007f0d6e179827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f0d60000ad0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
5 Thread 0x7f0d66423700 (LWP 3851) "python" 0x00007f0d6e179827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f0d50029540) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
6 Thread 0x7f0d4ffff700 (LWP 3869) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
7 Thread 0x7f0d653e1700 (LWP 5052) "python" 0x00007f0d6d34870d in poll () at ../sysdeps/unix/syscall-template.S:84
8 Thread 0x7f0d4f7fe700 (LWP 5053) "python" 0x00007f0d6e179827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f0d3c0016c0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
ps auxw -L showed LWP 5052 using all the CPU, with the others at 0:
USER PID LWP %CPU NLWP %MEM VSZ RSS TTY STAT START TIME COMMAND
david 3781 3781 0.0 8 1.5 720244 60384 ? tl 12:46 0:01 python /home/david/script.py
david 3781 3807 0.0 8 1.5 720244 60384 ? tl 12:46 0:00 python /home/david/script.py
david 3781 3808 0.0 8 1.5 720244 60384 ? Sl 12:46 0:00 python /home/david/script.py
david 3781 3850 0.0 8 1.5 720244 60384 ? Sl 12:46 0:00 python /home/david/script.py
david 3781 3851 0.0 8 1.5 720244 60384 ? Sl 12:46 0:00 python /home/david/script.py
david 3781 3869 0.0 8 1.5 720244 60384 ? Sl 12:47 0:00 python /home/david/script.py
david 3781 5052 96.7 8 1.5 720244 60384 ? Rl 14:01 0:11 python /home/david/script.py
david 3781 5053 0.0 8 1.5 720244 60384 ? Sl 14:01 0:00 python /home/david/script.py
Don't know if this helps, but figured I'd add it here in case something stands out.
@dmontag Thanks for this example. I'll try to reproduce today.
I'm hoping that the recent releases (in particular 0.29.4) have resolved this.
@dmontag It's worth noting that the 3.6/2.7 special branch is just to control how the thread pool executor names the threads in the pool. This feature was added in 3.6:
> New in version 3.6: The thread_name_prefix argument was added to allow users to control the threading.Thread names for worker threads created by the pool for easier debugging.
It is also present in the 2.7 backport.
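For context, here is a minimal sketch of the feature being discussed (the prefix string below is just an illustration, not the name the library actually uses):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# On Python >= 3.6, thread_name_prefix controls the names of the pool's
# worker threads, which makes thread dumps much easier to read.
executor = ThreadPoolExecutor(max_workers=2, thread_name_prefix="Consumer")

def whoami():
    return threading.current_thread().name

name = executor.submit(whoami).result()
print(name)  # e.g. "Consumer_0" instead of the default "ThreadPoolExecutor-0_0"
executor.shutdown()
```

Without the prefix, every pool's workers get near-identical default names, which is exactly what makes debugging leaked threads painful.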
@dmontag So I am running a modified version of this, but it's essentially the same. I even attached the topic to bucket change notifications, though that doesn't seem relevant to the issue.
In google-cloud-pubsub==0.29.2, I am seeing a Thread-gRPC-ConsumeRequestIterator turn into a zombie every 60s or so. These are threads spawned by gRPC as part of our bidirectional consumer. They stay open because we fail to send STOP to the queue they are attached to. This behavior was fixed by #4503 and included in the 0.29.4 release.
I am going to pre-emptively close this issue (while I continue debugging). As of right now, my example has been running (with 0.29.2, since I'm trying to reproduce) for just over 12 minutes; the number of threads has grown to 17, but CPU usage is still hovering near 0%.
Would you mind testing with 0.29.4 in your "simulated conditions" to see if the issue is resolved?
@dhermes Thanks for checking this one out. I tested on 0.29.4 yesterday, and it exhibits the same behavior of having a thread eventually spike and run at 100% sustained CPU – it takes about 65-70 minutes before the condition presents though. I updated my comment on #4274 to reflect this: https://github.com/GoogleCloudPlatform/google-cloud-python/issues/4274#issuecomment-350780990
For now I'll have to run on 0.27.0 which has been running ok for the past 20-or-so hours. The Stackdriver dashboard reports Pull requests for 0.27 vs StreamingPull requests for 0.29 – maybe that's a place to start looking for differences on the client side.
Though I am loosely familiar with many Google APIs, I don't have deep knowledge of many of them. Would you mind pointing to docs for how I might create such a Stackdriver dashboard? (I have a general idea, but am overwhelmed by the sheer number of metrics for each Pub / Sub resource type in the metrics explorer.)
Thanks for continuing to help debug this. I wonder if it'd be possible for us to get a shared piece of code that we are both running? I've been tracking my experiments here https://github.com/dhermes/google-cloud-pubsub-performance and have yet to reproduce a CPU spike (though 5m is the longest anything has run so far).
For Stackdriver you basically click Monitoring on your GCP project in the regular GCP menu, go through the setup wizard, then either:
1. Go to Resources -> Cloud Pub/Sub, click the topic, then click the subscription, and it'll show the operations & requests on the right-hand side. Both kinds of operations need to be present in the stats for the charts to show.
2. Create a custom dashboard in the Stackdriver menu, add a new chart for PubSub Subscription with the StreamingPull Operations and Pull Message Operations dimensions, and save the chart. Again, you need both event types to have occurred to see the stats for both. It'll look something like this: [screenshot omitted]
Not sure if that'll help track down the bug though, it was mostly a thought.
I can try to provide you access to an instance with my exact script toward the end of the week, and hopefully that might let you observe the behavior up close.
@dmontag I have at least been able to reproduce this (in 0.29.4) with a subscription that never receives a message. The CPU usage spikes somewhere between 1h6m29s of runtime (0.4%), 1h6m34s (76.6%) and 1h6m39s (100.1%, and stable at ~100% from there on out).
timeLevel=03989775:INFO
Heartbeat:
psutil info=
CPU usage=0.4%
----------------------------------------
timeLevel=03994789:INFO
Heartbeat:
psutil info=
CPU usage=76.6%
----------------------------------------
timeLevel=03999802:INFO
Heartbeat:
psutil info=
CPU usage=100.1%
I currently have no idea how this is happening. I'm using the Linux gettid syscall to add a tracking hook in threading.Thread, and the CPU spike is occurring with only 8 active pthreads owned by the parent.
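The tracking hook presumably looked something like the sketch below (this is my reconstruction, not the actual debugging code, and SYS_GETTID = 186 assumes x86_64 Linux; the number differs on other architectures):

```python
import ctypes
import threading

# Assumption: x86_64 Linux, where the gettid syscall number is 186.
SYS_GETTID = 186
_libc = ctypes.CDLL(None, use_errno=True)

def gettid():
    """Return the kernel thread id, i.e. the LWP shown by ps/gdb."""
    return _libc.syscall(SYS_GETTID)

class TrackedThread(threading.Thread):
    """Thread subclass that records its LWP, so Python-level threads can
    be matched against `ps auxw -L` and gdb `info threads` output."""
    def run(self):
        self.lwp = gettid()
        super().run()

t = TrackedThread(target=lambda: None)
t.start()
t.join()
print("worker LWP:", t.lwp)
```

With the LWP recorded on each thread, the 100%-CPU LWP from ps can be mapped back to a named Python thread (or shown to belong to no Python thread at all, pointing at gRPC's own pthreads).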
Sounds exactly like what I'm seeing. Good to hear that it's a reproducible problem – that's the first step toward fixing it :-)
I think I've tracked it down. In the exact "spin worker" thread where the CPU usage begins to spike to 100%, there is a thread that checks the credentials. This is a sign that the credentials are being refreshed. This makes sense, since an access token has a lifetime of an hour plus some wiggle room.
/cc @jonparrott
Very interesting, but why does refreshing the creds cause a sustained CPU spike?
At least now it should be easier to debug: you can fake credentials expiring.
I think the 60-minute mark / extra credentials check was a false alarm. I have verified with strace that the threads using 100% CPU are stuck in the POLLIN/POLLHUP dance. I have also run with GRPC_TRACE=all and GRPC_VERBOSITY=debug, so that may yield some ideas. (UPDATE: ZOMG, the captured gRPC trace logs went from 64MB to 37GB once the CPU thrashing started. Holy moly.)
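For anyone wanting to reproduce the tracing setup, the environment variables are set on the command line like this (the script name is a placeholder):

```shell
# Run the repro script with full gRPC internal tracing on stderr.
# Beware: the log grows extremely fast once the thrashing starts
# (64MB -> 37GB in this case), so point it at a disk with room.
GRPC_TRACE=all GRPC_VERBOSITY=debug python script.py 2> grpc-trace.log
```

These are standard gRPC core debugging knobs; GRPC_TRACE accepts a comma-separated list of tracers if "all" is too noisy.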
Seems like https://github.com/grpc/grpc/pull/13665/ is an attempt at a stop-gap.
We may be able to address this by tracking CPU usage and killing an active subscription based on it, but that wouldn't be very fun.
Interesting indeed. Any idea why 0.27 doesn't exhibit this? (I'm not familiar with the library internals)
> Any idea why 0.27 doesn't exhibit this? (I'm not familiar with the library internals)
It doesn't use a StreamingPull.
Do you think it's reproducible without PubSub-specific code? I.e. if a test case can be written that only uses gRPC to reproduce it?
@dmontag Yes, see https://github.com/grpc/grpc/issues/9688
This issue actually covers two distinct problems.
The first has been resolved (e.g. by #4444, #4472, #4498, #4503, #4554, #4558, etc.) and the second has its own issue (I just made it) at #4600.
Hi,
I have searched for answers but haven't found anything matching this issue. The closest was #3965 but it might be a different problem.
My code first initializes a subscriber client and storage client at startup like so:
After ensuring it has a subscription it starts listening:
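(The original code snippets were lost in formatting, so here's a hedged reconstruction of roughly what that setup would look like against the 0.29.x-era API; the function name, project id, and subscription name are placeholders, and nothing here is called at import time since it needs GCP credentials.)

```python
def start_subscriber(project_id, subscription_name):
    # Hypothetical reconstruction of the described startup code.
    # In the 0.29.x API, subscribe() returned a policy object that was
    # then opened with a callback, starting the StreamingPull threads.
    from google.cloud import pubsub_v1, storage

    subscriber = pubsub_v1.SubscriberClient()
    storage_client = storage.Client(project=project_id)  # used elsewhere in the script

    subscription_path = subscriber.subscription_path(project_id, subscription_name)

    def callback(message):
        message.ack()

    subscription = subscriber.subscribe(subscription_path)
    subscription.open(callback)  # spawns the consumer/executor threads
    return subscription, storage_client
```

Even with no messages ever published, opening the subscription keeps the bidirectional StreamingPull stream alive, which is the code path under suspicion here.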
No messages are ever sent on the topic, this is all on a completely idle system.
Eventually the thread count starts to creep up, and after an hour or two CPU usage spikes to 100%. I left it running overnight with some code in the loop printing all thread stacktraces. First I checked how many threads were running:
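The thread-count check was something along these lines (my reconstruction of the lost snippet):

```python
import threading

# Enumerate all live Python threads; on the affected process this count
# kept creeping up over time even though the system was idle.
threads = threading.enumerate()
print(len(threads))
for t in threads:
    print(t.name)
```

Comparing this count against `ps auxw -L` (which counts kernel LWPs) also shows whether the extra threads are Python-level or spawned natively by gRPC.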
Here's an excerpt of the stacktraces printed, and it goes on and on:
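The stacktrace printing can be done with sys._current_frames() (again a reconstruction, since the original loop wasn't preserved):

```python
import sys
import threading
import traceback

# Map thread idents back to names, then dump each thread's current stack.
names = {t.ident: t.name for t in threading.enumerate()}
lines = []
for ident, frame in sys._current_frames().items():
    lines.append("Thread %s (%s):" % (ident, names.get(ident, "unknown")))
    lines.append("".join(traceback.format_stack(frame)))
print("\n".join(lines))
```

In the failing process, the excerpt showed many near-identical stacks for the leaked consumer threads, repeating over and over.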
After checking out the google-cloud-python source code, I noticed that the thread policy_class makes a distinction between Python 3.5 and 3.6:
https://github.com/GoogleCloudPlatform/google-cloud-python/blob/af2e6cac62fc9d380bac780fe63b1416ef94da48/pubsub/google/cloud/pubsub_v1/subscriber/policy/thread.py#L139
I installed Python 3.6.3 and tried it out, and it doesn't seem to exhibit the same behavior – the number of threads stays limited. So in that sense I have a workaround, but I figured I would still report this in case someone else is seeing a similar problem.