googleapis / google-cloud-python

Google Cloud Client Library for Python
https://googleapis.github.io/google-cloud-python/
Apache License 2.0

PubSub with Python 3.5 spawns threads that don't die, eventually causing the process to consume all available CPU #4563

Closed: dmontag closed this issue 6 years ago

dmontag commented 6 years ago

Hi,

I have searched for answers but haven't found anything matching this issue. The closest was #3965 but it might be a different problem.

  1. I'm using the Subscriber Client Python API.
  2. OS is Ubuntu 16.04.1 LTS (stock image off GCE)
  3. Python version is 3.5.2 (set up with virtualenv)
  4. google-cloud version 0.31.0, google-cloud-pubsub version 0.29.2, google-cloud-storage version 1.6.0

My code first initializes a subscriber client and storage client at startup like so:

import syslog, time
from google.cloud import pubsub, storage

storage_client = storage.Client(project=PROJECT_ID)
subscriber = pubsub.SubscriberClient()

After ensuring it has a subscription it starts listening:

def listen_subscription(callback):
  """Listen for change events on the GCS topic using the supplied callback function"""
  subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_NAME)
  flow_control = pubsub.types.FlowControl(max_messages=1)

  # subscribe() returns immediately; the consumer runs on background threads.
  subscriber.subscribe(subscription_path, callback=callback, flow_control=flow_control)

  syslog.syslog(syslog.LOG_INFO, 'Listening for messages on: ' + subscription_path)
  # Keep the main thread alive while the background consumer does its work.
  while True:
      time.sleep(60)

No messages are ever sent on the topic, this is all on a completely idle system.

Eventually the thread count starts to creep up, and after an hour or two CPU usage spikes to 100%. I left it running overnight with some code in the loop printing all thread stacktraces. First I checked how many threads were running:

david@myinstance:~$ ps auxw -T | grep python | wc -l
461

Here's an excerpt of the stacktraces printed, and it goes on and on:

<CleanupThread(Thread-1178, started daemon 140501834020608)>
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/david/env/lib/python3.5/site-packages/grpc/_channel.py", line 185, in consume_request_iterator
    request = next(request_iterator)
  File "/home/david/env/lib/python3.5/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py", line 213, in _request_generator_thread
    request = self._request_queue.get()
  File "/usr/lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "/usr/lib/python3.5/threading.py", line 293, in wait
    waiter.acquire()

<CleanupThread(Thread-335, started daemon 140504192370432)>
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/david/env/lib/python3.5/site-packages/grpc/_channel.py", line 185, in consume_request_iterator
    request = next(request_iterator)
  File "/home/david/env/lib/python3.5/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py", line 213, in _request_generator_thread
    request = self._request_queue.get()
  File "/usr/lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "/usr/lib/python3.5/threading.py", line 293, in wait
    waiter.acquire()

<CleanupThread(Thread-44, started daemon 140505006462720)>
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/david/env/lib/python3.5/site-packages/grpc/_channel.py", line 185, in consume_request_iterator
    request = next(request_iterator)
  File "/home/david/env/lib/python3.5/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py", line 213, in _request_generator_thread
    request = self._request_queue.get()
  File "/usr/lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "/usr/lib/python3.5/threading.py", line 293, in wait
    waiter.acquire()
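
For reference, a dump like the excerpt above can be produced by walking sys._current_frames(); a minimal sketch of the idea (not necessarily the exact code run in the loop):

import sys
import threading
import traceback

def dump_all_thread_stacks():
    """Print the current stack of every live thread, as in the excerpt above."""
    frames = sys._current_frames()  # maps thread ident -> its topmost frame
    threads_by_ident = {t.ident: t for t in threading.enumerate()}
    for ident, frame in frames.items():
        print(repr(threads_by_ident.get(ident, ident)))
        print(''.join(traceback.format_stack(frame)))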

After checking out the google-cloud-python source code, I noticed that the thread policy class makes a distinction between Python 3.5 and 3.6:

https://github.com/GoogleCloudPlatform/google-cloud-python/blob/af2e6cac62fc9d380bac780fe63b1416ef94da48/pubsub/google/cloud/pubsub_v1/subscriber/policy/thread.py#L139

I installed Python 3.6.3 and tried it out, and it doesn't seem to exhibit the same behavior – the number of threads stays limited. So in that sense I have a workaround, but I figured I would still report this in case someone else is seeing a similar problem.

dmontag commented 6 years ago

I take that back – after being left running for a while, Python 3.6 exhibits the same CPU-spiking behavior, though the number of threads hasn't increased like it did on 3.5.

Don't know if this helps, but here's the gdb thread printout:

(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f0d6e59d700 (LWP 3781) "python" 0x00007f0d6d34a593 in select () at ../sysdeps/unix/syscall-template.S:84
  2    Thread 0x7f0d67ee7700 (LWP 3807) "python" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
  3    Thread 0x7f0d686e8700 (LWP 3808) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  4    Thread 0x7f0d65c22700 (LWP 3850) "python" 0x00007f0d6e179827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f0d60000ad0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  5    Thread 0x7f0d66423700 (LWP 3851) "python" 0x00007f0d6e179827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f0d50029540) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  6    Thread 0x7f0d4ffff700 (LWP 3869) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  7    Thread 0x7f0d653e1700 (LWP 5052) "python" 0x00007f0d6d34870d in poll () at ../sysdeps/unix/syscall-template.S:84
  8    Thread 0x7f0d4f7fe700 (LWP 5053) "python" 0x00007f0d6e179827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f0d3c0016c0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205

ps auxw -L showed LWP 5052 as using all the CPU, with the others being at 0:

USER       PID   LWP %CPU NLWP %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
david     3781  3781  0.0    8  1.5 720244 60384 ?        tl   12:46   0:01 python /home/david/script.py
david     3781  3807  0.0    8  1.5 720244 60384 ?        tl   12:46   0:00 python /home/david/script.py
david     3781  3808  0.0    8  1.5 720244 60384 ?        Sl   12:46   0:00 python /home/david/script.py
david     3781  3850  0.0    8  1.5 720244 60384 ?        Sl   12:46   0:00 python /home/david/script.py
david     3781  3851  0.0    8  1.5 720244 60384 ?        Sl   12:46   0:00 python /home/david/script.py
david     3781  3869  0.0    8  1.5 720244 60384 ?        Sl   12:47   0:00 python /home/david/script.py
david     3781  5052 96.7    8  1.5 720244 60384 ?        Rl   14:01   0:11 python /home/david/script.py
david     3781  5053  0.0    8  1.5 720244 60384 ?        Sl   14:01   0:00 python /home/david/script.py

Don't know if this helps, but figured I'd add it here in case something stands out.

dhermes commented 6 years ago

@dmontag Thanks for this example. I'll try to reproduce today.

I'm hoping that the recent releases (in particular 0.29.4) have resolved this.

dhermes commented 6 years ago

@dmontag It's worth noting that the 3.6/2.7 special branch is just there to control how the thread pool executor names the threads in the pool. This feature was added in Python 3.6:

New in version 3.6: The thread_name_prefix argument was added to allow users to control the threading.Thread names for worker threads created by the pool for easier debugging.

It is also present in the 2.7 backport.
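
That branch boils down to roughly the following (a paraphrased sketch, not the library's exact code; the worker-thread name prefix here is illustrative):

import sys
from concurrent import futures

def make_executor(max_workers=10):
    """Build the subscriber's thread pool, naming its worker threads where supported."""
    if sys.version_info >= (3, 6) or sys.version_info[0] == 2:
        # thread_name_prefix exists on 3.6+ and in the 2.7 `futures` backport.
        return futures.ThreadPoolExecutor(
            max_workers=max_workers,
            thread_name_prefix='Thread-ConsumeBidirectionalStream',  # illustrative
        )
    # Python 3.5: the argument is not available, so workers get default names.
    return futures.ThreadPoolExecutor(max_workers=max_workers)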

dhermes commented 6 years ago

@dmontag So I am running a modified version of this, but it's essentially the same. I even attached the topic to bucket change notifications, though that doesn't seem relevant to the issue.

In google-cloud-pubsub==0.29.2, I am seeing a Thread-gRPC-ConsumeRequestIterator turn into a zombie every 60s or so. These are threads spawned by gRPC as part of our bidirectional consumer. They stay open because we fail to send STOP to the queue they are attached to. This behavior was fixed by #4503 and included in the 0.29.4 release.
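
The shutdown pattern involved is the usual sentinel-on-a-queue approach: the request-generator thread blocks in queue.get() and only exits once a stop marker is enqueued. A simplified sketch of that idea (not the actual library code):

import queue

_STOP = object()  # sentinel telling the request generator to shut down
requests = queue.Queue()

def request_generator(request_queue):
    """Yield requests to gRPC until the stop sentinel arrives."""
    while True:
        request = request_queue.get()
        if request is _STOP:
            return  # the generator ends, so the consuming gRPC thread can exit
        yield request

def stop_consumer(request_queue):
    # Without this, the thread blocked in request_queue.get() never wakes up
    # and lingers as a zombie (the pre-0.29.4 behavior described above).
    request_queue.put(_STOP)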

I am not going to pre-emptively close this issue (while I continue debugging). As of right now, my example has been running (with 0.29.2, since I'm trying to reproduce) for just over 12 minutes and the number of threads has grown to 17, but CPU usage is still hovering near 0%.

Would you mind testing with 0.29.4 in your "simulated conditions" to see if the issue is resolved?

dmontag commented 6 years ago

@dhermes Thanks for checking this one out. I tested on 0.29.4 yesterday, and it exhibits the same behavior of having a thread eventually spike and run at 100% sustained CPU – it takes about 65-70 minutes before the condition presents though. I updated my comment on #4274 to reflect this: https://github.com/GoogleCloudPlatform/google-cloud-python/issues/4274#issuecomment-350780990

For now I'll have to run on 0.27.0 which has been running ok for the past 20-or-so hours. The Stackdriver dashboard reports Pull requests for 0.27 vs StreamingPull requests for 0.29 – maybe that's a place to start looking for differences on the client side.

dhermes commented 6 years ago

Though I am loosely familiar with many Google APIs, I don't have deep knowledge of most of them. Would you mind pointing to docs for how I might create such a Stackdriver dashboard? (I have a general idea, but am overwhelmed by the sheer number of metrics for each Pub/Sub resource type in the metrics explorer.)

Thanks for continuing to help debug this. I wonder if it'd be possible for us to get a shared piece of code that we are both running? I've been tracking my experiments here https://github.com/dhermes/google-cloud-pubsub-performance and have yet to reproduce a CPU spike (though 5m is the longest anything has run so far).

dmontag commented 6 years ago

For Stackdriver you basically click Monitoring on your GCP project in the regular GCP menu, go through the setup wizard, then either:

  1. Go to Resources -> Cloud Pub/Sub, then click the topic, then click the subscription, and then it'll show the operations & requests on the right-hand side. Both kinds of operations need to be present in the stats for the charts to show.

  2. Create a custom dashboard in the Stackdriver menu, then add a new chart for PubSub Subscription, add the StreamingPull Operations and Pull Message Operations dimensions, and save the chart. Again, you need both event types to have occurred to see the stats for both. It'll look something like this: [screenshot of the resulting chart]

Not sure if that'll help track down the bug, though; it was mostly a thought.

I can try to provide you access to an instance with my exact script toward the end of the week, and hopefully that might let you observe the behavior up close.

dhermes commented 6 years ago

@dmontag I have at least been able to reproduce this (in 0.29.4) with a subscription that never receives a message. The CPU usage spikes somewhere between 1h6m29s and 1h6m39s of running: 0.4% at 1h6m29s, 76.6% at 1h6m34s, and 100.1% at 1h6m39s (stable at ~100% from there on out).

timeLevel=03989775:INFO
Heartbeat:
psutil info=
  CPU usage=0.4%
----------------------------------------
timeLevel=03994789:INFO
Heartbeat:
psutil info=
  CPU usage=76.6%
----------------------------------------
timeLevel=03999802:INFO
Heartbeat:
psutil info=
  CPU usage=100.1%

I currently have no idea how this is happening. I'm using the Linux gettid syscall to add a tracking hook in threading.Thread, and the CPU usage is occurring with only 8 active pthreads owned by the parent process.
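
The tracking hook is along these lines: wrap threading.Thread.run and log the kernel thread id (LWP) so Python threads can be matched against the ps / gdb output above. A rough Linux/x86-64 sketch (the real instrumentation may differ):

import ctypes
import threading

_libc = ctypes.CDLL('libc.so.6', use_errno=True)
_SYS_GETTID = 186  # syscall number for gettid on x86-64 Linux

_original_run = threading.Thread.run

def _tracking_run(self):
    # Record the kernel thread id so this Python thread can be matched against
    # the LWP column in `ps auxw -L` or `info threads` in gdb.
    tid = _libc.syscall(_SYS_GETTID)
    print('thread %r started as LWP %d' % (self.name, tid))
    _original_run(self)

threading.Thread.run = _tracking_run  # install before any worker threads start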

dmontag commented 6 years ago

Sounds exactly like what I'm seeing. Good to hear that it's a reproducible problem – that's the first step toward fixing it :-)

dhermes commented 6 years ago

I think I've tracked it down. In the exact "spin" of the worker where the CPU usage begins to spike to 100%, there is a thread that checks the credentials. This is a sign that the credentials are being refreshed, which makes sense: an access token has a lifetime of an hour plus some wiggle room.

/cc @jonparrott

theacodes commented 6 years ago

Very interesting, but why does refreshing the creds cause a sustained CPU spike?

theacodes commented 6 years ago

At least now it should be easier to debug: you can fake the credentials expiring.
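
For example, expiry can be forced along these lines (a sketch using google.auth application default credentials; assumes ADC and the requests transport are available):

import datetime

import google.auth
from google.auth.transport.requests import Request

credentials, _ = google.auth.default()
credentials.refresh(Request())  # fetch an initial access token

# Push the expiry into the past so the next RPC has to refresh immediately,
# instead of waiting ~1 hour for the real token lifetime to elapse.
credentials.expiry = datetime.datetime.utcnow() - datetime.timedelta(minutes=5)
assert credentials.expired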

dhermes commented 6 years ago

I think the 60-minute mark / extra credentials check was a false alarm. I have verified with strace that the threads using 100% CPU are stuck in the POLLIN/POLLHUP dance. I have also run with GRPC_TRACE=all and GRPC_VERBOSITY=debug, so that may yield some ideas. (UPDATE: ZOMG, the captured gRPC trace logs went from 64MB to 37GB once the CPU thrashing started. Holy moly.)

Seems like https://github.com/grpc/grpc/pull/13665/ is an attempt at a stop-gap.

We may be able to address this by tracking CPU usage and killing an active subscription based on it, but that wouldn't be very fun.
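
That stop-gap would amount to an in-process CPU watchdog, roughly sketched below (not something the library implements); on_runaway would be a callback that closes and re-opens the affected subscription:

import threading
import time

import psutil

def start_cpu_watchdog(on_runaway, threshold=90.0, interval=60):
    """Poll this process's CPU usage and call on_runaway() if it stays pegged."""
    proc = psutil.Process()
    proc.cpu_percent(None)  # prime the measurement; the first reading is meaningless

    def _watch():
        while True:
            time.sleep(interval)
            if proc.cpu_percent(None) > threshold:
                on_runaway()

    thread = threading.Thread(target=_watch, name='cpu-watchdog', daemon=True)
    thread.start()
    return thread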

dmontag commented 6 years ago

Interesting indeed. Any idea why 0.27 doesn't exhibit this? (I'm not familiar with the library internals)

dhermes commented 6 years ago

Any idea why 0.27 doesn't exhibit this? (I'm not familiar with the library internals)

It doesn't use a StreamingPull.

dmontag commented 6 years ago

Do you think it's reproducible without PubSub-specific code, i.e. with a test case that only uses gRPC?

dhermes commented 6 years ago

@dmontag Yes, see https://github.com/grpc/grpc/issues/9688

dhermes commented 6 years ago

This issue actually covers two distinct problems.

The first has been resolved (e.g. by #4444, #4472, #4498, #4503, #4554, #4558, etc.) and the second now has its own issue (which I just filed) at #4600.