open-telemetry / opentelemetry-python

OpenTelemetry Python API and SDK
https://opentelemetry.io
Apache License 2.0

Process never exits due to hang in shutdown #2284

Open flashgorman-chorus opened 2 years ago

flashgorman-chorus commented 2 years ago

We have a process that sometimes fails to exit until we hit ctrl-c. Traceback at that time shows the process is hung, attempting to "join" a worker thread:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/opentelemetry/sdk/trace/__init__.py", line 1136, in shutdown
    self._active_span_processor.shutdown()
  File "/usr/local/lib/python3.7/dist-packages/opentelemetry/sdk/trace/__init__.py", line 166, in shutdown
    sp.shutdown()
  File "/usr/local/lib/python3.7/dist-packages/opentelemetry/sdk/trace/export/__init__.py", line 385, in shutdown
    self.worker_thread.join()
  File "/usr/lib/python3.7/threading.py", line 1044, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):

In this case, the process completed in about 500 ms and would have produced only a few spans. The process was configured to export to a remote OT collector which may or may not have been reachable at the time, but we expect the collector's reachability not to interfere with process execution. We gave it about 30 seconds to complete before terminating the process.

This, of course, interferes with auto-scaling and our ability to manage processes.

Please advise.

owais commented 2 years ago

This shouldn't happen and looks like a bug. I wasn't able to reproduce it with a simple example. Could you please share a reproducible example with all otel dependencies listed?

flashgorman-chorus commented 2 years ago

Thanks, @owais . I was able to reproduce the issue with this stand-alone test:

import time

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter,
)
from opentelemetry.sdk.resources import (
    DEPLOYMENT_ENVIRONMENT,
    SERVICE_NAME,
    SERVICE_NAMESPACE,
    SERVICE_VERSION,
    Resource,
)
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
from opentelemetry.trace import set_tracer_provider

tracer = trace.get_tracer(__name__)

def main():
    print("Initializing OpenTelemetry...")
    provider = TracerProvider(
        sampler=TraceIdRatioBased(0.1),
        resource=Resource.create(
            {
                DEPLOYMENT_ENVIRONMENT: "production",
                "platform": "chorus",
                SERVICE_NAME: "mike",
                SERVICE_NAMESPACE: "chorus",
                SERVICE_VERSION: "1.0",
            }
        ),
    )
    set_tracer_provider(provider)
    span_processor = BatchSpanProcessor(
        OTLPSpanExporter(endpoint="otel.chorus.internal:4317")
    )
    provider.add_span_processor(span_processor)
    print("Pretending to do work...")

    with tracer.start_as_current_span("foo"):
        time.sleep(0.5)

    print("Done!  (Python should exit now.)")

if __name__ == "__main__":
    main()

It took 10-20 tries, but the script eventually exhibited the "hung" behavior reported above. After it hung, I hit Ctrl+C to get the process to exit. Here's the console output of that run:

Initializing OpenTelemetry...
Pretending to do work...
Done!  (Python should exit now.)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/__init__.py", line 1136, in shutdown
    self._active_span_processor.shutdown()
  File "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/__init__.py", line 166, in shutdown
    sp.shutdown()
  File "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 385, in shutdown
    self.worker_thread.join()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1044, in join
    self._wait_for_tstate_lock()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

And here are all of the OT packages I have installed:

opentelemetry-api                           1.6.2
opentelemetry-exporter-otlp                 1.6.2
opentelemetry-exporter-otlp-proto-grpc      1.6.2
opentelemetry-exporter-otlp-proto-http      1.6.2
opentelemetry-instrumentation               0.25b2
opentelemetry-instrumentation-boto          0.25b2
opentelemetry-instrumentation-botocore      0.25b2
opentelemetry-instrumentation-celery        0.25b2
opentelemetry-instrumentation-elasticsearch 0.25b2
opentelemetry-instrumentation-pymongo       0.25b2
opentelemetry-instrumentation-redis         0.25b2
opentelemetry-instrumentation-requests      0.25b2
opentelemetry-instrumentation-sqlalchemy    0.25b2
opentelemetry-instrumentation-tornado       0.25b2
opentelemetry-proto                         1.6.2
opentelemetry-sdk                           1.6.2
opentelemetry-semantic-conventions          0.25b2
opentelemetry-util-http                     0.25b2

flashgorman-chorus commented 2 years ago

FYI - when I replace OTLPSpanExporter with ConsoleSpanExporter, I am not able to reproduce the issue (after 70 tests).

flashgorman-chorus commented 2 years ago

Restoring the original OTLPSpanExporter in my test but disabling the sampler, the issue now happens on every execution. So the reason it previously took "10-20 tries" is that whenever the sampler dropped the span, everything was fine; but when every span is processed (because the sampler is commented out), the Python process hangs every time.

flashgorman-chorus commented 2 years ago

To help illuminate what's going on during shutdown, I "monkey patched" the BatchSpanProcessor.shutdown method to print a traceback for each thread (to see what each one is doing) before dispatching to the "real" shutdown method. The results were, indeed, illuminating:

Initializing OpenTelemetry...
Pretending to do work...
Done!  (Python should exit now.)
in BatchSpanProcessor.shutdown!
These threads exist:
<_MainThread(MainThread, stopped 4441382400)>
<Thread(OtelBatchSpanProcessor, started daemon 123145478844416)>
BatchSpanProcessor's worker_thread is <Thread(OtelBatchSpanProcessor, started daemon 123145478844416)> (123145478844416)
Here's what all threads are doing as we speak:

# ThreadID: 123145478844416
File: "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 890, in _bootstrap
  self._bootstrap_inner()
File: "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 926, in _bootstrap_inner
  self.run()
File: "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 870, in run
  self._target(*self._args, **self._kwargs)
File: "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 237, in worker
  self.condition.wait(timeout)
File: "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 300, in wait
  gotit = waiter.acquire(True, timeout)

# ThreadID: 4441382400
File: "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/__init__.py", line 1136, in shutdown
  self._active_span_processor.shutdown()
File: "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/__init__.py", line 166, in shutdown
  sp.shutdown()
File: "/Users/mgorman/projects/web-server/ot_test.py", line 39, in __my_shutdown
  for filename, lineno, name, line in traceback.extract_stack(stack):
Forwarding the call to shutdown...

I can provide my monkey-patching if needed but, hopefully, this is enough to figure out what's going on.
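For reference, here is a reconstruction of that kind of monkey patch, inferred from the output above (the prints and the __my_shutdown name are guesses based on the traceback, not the author's actual code):

import sys
import threading
import traceback

from opentelemetry.sdk.trace.export import BatchSpanProcessor

_real_shutdown = BatchSpanProcessor.shutdown

def __my_shutdown(self):
    print("in BatchSpanProcessor.shutdown!")
    print("These threads exist:")
    for thread in threading.enumerate():
        print(thread)
    print(
        f"BatchSpanProcessor's worker_thread is {self.worker_thread} "
        f"({self.worker_thread.ident})"
    )
    print("Here's what all threads are doing as we speak:")
    # Dump the current stack of every thread in the process.
    for thread_id, stack in sys._current_frames().items():
        print(f"\n# ThreadID: {thread_id}")
        for filename, lineno, name, line in traceback.extract_stack(stack):
            print(f'File: "{filename}", line {lineno}, in {name}')
            if line:
                print(f"  {line.strip()}")
    print("Forwarding the call to shutdown...")
    _real_shutdown(self)

BatchSpanProcessor.shutdown = __my_shutdown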

flashgorman-chorus commented 2 years ago

Appears the worker thread is hung here: (screenshot omitted)

srikanthccv commented 2 years ago

There is an exponential backoff-and-jitter retry mechanism (64 secs) for OTLPSpanExporter. This same issue has come up a few times since people are not aware of it. The spec is vague and doesn't have a recommended timeout. The last time we talked about it we agreed on this value (earlier it was 900 seconds, taken from the Go implementation) and didn't want to make it configurable (via constructor/env) so as not to introduce breaking changes. I think in this kind of scenario it should finish up as much as possible, cancel the ongoing work, and exit.
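For illustration, a rough sketch of the kind of back-off loop being described (simplified; not the exporter's actual code):

import random
import time

def export_with_retry(send_once, max_retry_delay=64):
    """Retry a failing export with exponentially growing delays
    (1, 2, 4, ... seconds plus jitter), giving up once the delay
    reaches the cap, so an unreachable collector blocks for roughly a minute."""
    delay = 1
    while True:
        if send_once():
            return True
        if delay >= max_retry_delay:
            return False  # give up after about 64 seconds of waiting
        time.sleep(delay + random.uniform(0, 1))  # back-off plus jitter
        delay *= 2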

owais commented 2 years ago

Thanks for the detailed explanation @flashgorman-chorus. This is a great report!

@lonewolf3739 Shouldn't ctrl-c cancel all in-flight requests and exit immediately? That's what I would expect as an end-user.

srikanthccv commented 2 years ago

The author was expecting it to end without having to hit Ctrl+C, which makes sense. It exits after a minute, but nobody waits that long and so perceives this as a hung process.

owais commented 2 years ago

Ah, that makes sense.

owais commented 2 years ago

I think this is working as expected in that case and can probably be closed. It would be great if processors could watch for SIGKILL, log a helpful message, and exit instead of crashing, but that's not a big deal, especially if the implementation is not going to be straightforward.

ocelotl commented 2 years ago

I agree with @owais; what would really "fix" this issue is a log message that lets the user know the process is waiting.

flashgorman-chorus commented 2 years ago

Thanks, all. I would like clarification on whether you are saying there is nothing we can do to influence this "process shutdown" behavior. For example, is there a timeout configuration we can provide if we want shutdown not to wait?

Ideally, we would like a property that says "how long should OTLPSpanExporter wait to flush spans (which I assume is what it is waiting on) during shutdown?" The default could be 64 seconds, as it is today, but if your priority is process management and you don't care about lost spans, you might set it to 0 seconds, which says "never delay process shutdown, for any reason."

owais commented 2 years ago

@flashgorman-chorus there is no special timeout during shutdown today. We just have a single retry timeout which we could make configurable but I'm not sure if setting it to 0 would be a good solution to this problem.

I think what we should do is detect SIGKILL and make one last attempt to flush everything in memory without retry.

owais commented 2 years ago

I think the tracer provider should be responsible for detecting this and calling shutdown() on itself, resulting in all processors shutting down and flushing ASAP.
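(For context, the SDK TracerProvider already hooks process exit via atexit, which is why the hang above shows up under atexit._run_exitfuncs. A minimal sketch of that pattern, not the SDK's exact code:)

import atexit

class ProviderSketch:
    """Registers its own shutdown to run automatically at process exit."""

    def __init__(self, shutdown_on_exit=True):
        self._atexit_handler = None
        if shutdown_on_exit:
            self._atexit_handler = atexit.register(self.shutdown)

    def shutdown(self):
        # Flush span processors and join worker threads here; if an
        # exporter keeps retrying inside this call, process exit is delayed.
        pass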

owais commented 2 years ago

@ocelotl @lonewolf3739 thoughts about this? ^

flashgorman-chorus commented 2 years ago

Thanks. Will await your thoughtful decision. But just to be clear, we are talking about a process exiting naturally, with no human or Ctrl+C involved. I don't know whether SIGKILL is involved in a "natural exit", but if it only comes into play under Ctrl+C or kill <pid> scenarios, I want to make it clear that we are hoping for natural process exit, without human intervention and without delay.

owais commented 2 years ago

@flashgorman-chorus I'm not sure what we can do to "automate" that case. Tracer provider does have a shutdown method and for "natural" exit cases, you could call provider.shutdown() explicitly in your code and that should have the same effect.

owais commented 2 years ago

from opentelemetry.trace import get_tracer_provider

if __name__ == "__main__":
    main()
    # Explicitly flush and shut down tracing before the process exits.
    provider = get_tracer_provider()
    provider.force_flush()
    provider.shutdown()

See if this works for you ^ @flashgorman-chorus. Not sure if you need to call force_flush as well, but you can try and see.

flashgorman-chorus commented 2 years ago

That, unfortunately, will not work because we have at least 100 different scripts so there's no single point of entry and, thus, no single point of exit where we could call this.

Moreover, we see shutdown being called, so that is not the problem; it is what's going on inside shutdown that causes the process to hang. If force_flush would do the trick, then it would be nice if shutdown looked like this:

def shutdown(self, force_flush: bool = True):
    if force_flush:
        self.force_flush()

    # proceed with shutdown...

owais commented 2 years ago

I think shutdown should force flush implicitly if it doesn't already.

flashgorman-chorus commented 2 years ago

I tested this with an explicit call to provider.force_flush() right before I expect the process to exit and all that did was add a 30-second delay.

I next attempted to add a call to self.force_flush() inside my monkey-patched version of shutdown (along the lines I suggested above) with the same results - an extra 30 seconds.

I did verify that if I simply wait 64 seconds, the process will exit. What I don't understand is what it is trying to do for 64 seconds. It can't be "that's the time it takes to 'flush' all of the queued-up spans", because I only create one span in this test.

ocelotl commented 2 years ago

> What I don't understand is what it is trying to do for 64 seconds.

It is mostly waiting. We have to do this: it is required by the spec, and the intention of the spec is to protect the destination from being overwhelmed. If we allowed the exporting to wait 0 seconds we would make that possible, and it would be against the spec. In my opinion we should not allow the user to introduce another delay algorithm, as the spec requires this one.

If you have specific purposes, testing or otherwise, I would recommend making a custom exporter; it should not be very hard:

https://github.com/open-telemetry/opentelemetry-python/blob/main/exporter/opentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/exporter.py#L310

Changing delay to 0 in that line should do the trick.
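As an alternative to copying the exporter and zeroing the delay, here is a rough sketch of a wrapper that caps how long a single export (including the built-in retries) may block; the class name, the 5-second default, and the daemon-thread approach are illustrative assumptions, not part of the library:

import threading
from typing import Sequence

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import SpanExportResult

class BoundedOTLPSpanExporter(OTLPSpanExporter):
    """Gives up on an export after a hard deadline instead of waiting
    out the full retry/back-off window."""

    def __init__(self, *args, max_export_seconds=5.0, **kwargs):
        super().__init__(*args, **kwargs)
        self._max_export_seconds = max_export_seconds

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        parent_export = super().export  # the stock exporter, retries and all
        result = [SpanExportResult.FAILURE]

        def _run():
            result[0] = parent_export(spans)

        # Daemon thread: if the deadline passes, the retrying call keeps
        # running in the background but cannot block interpreter exit.
        worker = threading.Thread(target=_run, daemon=True)
        worker.start()
        worker.join(timeout=self._max_export_seconds)
        return result[0]

The trade-off is that a batch whose export exceeds the deadline is reported as failed and may be dropped.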

I'm sorry, I wish we could provide a solution for this directly in our published code, but I think for this particular situation the spec does not allow us to do so (@owais, @lonewolf3739 please comment if you disagree).

Please let me know if you need help implementing this, @flashgorman-chorus, you can contact me directly via Slack at the CNCF workspace.

flashgorman-chorus commented 2 years ago

Alright! Thanks for the thoughtful response. We'll put our heads together here and decide how we want to proceed (or if we need any help). Everyone's desire to do the right thing has been very refreshing. Thank you.

owais commented 2 years ago

@ocelotl @lonewolf3739 I think the retry mechanism with jitter should only apply during normal operation of a service; that should be enough to comply with the spec. Once an application starts to shut down, I think it might be more reasonable to make an immediate and final attempt (or maybe a few, without exponential back-off) at exporting everything in memory, and quit after that.
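A sketch of that idea (purely illustrative, with made-up names and delays): retries sleep on an event so that, once shutdown begins, they wake early, make one last attempt, and stop backing off.

import threading

class ShutdownAwareRetry:
    """Backs off between retries during normal operation, but stops
    backing off as soon as shutdown is requested."""

    def __init__(self):
        self._shutting_down = threading.Event()

    def shutdown(self):
        self._shutting_down.set()

    def run(self, send_once, delays=(1, 2, 4, 8, 16, 32)):
        for delay in delays:
            if send_once():
                return True
            if self._shutting_down.is_set():
                # Application is exiting: one final attempt, no back-off.
                return send_once()
            # Sleep up to `delay` seconds, but wake early if shutdown starts.
            self._shutting_down.wait(delay)
        return send_once()  # final attempt after the back-off schedule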

Shihao-Zhong commented 2 years ago

Hi guys,

We are facing the same challenge with the hang in the _wait_for_tstate_lock function. Can we apply some special logic for this when the application is ending?

markfickett commented 2 years ago

I'm occasionally seeing a hang similar to this. My app seems to hang indefinitely (> 1h), which doesn't match up with the 64s timeout described above. The app is using Python multiprocessing.

I think I'm seeing the same stack trace when I eventually ^C the program:

^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/.../python3.9/site-packages/opentelemetry/sdk/trace/__init__.py", line 1169, in shutdown
    self._active_span_processor.shutdown()

  File "/.../python3.9/site-packages/opentelemetry/sdk/trace/__init__.py", line 171, in shutdown
    sp.shutdown()
  File "/.../python3.9/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 402, in shutdown
    self.worker_thread.join()
  File "/usr/lib/python3.9/threading.py", line 1033, in join
Aborted!
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.9/threading.py", line 1049, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

I'm using:

opentelemetry-api==1.11.1
opentelemetry-sdk==1.11.1
opentelemetry-exporter-otlp-proto-grpc==1.11.1
opentelemetry-instrumentation-requests==0.30b1
opentelemetry-instrumentation-logging==0.30b1
opentelemetry-instrumentation-sqlalchemy==0.30b1

I'll look out for what conditions trigger it / if I can come up with a repro case that hangs for many minutes, but here's at least the initial information.

alkurbatov commented 1 year ago

Hello colleagues. I experience the same issue :(

My setup:

The problem: if for some reason the all-in-one image goes down before my service shuts down, the service fails to send its tracing data and goes into an endless retry loop with varying timeouts. In this case the only way I can stop it is to send SIGKILL.

The reason is exactly as discussed above: the thread that sends the data never exits because it doesn't give up after some number of failed retries.

The question is: are there any plans to handle this issue? For me it would be enough to have some kind of cancellation API to forcibly finish the task during service shutdown, or even a way to specify a maximum number of retries after which it gives up.

cazador481 commented 1 year ago

I am still seeing this problem. Is there a workaround?

Error:

^CException ignored in atexit callback: <bound method MeterProvider.shutdown of <opentelemetry.sdk.metrics._internal.MeterProvider object at 0x7fffe6ca54e0>>
Traceback (most recent call last):
  File "/home/nv/utils/ci/nvci_create/1.2.1/lib/python3.10/site-packages/opentelemetry/sdk/metrics/_internal/__init__.py", line 440, in shutdown
    metric_reader.shutdown(
  File "/home/nv/utils/ci/nvci_create/1.2.1/lib/python3.10/site-packages/opentelemetry/sdk/metrics/_internal/export/__init__.py", line 543, in shutdown
    self._daemon_thread.join(
  File "/home/utils/Python/builds/3.10.10-20230601/lib/python3.10/threading.py", line 1100, in join
    self._wait_for_tstate_lock(timeout=max(timeout, 0))
  File "/home/utils/Python/builds/3.10.10-20230601/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):

clumsy commented 8 months ago

I wonder if this could be caused by this deadlock issue in ProcessPoolExecutor. In that case the workaround is switching to multiprocessing.Pool or to the latest Python version.