open-telemetry / opentelemetry-python

OpenTelemetry Python API and SDK
https://opentelemetry.io
Apache License 2.0
1.72k stars 606 forks source link

Current handling of OTEL_EXPORTER_OTLP_TIMEOUT is not in line with spec #4044

Open LarsMichelsen opened 1 month ago

LarsMichelsen commented 1 month ago

Describe your environment

No response

What happened?

In OTLPExporterMixin.__init__ the environment variable OTEL_EXPORTER_OTLP_TIMEOUT looks like is handled as seconds while the SDK says it has to be in milliseconds (https://opentelemetry.io/docs/languages/sdk-configuration/otlp-exporter/#otel_exporter_otlp_timeout) and the compliance matrix says the Python implementation is compliant.

The description in opentelemetry-sdk/src/opentelemetry/sdk/environment_variables.py does not mention the exact unit.

Since the timeout is currently handed over as timeout to grpc.experimental.unary_unary, which is documented like this:

      timeout: An optional duration of time in seconds to allow for the RPC,
        after which an exception will be raised. If timeout is unspecified,
        defaults to a timeout controlled by the
        GRPC_PYTHON_DEFAULT_TIMEOUT_SECONDS environment variable. If that is
        unset, defaults to 60 seconds. Supply a value of None to indicate that
        no timeout should be enforced.

To my understanding this only an effect in case there is a connection made, but no answer on the RPC sent back, which might not happen very often.

Steps to Reproduce

Start a server.py which accepts the client but does not respond:

#!/usr/bin/env python3

import time
from concurrent import futures

import grpc
from opentelemetry.proto.collector.trace.v1.trace_service_pb2_grpc import add_TraceServiceServicer_to_server, TraceServiceServicer

class NoResponseTraceService(TraceServiceServicer):
    def Export(self, request, context):
        while True:
            time.sleep(1)  # Sleep forever

server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
add_TraceServiceServicer_to_server(NoResponseTraceService(), server)

server.add_insecure_port('[::]:55680')
server.start()

try:
    while True:
        time.sleep(86400)
except KeyboardInterrupt:
    server.stop(0)

Use this test client.py:

#!/usr/bin/env python3

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

trace.set_tracer_provider(TracerProvider())

trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:55680", insecure=True))
)
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("Hello") as span:
    span.set_attribute("language", "Python")

Preparation: Start the server

Test 1: Start the client without environment variable. It uses the default timeout of 10 seconds per try.

> python3 client.py 2>&1 | ts
Jul 11 08:55:59 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 1s.
Jul 11 08:56:10 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 2s.
Jul 11 08:56:22 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 4s.
Jul 11 08:56:36 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 8s.
Jul 11 08:56:54 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 16s.
Jul 11 08:57:20 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 32s.

Test 2: Start the client with the environment variable. The time waiting for the export call is now 5 seconds instead of 10 seconds. But according to the spec it should be 5ms with this value.

> OTEL_EXPORTER_OTLP_TIMEOUT=5 python3 client.py 2>&1 | ts
Jul 11 08:58:56 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 1s.
Jul 11 08:59:02 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 2s.
Jul 11 08:59:09 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 4s.
Jul 11 08:59:18 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 8s.
Jul 11 08:59:31 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 16s.
Jul 11 08:59:52 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 32s.

Expected Result

The Python implementation is in line with the spec and the documentation is clear about the unit.

Actual Result

Setting OTEL_EXPORTER_OTLP_TIMEOUT=1000 sets a timeout of 1000 seconds instead of 1 second.

Additional context

No response

Would you like to implement a fix?

None

xrmx commented 1 month ago

Good catch! This affects all OTEL_EXPORTER_OTLP*_TIMEOUT envs btw

lzchen commented 1 month ago

This seems like a difficult issue to address without greatly breaking a majority of users. As of now, there isn't really an elegant way of fixing this besides from a major version bump of the otlp exporters. For maintainers, this implies also a change in release process since the otlp exporters won't be released with the same versions anymore. Open to other suggestions as well.

LarsMichelsen commented 1 month ago

This seems like a difficult issue to address without greatly breaking a majority of users.

Not sure, I can not really judge that. Before thinking about how to communicate a change, I'd try to clarify the actual technical consequences for users.

  1. Users which configured the timeout in seconds, would once interpreted as milliseconds, provide a really small timeout to the grpc channel, which could lead to timeouts in the communication (have not checked how it is implemented in grpc) which would then lead to retries which might succeed or fail as well. In the end data may be dropped.
  2. Users which configured millisecond values according to the standard documentation are currently in danger of too high timeouts leading to unexpectedly long hanging processes, which may also result in dropped data. I guess this is the less often happening situation, but may still happen.

Please also note: Other than in other language client implementations the timeout has no influence on the total time a export may take (see #4043). Currently the timeout is applied to each export call, the exponential backoff retry is coming on top of that. This is another related aspect which totally unaligned with the spec and other language clients I had a look at.

xrmx commented 1 month ago

As discussed in the SIG a plan could be to add a python specific environment variable to delegate the choice of the adhering to the spec (ms) or keep the current behaviour (s) to the distributions. This way we buy some time upstream but downstream can act fast. And advanced users can do the conversion in autonomy. What do you think?

xrmx commented 1 month ago

Other than handling the conversion from the env var we need to handle the fact that the exporters takes the timeout in seconds as int so in order to have a fine-grain value we need to move to seconds as float or keep int but get milliseconds.

ocelotl commented 1 month ago

Discussed this issue in the SIG meeting today.

Here is what I propose.

We are currently in 1.25.0. We release 1.26.0 normally (this should be happening in the next few days).

After we release 1.26.0, we add a warning that gets triggered when someone uses the OTEL_EXPORTER_OTLP_TIMEOUT environment variable. This warning will mention:

  1. That in release 1.27.0 the value it uses is interpreted as it being in seconds, that it should be interpreted as being in milliseconds.
  2. That in release 1.28.0 we will be making a change in the interpretation of the environment variable from seconds to milliseconds and that can cause issues for our end users.

Then we release 1.27.0 which could not cause an issue and will show users this first warning above.

After releaseing 1.27.0, we change the warning above so that it will mention:

  1. That in release 1.28.0 we changed the interpretation of the environment variable from seconds to milliseconds and that can cause issues for our end users.
  2. That users should check the value of the environment variable is in milliseconds if an issue is happening.

Then we release 1.28.0 which could cause an issue and will show users this second warning above.

After releasing 1.28.0 we remove the warning.

Then we release 1.29.0 which will show users no warning related to this issue.

ocelotl commented 1 month ago

Something else, right now we use an integer to store the timeout value. It may be useful for someone if we stored that value as a float so that it can define a timeout smaller than a second (now) or a millisecond (after we fix this issue).

We have decided to keep using an integer for the time being. Afte we fix this issue we may discuss again if we want to use a float instead.

aabmass commented 1 month ago

@LarsMichelsen thank you for the summary in https://github.com/open-telemetry/opentelemetry-python/issues/4044#issuecomment-2223671513, I agree with your conclusions. In particular:

  • Users which configured the timeout in seconds, would once interpreted as milliseconds, provide a really small timeout to the grpc channel, which could lead to timeouts in the communication (have not checked how it is implemented in grpc) which would then lead to retries which might succeed or fail as well. In the end data may be dropped.

I'm concerned that will be too disruptive for us to make the change proposed by @ocelotl and fix this in a minor version. Users might ignore or otherwise not see the warnings. Some other thoughts

ocelotl commented 1 month ago

Users might ignore or otherwise not see the warnings.

And their application will break. That will teach them a lesson. :sunglasses:

xrmx commented 1 month ago

Something else, right now we use an integer to store the timeout value. It may be useful for someone if we stored that value as a float so that it can define a timeout smaller than a second (now) or a millisecond (after we fix this issue).

We have decided to keep using an integer for the time being. Afte we fix this issue we may discuss again if we want to use a float instead.

What I meant yesterday it that to reduce the breakage from going seconds to ms in the Exporter classes we may keep seconds in the timeout parameter and switch its type to float to still being able to express the values can be set in milliseconds via the environment variables. If we change the semantics of exporter timeout parameter to milliseconds does not make sense in my opinion to move it to float.

xrmx commented 1 month ago

@LarsMichelsen thank you for the summary in #4044 (comment), I agree with your conclusions. In particular:

  • Users which configured the timeout in seconds, would once interpreted as milliseconds, provide a really small timeout to the grpc channel, which could lead to timeouts in the communication (have not checked how it is implemented in grpc) which would then lead to retries which might succeed or fail as well. In the end data may be dropped.

I'm concerned that will be too disruptive for us to make the change proposed by @ocelotl and fix this in a minor version. Users might ignore or otherwise not see the warnings. Some other thoughts

* Since this is only affecting the OTLP exporter and not the "core" API/SDK, wdyt of making a new major version release?

* What are the consequences of NOT fixing the issue and keeping things as they are?

I'm fine with adding a warning but to ease the conversion for distros I would add a configuration option to choose the timeout unit, explicitly e.g. OTEL_PYTHON_EXPORTER_UNIT_TEMPORARY_COMPAT=[s|ms] or implicitly e.g. OTEL_PYTHON_EXPORTER_SECONDS_TEMPORARY_COMPAT=[true|false].

Bumping the major and avoid adding more code to handle this is fine for me too.

LarsMichelsen commented 1 month ago

Something else, right now we use an integer to store the timeout value. It may be useful for someone if we stored that value as a float so that it can define a timeout smaller than a second (now) or a millisecond (after we fix this issue).

We have decided to keep using an integer for the time being. Afte we fix this issue we may discuss again if we want to use a float instead.

I also stumbled upon this while digging through the shutdown timeout issues. There are several inconsistencies which should not exist without good reasons:

  1. milliseconds vs. seconds
  2. timeout vs. timeout_millis naming sometimes with unit, sometimes without
  3. Float vs. int

I'm concerned that will be too disruptive for us to make the change proposed by @ocelotl and fix this in a minor version. Users might ignore or otherwise not see the warnings. Some other thoughts

  • Since this is only affecting the OTLP exporter and not the "core" API/SDK, wdyt of making a new major version release?
  • What are the consequences of NOT fixing the issue and keeping things as they are?

I agree that this need to be thought through carefully and maybe we can come up with better approaches thinking this through further. However, I think it's important for a project to be able to do changes of such kind in some way. Otherwise historic issues will pile up over time making the project unusable at some point.

I'm fine with adding a warning but to ease the conversion for distros I would add a configuration option to choose the timeout unit, explicitly e.g. OTEL_PYTHON_EXPORTER_UNIT_TEMPORARY_COMPAT=[s|ms] or implicitly e.g. OTEL_PYTHON_EXPORTER_SECONDS_TEMPORARY_COMPAT=[true|false].

This would enable users to make the change of their configuration in a minor release and prepare for the major bump which would then change the behavior. I understand this as the main point of the proposal, right? Without an additional warning about the future incompatibility, I doubt many users would use this option. In the end, most people read change logs retrospectively in case they experience any issue.

Also, even if they notice this, changing the deployments to introduce this temporary environment variable seems to be work that people would simply not do.

I think most users would just do the major version bump and in the best case they read about the change in advance and would change their configuration together with the update. In many cases they won't do this in the same step, because they did not read the changelog, then run into problems, look at the changelog and then change their configuration.

xrmx commented 1 month ago

I'm fine with adding a warning but to ease the conversion for distros I would add a configuration option to choose the timeout unit, explicitly e.g. OTEL_PYTHON_EXPORTER_UNIT_TEMPORARY_COMPAT=[s|ms] or implicitly e.g. OTEL_PYTHON_EXPORTER_SECONDS_TEMPORARY_COMPAT=[true|false].

This would enable users to make the change of their configuration in a minor release and prepare for the major bump which would then change the behavior. I understand this as the main point of the proposal, right? Without an additional warning about the future incompatibility, I doubt many users would use this option. In the end, most people read change logs retrospectively in case they experience any issue.

Also, even if they notice this, changing the deployments to introduce this temporary environment variable seems to be work that people would simply not do.

Not really, this would let upstream to keep the same behavior for a bit more time but users (including downstream distros) willing to switch will have the ability to do it.

I think most users would just do the major version bump and in the best case they read about the change in advance and would change their configuration together with the update. In many cases they won't do this in the same step, because they did not read the changelog, then run into problems, look at the changelog and then change their configuration.

I'm proposing a way to configure the behavior, I'm not proposing to change the behavior at the same time.

LarsMichelsen commented 1 month ago

Not really, this would let upstream to keep the same behavior for a bit more time but users (including downstream distros) willing to switch will have the ability to do it.

Ah, thanks for the clarification. I guess I am just missing the broader picture to get that. Is there a place where I can read more about the greater picture, like the downstream distros you mentioned?

xrmx commented 1 month ago

Not really, this would let upstream to keep the same behavior for a bit more time but users (including downstream distros) willing to switch will have the ability to do it.

Ah, thanks for the clarification. I guess I am just missing the broader picture to get that. Is there a place where I can read more about the greater picture, like the downstream distros you mentioned?

We are trying to define the broader picture as there are different opinions :)

Me and others participating in this project are working for companies building their own distributions (see https://opentelemetry.io/docs/concepts/distributions/)

aabmass commented 1 month ago

Also, even if they notice this, changing the deployments to introduce this temporary environment variable seems to be work that people would simply not do.

I think most users would just do the major version bump and in the best case they read about the change in advance and would change their configuration together with the update. In many cases they won't do this in the same step, because they did not read the changelog, then run into problems, look at the changelog and then change their configuration.

Yes I agree. But in the case of a minor version release, I think the user will rightfully come here complaining that we broke things. For a major version, it should be more obvious and it also signals which package might have created the breaking change.

Question for maintainers: is there any downside to a major bump beyond that we need to potentially backport security fixes for some finite window?

aabmass commented 1 month ago

Follow up to the SIG discussion:

@xrmx's concern is fixing this in a timely manner.

@open-telemetry/python-maintainers is there any objection or downside to doing a major version bump? The only thing I can think of is that it will take us out of locksetup.

lzchen commented 1 month ago

is there any objection or downside to doing a major version bump? The only thing I can think of is that it will take us out of locksetup.

Yes the only downside is the maintenance overhead. I think this is a good time to come with an automated process of handling packages that are not released in lockstep (especially now that sdk-extension-aws is also being added to the release cycle: https://github.com/open-telemetry/opentelemetry-python-contrib/issues/1899). Currently, there is not a clean way to do this and may require a bit of complicated version management in our scripts.

xrmx commented 1 month ago

This may be the source of the confusion where the timeout are expressed in seconds: https://opentelemetry.io/docs/specs/otel/protocol/exporter/