Errors are being raised in API and SDK

Our most basic example is this one:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("foo"):
    print("Hello world!")

If we remove the argument "foo" in this line:

with tracer.start_as_current_span("foo"):

then running this example fails with this error:

Traceback (most recent call last):
  File "/home/ocelotl/github/ocelotl/opentelemetry-python/docs/examples/basic_tracer/basic_trace.py", line 27, in <module>
    with tracer.start_as_current_span():
  File "/home/ocelotl/.pyenv/versions/3.9.5/lib/python3.9/contextlib.py", line 244, in helper
    return _GeneratorContextManager(func, args, kwds)
  File "/home/ocelotl/.pyenv/versions/3.9.5/lib/python3.9/contextlib.py", line 87, in __init__
    self.gen = func(*args, **kwds)
TypeError: start_as_current_span() missing 1 required positional argument: 'name'

This should not happen, the spec states here:

API methods MUST NOT throw unhandled exceptions when used incorrectly by end users.

Our SDK methods are also unsafe, we even raise exceptions directly. Even if we did not raise an exception there, and instead we logged an error we must end up returning a No Op object, as required by the spec (which we don't do):

Whenever API call returns values that is expected to be non-null value - in case of error in processing logic - SDK MUST return a "no-op" or any other "default" object that was (ideally) pre-allocated and readily available. This way API call sites will not crash on attempts to access methods and properties of a null objects.

That means that just wrapping the code of every method of every class or every function in a try except and logging an error is not enough. We need a way to define these NoOps and return them when necessary.

Given the dynamic nature of python, I think we need to clearly define what is considered as a incorrect use by users (and make sure the overhead introduced by this should be kept low and insignificant). I personally don't think the example highlighted in the issue description (and other cases along similar lines) is something we should bother about and handle gracefully.

The example highlighted above is significant, it can cause the application code to crash. As stated here:

We assume that users would prefer to lose telemetry data rather than have the library significantly change the behavior of the instrumented application.

I agree with @lonewolf3739. The way I understand this, we need to guard against unexpected failures crashing services. Things like network and I/O operations failing. We should defensively program such operations to ensure any failures arising from them do not bubble up and crash a service. I don't think it means we make it technically impossible for an exception to be raised especially from obvious wrong usage. I don't think we should protect against users passing wrong number of arguments to a function or trying to call an attribute as if it was a function.

IMO wrong usage is something like a user passing a span to a function that expect a span context or passing an incompatible callback pre/post request to let's say the Flask instrumentation. We need to guard against any exceptions raised by such usage. In all such cases, IMO, the functions themselves should be coded defensively to guard against such cases.

We can make our guards less guarding, we can make it possible to protect against some errors and not against some others. But why? The prototype in #2152 already guards against bad arguments being passed and any error happening in the inside of functions. The intention of OpenTelemetry is clearly stated here:

We assume that users would prefer to lose telemetry data rather than have the library significantly change the behavior of the instrumented application.

I say we guard against all errors we can possibly guard.

Hi all! For clarity, never throwing exceptions or returning errors is a strict requirement for the OpenTelemetry API.

Errors should definitely be piped to some form of diagnostic output so that developers can debug their code. But they should never have to guard an OpenTelemetry API call.

There are two reasons for this:

OTel should not be dangerous to use, even a tiny bit.
There is no realistic way for users to handle API errors effectively at runtime. What would they put in the "catch" clause? The SDK must always handle the errors and make the correct response.

In the case of bad parameters, methods that return an object should still return something functional. They should just use default parameters instead of the missing/malformed input. The default parameters can still indicate an error in some way - naming the span "UNNAMED SPAN" for example.

Hope that makes sense.

How to handle missing/bad parameters should be clarified more in the spec. It would be best if all implementations did the same thing. We also need to clarify how diagnostic data works in general - where should it be sent, what should it include, etc.

My plan was to form a working group on diagnostics once we're finished with the current metrics and instrumentation work. But if others want to bottom line that work, by all means. :)

I have been researching how to implement this mechanism in Python.

For the time being, let's forget about the discussion regarding if we should or should not catch exceptions caused by wrong parameters being passed to methods or classes, let's consider only a mechanism that deals with exceptions raised by the code "inside" a method or function.

So far I don't have any proof that it is impossible to implement, but I am starting to notice that several restrictions will probably have to be applied in the implementation and design of our project. I'll be listing the issues I find here:

First, we need to define what our API is. I am defining it like so:

Public methods
Public classes with their respective methods

Second, we need to define what this mechanism does, I understand this mechanism as something that does the following when an exception is raised in a method:

Catches the exception
"Raises" a warning
Returns a predefined value for each method

This mechanism is being implemented in the safety decorator:

@safety("predefined_string")
def function(a, b, c) -> str:
    # do something that may raise an exception

Pretty much the same happens for class methods:

@safety("predefined_string")
def function(a, b, c) -> str:
    # do something that may raise an exception

So, if the code in function raises an exception, a warning is "raised", then "predefined_string" is returned. The execution of the entire application continues afterwards.

I am trying to apply this mechanism to our code to find out what issues arise:

Unsafe parent classes

Consider this scenario:

class Parent:
    def parent_method(a, b):
        # This code is unsafe

class Child(Parent):
    @safety(None)
    def child_method(a, b):
        # This code is safe

Calling Child().parent_method(1, 2) would be unsafe. We could of course override parent_method in Child and use super and safety to make it a safe method, but keep in mind that the parent class may have many methods (Parent could be a class from a standard library like MutableMapping, for example), including the magic ones like __getitem__ or so. We can't use __getattribute__ to automatically redirect calls to Child methods to a safe version of the parent method because we do not know beforehand the value that we need to return for every method in case an exception is raised.

I have been researching how to implement this mechanism in Python.

For the time being, let's forget about the discussion regarding if we should or should not catch exceptions caused by wrong parameters being passed to methods or classes, let's consider only a mechanism that deals with exceptions raised by the code "inside" a method or function.

So far I don't have any proof that it is impossible to implement, but I am starting to notice that several restrictions will probably have to be applied in the implementation and design of our project. I'll be listing the issues I find here:

First, we need to define what our API is. I am defining it like so:

Public methods

Public classes with their respective methods

Second, we need to define what this mechanism does, I understand this mechanism as something that does the following when an exception is raised in a method:

Catches the exception

"Raises" a warning

Returns a predefined value for each method

This mechanism is being implemented in the safety decorator:
@safety("predefined_string")
def function(a, b, c) -> str:
    # do something that may raise an exception
Pretty much the same happens for class methods:
@safety("predefined_string")
def function(a, b, c) -> str:
    # do something that may raise an exception
So, if the code in function raises an exception, a warning is "raised", then "predefined_string" is returned. The execution of the entire application continues afterwards.

I am trying to apply this mechanism to our code to find out what issues arise:

Unsafe parent classes

Consider this scenario:
class Parent:
    def parent_method(a, b):
        # This code is unsafe

class Child(Parent):
    @safety(None)
    def child_method(a, b):
        # This code is safe
Calling Child().parent_method(1, 2) would be unsafe. We could of course override parent_method in Child and use super and safety to make it a safe method, but keep in mind that the parent class may have many methods (Parent could be a class from a standard library like MutableMapping, for example), including the magic ones like __getitem__ or so. We can't use __getattribute__ to automatically redirect calls to Child methods to a safe version of the parent method because we do not know beforehand the value that we need to return for every method in case an exception is raised.

This problem happened to me while trying to make BoundedAttributes safe. Just to be clear, I don't mean to say that this safety-focused design requirement is impossible or even hard to achieve, on the contrary, I think it is doable if we are very strict with what we expose in the API. We should keep the API minimal, it should only have what is defined in the spec and its classes should not have parents outside of our project.

We do have code in the API package that is not part of the spec. I understand the reason why it is there, I believe this code is there because it is being used in several places around the API and SDK, it is utilitarian code. In order to not duplicate it, we added it into the API package because the API package is the root dependency, so it would be available for every other component of our project.

I now think that was not the right approach. I am not sure where this code should be, I am now inclined to think this code should be in the SDK and each SDK should have its own utilitarian library. Code may end up being repeated, yes, but it is one for the other and safety is a specification requirement, code economy is not.

The central design principle here is that every call that the user makes is done by calling an API function/method. Those functions/methods are safe. Those functions/methods pass the arguments to the SDK. If some exception is raised in the SDK code, it will be caught by the API safety mechanism.

When an SDK function/method returns an OpenTelemetry defined object an SDK object is not returned, an API object is returned. Why? Because it is the API objects the ones who have the safety mechanism in their methods, so if a user is to call one of these methods, it must be an API one. The API object method then redirects the call to an underlying SDK object.

The user only knows API proxies.

open-telemetry / opentelemetry-python

Errors are being raised in API and SDK #2148

Unsafe parent classes

Unsafe parent classes