open-telemetry / opentelemetry.io

The OpenTelemetry website and documentation
https://opentelemetry.io
Creative Commons Attribution 4.0 International

Python "getting started" demo does not produce output as shown on instructions #3610

Closed ekquasar closed 10 months ago

ekquasar commented 11 months ago

What went wrong?

In the Python OTel getting started docs, under the run-the-instrumented-app section, there is example output that the user should expect from the application. Reproduced from the docs:

{
    "name": "/rolldice",
    "context": {
        "trace_id": "0xdb1fc322141e64eb84f5bd8a8b1c6d1f",
        "span_id": "0x5c2b0f851030d17d",
        "trace_state": "[]"
    },
    "kind": "SpanKind.SERVER",
    "parent_id": null,
    "start_time": "2023-10-10T08:14:32.630332Z",
    "end_time": "2023-10-10T08:14:32.631523Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "http.method": "GET",
        "http.server_name": "127.0.0.1",
        "http.scheme": "http",
        "net.host.port": 8080,
        "http.target": "/rolldice?rolls=12",
        "net.peer.ip": "127.0.0.1",
        "http.user_agent": "curl/8.1.2",
        "net.peer.port": 58419,
        "http.flavor": "1.1",
        "http.route": "/rolldice",
        "http.status_code": 200,
...

Notice the richness of the information: HTTP method, port, peer IP, route, status code, and so on. However, when running the demo as-is, none of this information appears in the server logs. I've attached the full logs to this ticket (rolldice.log below) and reproduced some of the output here for ease of viewing:

{
    "body": "\u001b[31m\u001b[1mWARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.\u001b[0m\n * Running on http://127.0.0.1:8080",
    "severity_number": "<SeverityNumber.INFO: 9>",
    "severity_text": "INFO",
    "attributes": {
        "otelSpanID": "0",
        "otelTraceID": "0",
        "otelTraceSampled": false,
        "otelServiceName": "dice-server"
    },
    "dropped_attributes": 0,
    "timestamp": "2023-11-27T22:09:01.452634Z",
    "trace_id": "0x00000000000000000000000000000000",
    "span_id": "0x0000000000000000",
    "trace_flags": 0,
    "resource": "BoundedAttributes({'telemetry.sdk.language': 'python', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0', 'service.name': 'dice-server', 'telemetry.auto.version': '0.42b0'}, maxlen=None)"
}
{
    "body": "\u001b[33mPress CTRL+C to quit\u001b[0m",
    "severity_number": "<SeverityNumber.INFO: 9>",
    "severity_text": "INFO",
    "attributes": {
        "otelSpanID": "0",
        "otelTraceID": "0",
        "otelTraceSampled": false,
        "otelServiceName": "dice-server"
    },
    "dropped_attributes": 0,
    "timestamp": "2023-11-27T22:09:01.453917Z",
    "trace_id": "0x00000000000000000000000000000000",
    "span_id": "0x0000000000000000",
    "trace_flags": 0,
    "resource": "BoundedAttributes({'telemetry.sdk.language': 'python', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0', 'service.name': 'dice-server', 'telemetry.auto.version': '0.42b0'}, maxlen=None)"
}
{
    "body": "me is rolling the dice: 2",
    "severity_number": "<SeverityNumber.WARN: 13>",
    "severity_text": "WARNING",
    "attributes": {
        "otelSpanID": "0",
        "otelTraceID": "0",
        "otelTraceSampled": false,
        "otelServiceName": "dice-server"
    },
    "dropped_attributes": 0,
    "timestamp": "2023-11-27T22:09:22.982898Z",
    "trace_id": "0x00000000000000000000000000000000",
    "span_id": "0x0000000000000000",
    "trace_flags": 0,
    "resource": "BoundedAttributes({'telemetry.sdk.language': 'python', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.21.0', 'service.name': 'dice-server', 'telemetry.auto.version': '0.42b0'}, maxlen=None)"
}

Why care?

Notice that the actual output contains none of the rich detail (HTTP method, etc.) shown in the docs. Were that the only issue, maybe that would be acceptable, but it also zeroes out all the trace and span IDs. That seems like a more serious problem.
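For context, the zeroed IDs are not random noise: OpenTelemetry follows the W3C Trace Context convention, in which an all-zero trace ID or span ID marks an invalid (absent) span context. A minimal sketch of that check, using the ID values from the outputs above:

```python
# W3C Trace Context: an all-zero trace_id or span_id denotes an
# invalid span context, i.e. no span was recording at log time.
INVALID_TRACE_ID = 0x00000000000000000000000000000000  # == 0
INVALID_SPAN_ID = 0x0000000000000000                   # == 0

def is_valid_context(trace_id: int, span_id: int) -> bool:
    """True only when both IDs are non-zero."""
    return trace_id != INVALID_TRACE_ID and span_id != INVALID_SPAN_ID

# IDs from the documented example output: a valid, recording context
print(is_valid_context(0xdb1fc322141e64eb84f5bd8a8b1c6d1f, 0x5c2b0f851030d17d))
# IDs from the actual rolldice.log entries: invalid, so no span was active
print(is_valid_context(0x0, 0x0))
```

In other words, the zeroed log records indicate that the auto-instrumentation never created a span at all, rather than that attributes were merely dropped.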

Moreover, none of the histogram information, e.g.:


                    "bucket_counts": [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    "count": 1,
                    "explicit_bounds": [
                      0, 5, 10, 25, 50, 75, 100, 250, 500, 1000
                    ],
                    "max": 1,
                    "min": 1,
                    "start_time_unix_nano": 1666077040063027610,
                    "sum": 1,
                    "time_unix_nano": 1666077098181107419

as shown in the docs, was produced after running the demo, curling upwards of 30 times, and waiting upwards of 60 seconds. Run `grep bucket rolldice.log` to confirm.

rolldice.log

cartermp commented 11 months ago

This sounds like autoinstrumentation didn't execute. Could this be related to your python version?

ekquasar commented 11 months ago

I suppose it could be related, but the docs say Python 3.6 and higher (https://opentelemetry.io/docs/instrumentation/python/#version-support). I'm using Python 3.11.6. It would be great to get an informative error in such a case, I'm happy to make that PR, were the issue to truly be related to an unsupported Python version.

cartermp commented 11 months ago

Yeah, I'd be shocked if 3.11 didn't work, since it's not exactly hot off the presses. Instrumentation failing, or something not getting initialized, doesn't always fail loudly, since failing loudly could be a bad time in production scenarios (imagine a failure to add the Python instrumentation agent to a live app crashing it and bringing the app down).

@open-telemetry/opentelemetry-python-contrib-approvers any ideas why autoinstrumented spans aren't getting created?

ekquasar commented 11 months ago

Here's what Datadog does, for example: their auto-instrumentation provides discoverable warning messages.

(Screenshot attached: Datadog auto-instrumentation warning messages, 2023-11-28.)

srikanthccv commented 11 months ago

A recent change made to the Flask instrumentation is the reason: https://github.com/open-telemetry/opentelemetry-python-contrib/pull/1980. Here are the issue (https://github.com/open-telemetry/opentelemetry-python-contrib/issues/2058) and the PR (https://github.com/open-telemetry/opentelemetry-python-contrib/pull/2013) that add support for Flask 3.x, which also fixes this issue.

Auto-instrumentation does emit informative messages (https://github.com/open-telemetry/opentelemetry-python-contrib/blob/4bf3577fb76480dacca12eecd8ed93cb9fccc274/opentelemetry-instrumentation/src/opentelemetry/instrumentation/auto_instrumentation/_load.py#L79-L84) when it fails to install or instrument a framework. In this case, it was a dependency conflict: DependencyConflict: requested: "werkzeug<3.0.0" but found: "werkzeug 3.0.1". However, these debug logs won't show up until https://github.com/open-telemetry/opentelemetry-python/issues/1059 is supported. Perhaps the getting-started docs should also pin opentelemetry-distro to avoid surprises and update it occasionally, or there could be docs tests that assert the output of the instrumentation.

ekquasar commented 11 months ago

Automatic tests for the code in the docs would be premium.

svrnm commented 11 months ago

> Automatic tests for the code in the docs would be premium.

I agree. If you'd like to help with that, let us know. I have some requirements for it in my head, but I haven't gotten around to writing up an issue yet.

ekquasar commented 11 months ago

Excellent, is this a good place to get stakeholder feedback on those requirements or does this community use another forum for that?

svrnm commented 11 months ago

> Excellent, is this a good place to get stakeholder feedback on those requirements or does this community use another forum for that?

GitHub issues are the right place for it, but we also have bi-weekly SIG meetings, and you can find folks in #otel-comms on CNCF Slack; see https://github.com/open-telemetry/community for more details. If you are willing to invest time and resources to tackle this problem, I recommend that you raise an issue and collect some feedback first, because some people will have thought about this problem in the past already and can provide valuable feedback.

ekquasar commented 11 months ago

Unrelated to above discussion of CI, I reported this issue in detail in the otel-python repo: https://github.com/open-telemetry/opentelemetry-python-contrib/issues/2079

Re: above discussion of CI: ACK I'll go to the Slack and start collecting stakeholder input.

UPDATE: see https://cloud-native.slack.com/archives/CJFCJHG4Q/p1701449543685509 for ongoing discussion.

svrnm commented 11 months ago

> UPDATE: see cloud-native.slack.com/archives/CJFCJHG4Q/p1701449543685509 for ongoing discussion.

Please create a dedicated issue first

cartermp commented 11 months ago

Hrm. At best the action we'd take would be to temporarily roll back everything python-instrumentation-related until such a time that the issues @srikanthccv mentioned are fixed and rolled out.

I've worked at a place where we had live-running code samples and it was an absolute pain to maintain long-term. We ended up rolling back to just pasting code samples inline since the vast majority of the time they just work.

ekquasar commented 11 months ago

The issue is that, for Python at least, I haven't yet found a single piece of the documentation that works as advertised. Literally not one.

There may be some creative middle ground, regardless. For example, CI could be enforced on the example code samples that the docs link to, so we're guaranteed at least one working example.

Or something like that, plus a disclaimer on the docs that the inline samples are untested and not expected to work as-is, pointing readers to the working examples instead.

UPDATE: I found a working example - the manual implementation at https://opentelemetry.io/docs/instrumentation/python/automatic/example/#manually-instrumented-server appears to work.

cartermp commented 11 months ago

Yeah, I think lacking https://github.com/open-telemetry/opentelemetry-python-contrib/pull/2013 is why you're seeing what you're seeing. This is just an extremely impactful bug that hits people everywhere, since "autoinstrument your flask app" is one of the most basic operations to do when starting with OTel + Python.

So I don't want to say no to the idea of some more automation to at least catch when a basic scenario like this completely fails. But at the same time, I think it's reasonable for anyone to assume that a scenario like this should always work, and that anyone who documents it shouldn't need a last-known-good version in their back pocket to fall back on each time (nor should we necessarily have to pepper the docs with banners saying "doesn't quite work yet, sorry!").

I don't have a great alternative idea, though. It's just that maintenance along these lines will be fairly expensive.

svrnm commented 10 months ago

Let's discuss the CI stuff on the dedicated issue for this and get back to the original issue.

@srikanthccv:

> Perhaps docs/getting started should also pin the opentelemetry-distro to avoid the surprises and update occasionally or there could be some docs tests that could assert the output of the instrumentation.

Agreed. We do the same for Java and other languages already, including a workflow to auto-update versions; let me look into that.

svrnm commented 10 months ago

Pinning Werkzeug==2.3.7 does the trick, cc @luke6Lh43.

Let's fix that and also work on pinning the versions of the other packages.
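A pinned requirements file along these lines would capture the workaround (illustrative only; the distro version shown is an assumption based on the `telemetry.auto.version` seen in the logs above, not an official recommendation):

```text
# requirements.txt -- illustrative pin set for the getting-started demo
flask<3                         # Flask 3.x support was still pending at the time
werkzeug==2.3.7                 # last 2.x release; 3.0.x triggers the DependencyConflict
opentelemetry-distro==0.42b0    # matches telemetry.auto.version in the attached logs
```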

cwegener commented 10 months ago

From this discussion, I don't understand which of the changes in Werkzeug 3.0.0 is causing the issue. @svrnm, are you able to clarify what is causing it?

svrnm commented 10 months ago

See this previous comment by @srikanthccv: https://github.com/open-telemetry/opentelemetry.io/issues/3610#issuecomment-1834143783, which eventually leads to this PR pinning werkzeug < 3 for the instrumentation: https://github.com/open-telemetry/opentelemetry-python-contrib/pull/1980

cwegener commented 10 months ago

I think I understand now.

Flask 3.0 is not supported (yet), pending https://github.com/open-telemetry/opentelemetry-python-contrib/issues/1975

When the SDK was updated to restrict the supported Flask version to <3.0 (since 3.0 support is still in development), the docs were not updated to reflect the change? Am I getting this right?

So, this is where the whole "automated Documentation testing" discussion comes in, right?

svrnm commented 10 months ago

No, the docs were updated with the <3 restriction for Flask (see https://github.com/open-telemetry/opentelemetry.io/pull/3359), but at that point in time Werkzeug was still on 2.x; it moved to 3.x afterwards (https://pypi.org/project/Werkzeug/#history), which led to a new issue.

cwegener commented 10 months ago

> and led to a new issue.

I think that was my question :joy: ... what is the issue? What is causing it?