mitodl / ol-infrastructure

Infrastructure automation code for use by MIT Open Learning
BSD 3-Clause "New" or "Revised" License

Add OpenTelemetry integration for Open edX projects #827

Open blarghmatey opened 2 years ago

blarghmatey commented 2 years ago

User Story

Description/Context

Currently, all projects under the Open edX umbrella have a default integration with New Relic and no out-of-the-box support for any other monitoring service. Fortunately, there are a couple of central abstractions that allow additional providers to be hooked in.

For any application that is built on top of Django we can extend the monitoring module in the edx-django-utils package. It is currently hard-coded to use New Relic if the agent is installed, so we will need to work with the code owners to get buy-in on any configuration interface that we would like to support. Possibly just adding a different conditional based on whether the OpenTelemetry library is present?
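As a rough illustration of that idea (a sketch only, not the actual edx-django-utils code; the OpenTelemetry branch is hypothetical), the conditional could look something like this:

```python
# Sketch only: mirrors the "use the agent if it is importable" pattern that
# edx-django-utils already applies to New Relic, with a hypothetical OTel branch.
try:
    import newrelic.agent
    _BACKEND = "newrelic"
except ImportError:
    try:
        from opentelemetry import trace
        _BACKEND = "opentelemetry"
    except ImportError:
        _BACKEND = None


def set_custom_attribute(key, value):
    """Forward a custom attribute to whichever monitoring backend is installed."""
    if _BACKEND == "newrelic":
        newrelic.agent.add_custom_parameter(key, value)
    elif _BACKEND == "opentelemetry":
        trace.get_current_span().set_attribute(key, value)
```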

The front-end applications mostly rely on code in the frontend-platform repository to handle how JS logs are tracked and managed as metrics. We can use the interface defined there to add an integration for OpenTelemetry as well so that operators of Open edX projects have a more open and flexible option for collecting metrics about their system.

Acceptance Criteria

# Tasks
- [ ] Add an OpenTelemetry implementation to edx-django-utils
- [ ] Add an OpenTelemetry implementation to the Open edX frontend-platform
- [ ] Configure edX projects to export instrumentation metrics via OpenTelemetry
- [ ] Route OpenTelemetry data to Grafana

shahbaz-shabbir05 commented 8 months ago

I've been doing some research, and I've found that OpenTelemetry (OTel) is a framework for observing and monitoring applications. Its Python support for tracing and metrics is quite stable, so it can be used to track and measure the performance of our code, but its logging support is still in an experimental phase.

OTel doesn't provide its own backend for observability; instead, it sets a standard for how telemetry data is collected from your applications, regardless of the specific tools you're using, whether that's Jaeger, Zipkin, or something else.

In our case, I am looking into implementing OTel in a Django project. edx-django-utils has a monitoring module you can extend. The package provides middleware classes that handle various concerns, such as caching and monitoring. The actual monitoring implementation lives in the middleware file in the package's internal directory, and it currently uses New Relic for this purpose.

We would create a separate directory for this work and import the necessary classes in its `__init__.py` file. This won't replace the existing functionality; it will simply add OTel-related code alongside it. A flag can then decide whether to use the default New Relic functionality or the new OTel code: by default it uses New Relic, but if the flag is set for OTel, it uses the new implementation.
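A minimal sketch of that switch, assuming a hypothetical OTEL_MONITORING_ENABLED setting and module names chosen purely for illustration:

```python
# Sketch of the flag-based selection described above. The flag name and module
# layout are hypothetical; the point is that New Relic stays the default.
from django.conf import settings


def get_monitoring_backend():
    if getattr(settings, "OTEL_MONITORING_ENABLED", False):  # hypothetical flag
        from .otel import OpenTelemetryMonitoring  # new code added alongside
        return OpenTelemetryMonitoring()
    from .newrelic import NewRelicMonitoring  # existing default behaviour
    return NewRelicMonitoring()
```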

When it comes to using OTel, my understanding is that our code mainly needs to emit data in a format that complies with the OTel standard. The actual visualization and analysis of that data will be handled by third-party observability tools like Jaeger or Zipkin. It's worth noting that you can potentially use New Relic in conjunction with OTel to get the best of both worlds, but I'll need to explore this further.

shahbaz-shabbir05 commented 8 months ago

Post about OpenTelemetry on the Open edX discussion forum: https://discuss.openedx.org/t/opentelemetry-integration-coming-soon/11404

arslanashraf7 commented 8 months ago

A couple of thoughts as we move forward on this.

Context:

I was looking around a bit, and the idea of doing this through an Open edX plugin crossed my mind early on, but I wasn't sure whether we were meant to add middleware, and whether the edX plugin architecture supports adding our own middleware through an independent plugin.

I am not sure of the entire implementation, but if we are only going to add a middleware and some settings, without any other complex changes in edx-platform, we might be able to do it through a plugin. (I also discussed this with Zia to check the feasibility of getting our changes accepted in the platform, and his recommendation was also to go for a plugin if that's possible and the platform if not.)

I think we should spend some time trying to add this through a plugin. Unless we hit something big, we should be able to do the needful.


Proposed Basic Implementation:

It might be a good idea to read up on Open edX plugins to understand the steps below.

The implementation in the case of the plugin would look something like this:

  1. Write our middleware and other functionality.
  2. Add other settings as needed (e.g. a more complex example is how we do it in the Sentry plugin; a simpler one is the Canvas Integration plugin).
  3. Add the middleware at runtime (like the third-party auth app does in the platform).

@pdpinch @blarghmatey What are your thoughts on this?

shahbaz-shabbir05 commented 8 months ago

We have a variety of settings available in OpenTelemetry Django Instrumentation to align with different requirements. Your input on configuring these settings would be valuable:

Options to enrich SQL queries with additional contextual information:

- SQLCOMMENTER_WITH_FRAMEWORK: Attach Django framework and version. (Default: True)
- SQLCOMMENTER_WITH_CONTROLLER: Attach the controller name managing the request. (Default: True)
- SQLCOMMENTER_WITH_ROUTE: Attach the URL path managing the request. (Default: True)
- SQLCOMMENTER_WITH_APP_NAME: Attach the app name managing the request. (Default: True)
- SQLCOMMENTER_WITH_OPENTELEMETRY: Attach the OpenTelemetry traceparent. (Default: True)
- SQLCOMMENTER_WITH_DB_DRIVER: Attach the DB driver name. (Default: True)

These can be set in settings.py.
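For reference, a sketch of how these could look in settings.py; the individual True/False values below are illustrative, not decisions:

```python
# Illustrative settings.py values for the sqlcommenter options listed above.
# Everything defaults to True, so only deliberate overrides actually need setting.
SQLCOMMENTER_WITH_FRAMEWORK = True       # attach Django framework and version
SQLCOMMENTER_WITH_CONTROLLER = True      # attach the controller (view) name
SQLCOMMENTER_WITH_ROUTE = True           # attach the URL route handling the request
SQLCOMMENTER_WITH_APP_NAME = False       # example override: drop the app name
SQLCOMMENTER_WITH_OPENTELEMETRY = True   # attach the OpenTelemetry traceparent
SQLCOMMENTER_WITH_DB_DRIVER = True       # attach the DB driver name
```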

To exclude specific URLs from tracking: export OTEL_PYTHON_DJANGO_EXCLUDED_URLS="client/.*/info,healthcheck"

To use Django's request attributes as span attributes: export OTEL_PYTHON_DJANGO_TRACED_REQUEST_ATTRS='path_info,content_type'

To capture HTTP request headers: export OTEL_INSTRUMENTATION_HTTP_CAPTURE_HEADERS_SERVER_REQUEST="content-type,custom_request_header"

To capture HTTP response headers: export OTEL_INSTRUMENTATION_HTTP_CAPTURE_HEADERS_SERVER_RESPONSE="content-type,custom_response_header"

To avoid capturing sensitive information in headers: export OTEL_INSTRUMENTATION_HTTP_CAPTURE_HEADERS_SANITIZE_FIELDS=".*session.*,set-cookie"

@blarghmatey I am working to set up these settings in the best way possible. If you have any thoughts or insights on these options, please let me know. Thank you!

shahbaz-shabbir05 commented 8 months ago

I just wanted to give you a quick update on adding OpenTelemetry (OTel) to our system and ask for your advice.

I've got OpenTelemetry working in our LMS and CMS with the ol_openedx_otel_monitoring plugin. Currently, trace data is visible in the console for testing.
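For context, this is roughly what console-only tracing for Django looks like with the OpenTelemetry SDK (a sketch of the shape of the setup, not the plugin's actual code; the service name is a placeholder):

```python
# Sketch: console-only tracing for a Django service with the OpenTelemetry SDK.
# The service name is a placeholder; the plugin's real wiring may differ.
from opentelemetry import trace
from opentelemetry.instrumentation.django import DjangoInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "edx-lms"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Auto-instruments Django request handling so each request produces a span.
DjangoInstrumentor().instrument()
```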

I’ve set it up to show data in the console for now, but we need to pick an exporter to use for the production environment. I’m also thinking about adding metrics data and could use some advice on good exporters for that.

There are a lot of settings in the OpenTelemetry Django package, as I mentioned above, and I'm trying to figure out which ones we really need. I've also added a healthcheck endpoint as a quick way to check whether the plugin is working correctly.

Looking forward, I’m not sure if we need to add OpenTelemetry to other parts of our system, like tracking caching and memory. Do you think what we have now is enough?

Right now, our main goal is just to generate the data. Should I focus just on tracking the trace data, or is there something else we should do?

For local development, I think we should continue using the console exporter. But for our live system, we need to decide on the best exporter to use. I understand Grafana is an option, but I need more info to make sure everything works well together.

We’ve decided to use the open-edx-plugin structure instead of edx-django-utils. Any advice on that decision would be really helpful. @blarghmatey

shahbaz-shabbir05 commented 8 months ago

I'm currently looking into how we can manually adjust OTel in our system to match what we're doing with New Relic. However, I need to understand it better before I can say more.

Right now, the plugin is just starting out and only shows trace data in the console. I've noticed each exporter needs its own set of environment variables or configuration values. I'll need to sort out these settings to get whichever exporter we decide to use working with our plugin.

I'm using the default settings for span values in trace data at the moment. I think we can change them to suit our needs, either through settings flags or manual adjustments, but this will need some research.
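For illustration, a sketch of the two kinds of adjustment mentioned above, using the Django instrumentation's request hook plus manual span attributes (the hook body and attribute names are examples only, not agreed conventions):

```python
# Sketch: enriching the auto-generated Django spans via the instrumentation's
# request hook, plus manual attributes on the current span.
from opentelemetry import trace
from opentelemetry.instrumentation.django import DjangoInstrumentor


def _request_hook(span, request):
    # Runs for every instrumented request; add whatever context we decide we need.
    span.set_attribute("edx.request.host", request.get_host())


DjangoInstrumentor().instrument(request_hook=_request_hook)

# Anywhere in application code, the currently active span can also be annotated:
trace.get_current_span().set_attribute("edx.course_id", "course-v1:example+Demo+2024")
```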

Also, we should check how these changes might affect our system's performance. @blarghmatey

blarghmatey commented 8 months ago

So, for collecting the data we will be using a local agent on the instance to route it to the system where we will be storing and analyzing the data. The tool we use is Vector. That means there shouldn't be any need to manage those configurations within the application itself.

The main benefit of doing the integration through edx-django-utils was that it would allow us to take advantage of the existing instrumentation that uses that library, but the OTel integration via the plugin is also useful.

blarghmatey commented 8 months ago

In terms of coverage, I think it's best to start small, see what information we get and how we can use it, and then we can decide what additional information we will need. The benefit of using our own plugin is that we have more control over when/how we add functionality. This will also be a useful exercise in understanding how to implement and use OTel for Django so that we can add that to our other applications.

shahbaz-shabbir05 commented 8 months ago

@blarghmatey Below is the current trace data:

Trace a7d2f6bc31363c23ed78e6b47e577840
| └── [00:44:54.440779] GET otel/healthcheck/, span 9f67c26faae6c416
|     ├── Kind : SERVER
|     ├── Attributes : 
|     │   ├── http.method : GET
|     │   ├── http.server_name : lms.devstack.edx
|     │   ├── http.scheme : http
|     │   ├── net.host.port : 18000
|     │   ├── http.host : 0.0.0.0:18000
|     │   ├── http.url : http://0.0.0.0:18000/otel/healthcheck/
|     │   ├── net.peer.ip : 172.23.0.1
|     │   ├── http.user_agent : Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) 
|     │   │   AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 
|     │   │   Safari/537.36
|     │   ├── http.flavor : 1.1
|     │   ├── http.route : otel/healthcheck/
|     │   └── http.status_code : 200
|     └── Resources : 
|         ├── telemetry.sdk.language : python
|         ├── telemetry.sdk.name : opentelemetry
|         ├── telemetry.sdk.version : 1.20.0
|         └── service.name : unknown_service

shahbaz-shabbir05 commented 8 months ago

As per my initial understanding, to send telemetry data to Vector, I need to choose an exporter in OTel that is compatible with the protocol that Vector is configured to receive. The exporter is responsible for converting the telemetry data (traces, metrics, and logs) into a format that can be understood by the receiving system, in this case, Vector. Vector supports various protocols for ingesting data, including but not limited to Jaeger, Prometheus, and OpenTelemetry Protocol (OTLP).

Do you have a specific exporter recommendation that I should explore further for our use case? @blarghmatey

shahbaz-shabbir05 commented 8 months ago

Adding to that: the OTLP (OpenTelemetry Protocol) exporter is a good option when you want to send data from applications that use OpenTelemetry to Vector, because Vector can receive and understand data sent in the OTLP format. @blarghmatey
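A sketch of what switching from the console exporter to OTLP could look like, assuming whichever collector we end up with (Vector or otherwise) exposes an OTLP endpoint on the default gRPC port 4317:

```python
# Sketch: export spans over OTLP/gRPC to a local collector. The endpoint assumes
# the default OTLP gRPC port (4317) on the same host.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```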

shahbaz-shabbir05 commented 8 months ago

In a nutshell, I've been working on integrating OTel with our plugin and have successfully managed to capture and export trace data. Initially, I displayed the trace data in the console while we were just trying things out. It is important to mention that while OTel's trace and metrics monitoring capabilities are stable, its logging support is still in an experimental phase.

Then, we decided to use Vector for data collection since we are already using it for logs and metrics. In my recent work, I successfully configured the OTLP exporter in our plugin to export OTel trace data. However, I ran into many problems running Vector locally and trying to send these traces to it.

Right now, I have found that Vector's OTel source only accepts log events. This is a bit confusing, because OTel's logging support is still experimental, yet logs are the only OTel signal Vector currently supports, and it leaves me wondering what to do next.

Now, I'm considering whether we should switch to a different tool for collecting OTel trace data, or whether I should update our plugin to use OTel for sending log events to Vector. @blarghmatey

blarghmatey commented 8 months ago

Thank you for this research. Given the lack of support for metrics and traces (which are the most useful data sources for this tooling), we will want to use a different collector than Vector. The logical next choice given our metrics stack would be the Grafana Agent (https://grafana.com/docs/agent/latest/).

shahbaz-shabbir05 commented 8 months ago

> Thank you for this research. Given the lack of support for metrics and traces (which are the most useful data sources for this tooling), we will want to use a different collector than Vector. The logical next choice given our metrics stack would be the Grafana Agent (https://grafana.com/docs/agent/latest/).

Are we not considering sending OTel traces to Grafana?

shahbaz-shabbir05 commented 8 months ago

From what I know, Grafana does not directly store trace or metric data. Instead, it displays data from other backend systems like Prometheus, Cortex, M3, etc., that actually store the information. Therefore, we must first send our OpenTelemetry data to one of these backend solutions. After that, we can use Grafana to visualize and analyze the data. Is my understanding accurate? @blarghmatey

blarghmatey commented 8 months ago

So, the link that I posted above is for an agent process that runs alongside the edX service to retrieve the OTel data directly and then handles routing it to the backend services that Grafana is querying. This would be similar to how Vector operates, but with better support for OTel inputs.

shahbaz-shabbir05 commented 8 months ago

I am reaching out to share details and seek clarification on setting up a Grafana Agent for visualizing OTel data within our edX service. Based on previous discussions, I understand that we need to run the Grafana Agent alongside our edX service, ensuring it can directly collect OTel data for visualization.

Integration with Grafana Agent:

Alternative: Using Prometheus:

Regarding OpenTelemetry Collector:

If anyone has any other requirements or suggestions that need to be considered, feel free to share. @blarghmatey

shahbaz-shabbir05 commented 8 months ago

I've made some changes to my plugin. Now it can use either the console exporter (which is good for local testing) or the OTLP exporter; the console exporter is the default. I've also added a rich console output for local testing.
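Roughly, the selection logic looks like this; the OTEL_EXPORTER setting name and its values are placeholders rather than the plugin's actual interface:

```python
# Sketch of the console-by-default, OTLP-when-enabled selection described above.
# The setting name and values are placeholders for whatever the plugin exposes.
from django.conf import settings
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import ConsoleSpanExporter


def build_span_exporter():
    if getattr(settings, "OTEL_EXPORTER", "console") == "otlp":
        # Endpoint and headers come from the standard OTEL_EXPORTER_OTLP_* env vars.
        return OTLPSpanExporter()
    return ConsoleSpanExporter()
```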

blarghmatey commented 8 months ago

Thank you for these details. To answer your question, no we do not need to configure the Grafana agent within the plugin. The data collection is the responsibility of operators and is out of scope for the application runtime.

shahbaz-shabbir05 commented

  1. As of now, Grafana only provides support for OTLP via HTTP.
  2. From the documentation for metrics:
shahbaz-shabbir05 commented 7 months ago

I was testing trace data from the OTel plugin and using the OTLP exporter to send it directly to Grafana. I created a trial account on Grafana Cloud, set up Tempo as a data source, and tried different settings to send data to my Grafana Cloud account, but I ran into some issues connecting to it.

I found out that the OTel Collector is better suited for processing, batching, and exporting data to most backends for later display. So, for testing purposes, I installed and configured the OTel Collector locally, updated my exporter configuration accordingly, and ran it. Although it was running, I couldn't send data to the collector due to connection-refused errors. I wanted to send data to the collector and from there to my Grafana Cloud account (I had updated the Collector config for that), but it didn't work.

Then, I pulled and ran Docker containers for OTel Collector, Grafana Tempo, and Grafana, ensured they were active, and added configurations for both Collector and Grafana Tempo in a YAML file. Even though everything was running, I still had trouble with connections or too many retries. I checked the URL, which was right, and tried other ways to fix it, but no luck. I can share the config details of Collector and Tempo here if required.

Next, I stopped all containers and updated the Collector configuration to target my Grafana Cloud trial account. Upon rechecking, I encountered a “404 page not found” error, indicating that while the connection was established, the URL did not yield any response, probably because Tempo doesn't have its own UI.

I integrated Tempo as a data source into my Grafana Cloud account to get data into Tempo and show it in the Grafana UI. Although Tempo was added as a data source for both live and local Grafana setups, I was still unable to export data to it.

Then, I attempted to export trace data directly to Tempo, bypassing the Collector, but I still received the same “page not found” error.

Now, I am seeking the actual URL for the Grafana Tempo service, which I can use to export trace data. This Tempo service should be configured as a data source in Grafana to visualize the traces. To proceed, I would require the following values:

With the correct settings, I hope to test my plugin with Grafana. Or, if possible, assistance with configuring Grafana locally would be appreciated. @blarghmatey @Ardiea
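As a side note, a minimal smoke test along these lines (assuming the default OTLP/HTTP port 4318 on a local collector) can help isolate whether the export path itself works before the real Grafana Cloud / Tempo endpoint and credentials are in place:

```python
# Sketch: send a single test span over OTLP/HTTP to a local collector (default
# port 4318). The endpoint would be replaced (and auth headers added) for
# Grafana Cloud / Tempo once those values are known.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

with trace.get_tracer(__name__).start_as_current_span("otel-export-smoke-test"):
    pass  # an empty span is enough to verify the export pipeline end to end
```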

Ardiea commented 7 months ago

Subtask for future devops work implementing this in our environments: https://github.com/mitodl/ol-infrastructure/issues/1948

pdpinch commented 4 months ago

@Ardiea @blarghmatey should this be closed now?