grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

otlp: Mimir's OTLP endpoint to return marshalled proto bytes as response body #8185

Closed jesusvazquez closed 5 months ago

jesusvazquez commented 5 months ago

Describe the bug

The handler that wraps around the OTLP endpoint on distributors replies with a plain string in the body. That is the normal behavior for all Mimir endpoints, but it's not what the OTel spec defines.

In https://opentelemetry.io/docs/specs/otlp/#failures-1 we can see that:

If the processing of the request fails, the server MUST respond with appropriate HTTP 4xx or HTTP 5xx status code. See the sections below for more details about specific failure cases and HTTP status codes that should be used.

The response body for all HTTP 4xx and HTTP 5xx responses MUST be a Protobuf-encoded Status message that describes the problem.

Since Mimir only supports OTLP over HTTP, we can go to the otlphttpexporter in the collector and confirm that the client expects this marshaled proto of the Status struct here:

        // Decode it as Status struct. See https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md#failures
        respStatus = &status.Status{}
        err = proto.Unmarshal(respBytes, respStatus)
        if err != nil {
            return nil
        }

Now, because Mimir replies with a plain string instead of what the client and the spec expect, the proto.Unmarshal above fails and the collector drops the details. When collectors write to Mimir and hit a 400 error, Mimir's entire error message therefore does not appear in the collector logs, making it very hard to troubleshoot the original issue:

Exporting failed. Dropping data. {"kind": "exporter", "datatype": "metrics", "name": "otlphttp", "error": "not retryable error: Permanent error: error exporting items, request to https://otlp-gateway-prod-us-east-0.grafana.net/otlp/v1/metrics responded with HTTP Status Code 400", "droppeditems": 505}

To Reproduce

Steps to reproduce the behavior:

  1. Start latest Mimir
  2. Write a metric through the otlp endpoint with an attribute value longer than Mimir's 2048 byte limit.
  3. You'll see that the response body is a plain string instead of marshalled proto bytes.

Example script to reproduce this behavior

now=$(date +%s)

data="
{
    \"resourceMetrics\": [
      {
        \"resource\": {
          \"attributes\": [
            {
              \"key\": \"service.name\",
              \"value\": {
                \"stringValue\": \"amuwpkjykzrnguttaduwkqrdyiwhbpuaqjrgexjzzcwrqvnukgtwhrwacfmqgwvvrhpvqqieahfzjcherahydcqcprwcqxeytmbhrkaauxccgcqrvmpyvbmijawiizftmrktukfyqwmnuxxbrjtifigqrhyhcyaxwppkidvvjjaqevepxefbxpyezumyqyecttnvprruqyjrcarhwbtjqpwqpewmdfmexuthdgegrpnbnjdqxmgguyzmqtahgfmedctgzjapfpgtdmmgjikgmqvybpaezvveergkqqqjaymkxnhezhbrrhyypygvjvjpmekivhhfvftbhwzyxbrbemxyhwfpefzxmzhafzjnwyjgtkmcfdbveqagmkvcnfpexcaqamzvntptdfqvpdbcdbwgpdcrdmendmpyyerfurdhjhvtpfkvcnjqbvuyqpaqkiedkavtmnvznxhqynivriymuevdhdyauviahybgbnbudxrhybfbuabrjazgmmwjzgpwhnurtfaqqkgrvpyqexvggmwxqznhwxxfgeyewixaafynbmytijqyarkahwhghbujakdrarkyjxpjnfrxgnmvgvviikhfrgqxmkufxgxijrnuezefiwibqwpxjcqgqaafawwhmbbyynkbhadpjmmytrqtcwepnpfpmktxmnqjcradypktaupyikbcfccayxdxkvjuhkuiqxkahmwwxeqtnxhknzwuxrewnimwpavqtxxbqiytxityvfxawbiqzqphfkxgrkcwbdhepdygpyjhnpcagfqvepfkpywmbehqdbtvbqzbjjhgpgcejhyvceibtkhyxvrzmvghepubvdhvxuytdewuggypwarqzmujhgvaggzhyvmfnpzkwfkenhrjkruafyrivjzajntaykgmuccxrvrgtqtwfixnbpiyfcgdhxvenjguwxgghjixuhvdukjyamrtwazcbqixeywnbthycuafhahzkfqqwujkpizkcgcdivyxjjgwqixfbyrwpmfdbakrpqfzhamizbirmfjajkncjufiwveigvvzzrqbkebayfpgerjcjvwzuvxuwtxradqugizdubwupcvtqtvprwkwdpecnunuhmgnwvtmkcvrhmeanmccccdzhbprfijxdaykbizzmmyckatdmkxrpfgjndwymvabctfjmbimrjrpyqiiygeidcadgrhucqegwjqthbbewqutmjkcdwedzpidjhrjnikjkbbuebfrjmkkcupagxbmqjvtytuinbzupzqianuxgjahrfhkymuucamhfqbqcerzzbwfzikcryigzthehvajzmcecvbhzcuripfhaekknzevipdwuchatzpjnchrzbfihdiubtyxkqhxzbwkjffzygrhnbfgcnhxfayzeduxvnndktwhxqgbtumphrnxncqqedbyceupniktpytejicqpbpvhkqewegztrfigzguygidpbiguvydnkkbknxxzzjcfaayaihxampaknbjmvcndifdunkgymuwmcxcjhpfdzhnkvtcgexxmkhkmikcwnqimxbvijmkhfmurwneedibzbywxppvwukphpeazuchiqxyhavnjbjpibaxnxfaxgkvrdvjvtunnahexgjjexzkntzzcwuazfebregacfibknkyqqkywjxrmadnfwgzpftyigzgxdhgwdcdjqgnkhrxznbwcdgugemffdfvruivjjazuhbrhyxeuegyynmggcxjyxyfzuggzwvfqedgqngpkfkwrvbpqwywjzbifdquzhqtujqvrwhzpqvakbybaxfhwecajrhwhibzukjxymabcdwxymxtznjipfnmwqczddxrvxaktcbnhvwnxjrcfzeabiqyecqnuqberahtabivwkrvdkqjpevxmkejynuxwbmdadbtzxejfuxrvenemubqtrihuzwjakjrpgexwixgutdvcqkdtfpbujiayibfvhfdmtucvahfjhfvdpfjbgumgkafkyprgyjirvzhfaziyqahw\"
              }
            }
          ]
        },
        \"scopeMetrics\": [
          {
            \"metrics\": [
              {
                \"name\": \"your_test_metric\",
                \"unit\": \"s\",
                \"description\": \"\",
                \"gauge\": {
                  \"dataPoints\": [
                    {
                      \"asInt\": 1,
                      \"timeUnixNano\": ${now}000000000
                    }
                  ]
                }
              }
            ]
          }
        ]
      }
    ]
  }
"

curl \
  -k \
  -i \
  -XPOST \
  -H 'Content-Type: application/json' \
  -u "${tenant}:${apikey}" \
    https://mimir-address:port/otlp/v1/metrics -d "${data}"

And here's an example run:

❯ bash send-metric-that-fails.sh
HTTP/2 400
content-length: 668
content-type: text/plain; charset=utf-8
date: Mon, 27 May 2024 09:26:09 GMT
server: envoy
x-content-type-options: nosniff
x-envoy-upstream-service-time: 104
via: 1.1 google
alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000

received a series whose label value length exceeds the limit, label: 'job', value: 'amuwpkjykzrnguttaduwkqrdyiwhbpuaqjrgexjzzcwrqvnukgtwhrwacfmqgwvvrhpvqqieahfzjcherahydcqcprwcqxeytmbhrkaauxccgcqrvmpyvbmijawiizftmrktukfyqwmnuxxbrjtifigqrhyhcyaxwppkidvvjjaqevepxefbxpyezumyqyecttnvprru' (truncated) series: 'jesus_test_metric_seconds{job="amuwpkjykzrnguttaduwkqrdyiwhbpuaqjrgexjzzcwrqvnukgtwhrwacfmqgwvvrhpvqqieahfzjcherahydcqcprwcqxeytmbhrkaauxccgcqrvmpyvbmijawiizftmrktukfyqwmnuxxbrjtifigqrhyhcyaxwppkidvvj' (err-mimir-label-value-too-long). To adjust the related per-tenant limit, configure -validation.max-length-label-value, or contact your service administrator.

Expected behavior

The endpoint should reply with marshaled proto bytes, and then we should expect to see the concise Mimir errors in the collector logs.
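
Once that's in place, one way to sanity check it from the reproduction script would be to save only the response body (for example by dropping -i and adding -o body.bin to the curl call) and decode it. A minimal sketch, assuming the body is a protobuf-encoded google.rpc.Status as the spec requires (the file name is arbitrary):

    package main

    import (
        "fmt"
        "io"
        "os"

        "google.golang.org/genproto/googleapis/rpc/status"
        "google.golang.org/protobuf/proto"
    )

    // Reads a protobuf-encoded google.rpc.Status from stdin and prints its
    // code and message, e.g. `go run decode_status.go < body.bin`.
    func main() {
        respBytes, err := io.ReadAll(os.Stdin)
        if err != nil {
            panic(err)
        }
        respStatus := &status.Status{}
        if err := proto.Unmarshal(respBytes, respStatus); err != nil {
            panic(fmt.Errorf("body is not a protobuf-encoded Status: %w", err))
        }
        fmt.Printf("code=%d message=%q\n", respStatus.GetCode(), respStatus.GetMessage())
    }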

cyrille-leclerc commented 5 months ago

As our recommended OTel ingestion architecture relies on batching metrics, spans, and logs with the OpenTelemetry Collector Batch Processor (default send_batch_size=8192) and then exporting them with the OTel Collector OTLP HTTP Exporter, shall we verify that the error message is human readable in the OpenTelemetry Collector logs?

aknuds1 commented 5 months ago

@ying-jeanne and I are picking this up :) @jesusvazquez and I are leaning towards writing a new distributor.handler function specifically for the OTLP endpoint, since it will probably lend itself to easier maintenance than sharing a handler with the normal remote write endpoint.

This OTLP specific handler should respond with Status protobuf messages for error cases.
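
Not committing to the final implementation here, but a rough sketch of what the error path of such a handler could look like. writeOTLPError and its placement in the distributor package are hypothetical names for this sketch; the Status type comes from google.golang.org/genproto/googleapis/rpc/status:

    package distributor // hypothetical placement for the sketch

    import (
        "net/http"

        "google.golang.org/genproto/googleapis/rpc/status"
        "google.golang.org/protobuf/proto"
    )

    // writeOTLPError replies to a failed OTLP push with a protobuf-encoded
    // google.rpc.Status instead of the plain text that http.Error would write,
    // matching what the otlphttpexporter snippet above tries to decode.
    func writeOTLPError(w http.ResponseWriter, httpCode int, grpcCode int32, msg string) {
        body, err := proto.Marshal(&status.Status{Code: grpcCode, Message: msg})
        if err != nil {
            // If marshaling somehow fails, fall back to the current plain-text behavior.
            http.Error(w, msg, httpCode)
            return
        }
        w.Header().Set("Content-Type", "application/x-protobuf")
        w.WriteHeader(httpCode)
        _, _ = w.Write(body)
    }

A helper along these lines could then be called from the OTLP-specific handler's error paths with the same HTTP status code the shared remote write path uses today, so only the body encoding changes.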