vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Support sinking events to Pyroscope #21057

Open ryanartecona opened 1 month ago

ryanartecona commented 1 month ago

Use Cases

I'm trying to route Pyroscope profiling data through Vector, specifically through a Vector-to-Vector pair: a Vector agent in a source cluster forwarding to an "aggregator" in the destination cluster. Pyroscope's language-specific SDKs support two ingest methods: an HTTP POST API that accepts multipart/form-data uploads, and a gRPC Push service.

I have a client of each type:

  1. a Grafana Alloy agent collecting eBPF-based profiles, which are sent using the gRPC Push service, and
  2. a Node.js app running pyroscope-nodejs, which uses the multipart/form-data HTTP upload method.

In this case I don't need Vector to have an internal data model for profiles. I'd be happy if they were treated as Log events, with the contents being mostly an opaque binary payload (a gzipped pprof message, which is itself a protobuf) and a set of label names/values with a certain structure.
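
As a rough illustration of the event shape described above, here is a minimal Python sketch. `ProfileEvent`, its field names, and the label values are hypothetical and are not part of Vector or Pyroscope; the payload is stand-in bytes, not a real pprof message. The point is only that labels can be edited while the gzipped payload stays an untouched byte string.

```python
import gzip
from dataclasses import dataclass, field
from typing import Dict

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


@dataclass
class ProfileEvent:
    """Hypothetical log-style event: opaque payload plus string labels."""

    payload: bytes  # gzipped pprof; never decoded or re-encoded here
    labels: Dict[str, str] = field(default_factory=dict)

    def looks_gzipped(self) -> bool:
        # Cheap sanity check that the payload was not mangled in transit.
        return self.payload[:2] == GZIP_MAGIC

    def with_label(self, name: str, value: str) -> "ProfileEvent":
        # Return a copy with one more label; the payload bytes are reused as-is.
        return ProfileEvent(self.payload, {**self.labels, name: value})


# Stand-in bytes; a real payload would be a serialized pprof protobuf.
event = ProfileEvent(gzip.compress(b"not a real pprof"), {"service_name": "checkout"})
tagged = event.with_label("cluster", "source-a")
```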

Attempted Solutions

With some creative config, I could get both a gRPC source and an HTTP multipart upload source working. I was unable to get either a gRPC sink or an HTTP upload sink working, though, and that is my main blocker.

gRPC source

Somewhat surprisingly, I was able to get Vector to receive the gRPC Push messages by using a type: http_server source with protobuf decoding, like below.

The proto descriptor file comes from running this in the pyroscope repo:

  protoc -Iapi -o pyroscope_push_v1.desc api/push/v1/push.proto --include_imports

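  sources:
    pyroscope_grpc_push_raw:
      type: http_server
      address: '0.0.0.0:4050'
      # no delimiter-based splitting of the request body
      framing:
        method: bytes
      # decode the body as a push.v1.PushRequest using the descriptor file
      decoding:
        codec: protobuf
        protobuf:
          desc_file: /etc/vector-proto/pyroscope_push_v1.desc
          message_type: "push.v1.PushRequest"
      # accept any path that starts with the configured path, not only an exact match
      strict_path: false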

HTTP multipart source

I struggled to get this working, but eventually managed it with some hacks. A separate type: http_server source (below) captures the Content-Type header, which contains the multipart boundary token (e.g. Content-Type: multipart/form-data; boundary=---------abcd1234). Some hacky VRL then does crude multipart parsing and pulls out the binary profile payload (a gzipped protobuf). The main friction is that some of VRL's string manipulation functions, namely split(), force a lossy UTF-8 conversion under the hood, which corrupts the gzip payload. Working around that makes the parser even cruder, but it is at least possible by using find() and slice() instead of split().

  sources:
    pyroscope_ingest_raw:
      type: http_server
      address: '0.0.0.0:4051'
      framing:
        method: bytes
      decoding:
        codec: bytes
      strict_path: false
      # capture known query params and headers used by the pyroscope sdk
      query_parameters:
        - from
        - until
        - name
        - spyName
        - sampleRate
        - format
        - units
        - aggregationType
      headers:
        - Content-Type
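
To illustrate the find-and-slice approach described above, here is a minimal sketch in Python rather than VRL. It is not the VRL from this setup; `extract_first_part`, the `profile` field name, and the sample body are assumptions made for the example. It only handles the first part of a well-formed upload, and it also shows why a lossy UTF-8 round trip is unsafe for a gzip payload.

```python
import gzip


def extract_first_part(body: bytes, content_type: str) -> bytes:
    """Return the raw bytes of the first multipart part, without decoding them."""
    # Pull the boundary token out of the Content-Type header value.
    marker = "boundary="
    start = content_type.find(marker)
    if start == -1:
        raise ValueError("no boundary in Content-Type")
    boundary = content_type[start + len(marker):].split(";")[0].strip().strip('"')
    delimiter = b"--" + boundary.encode("ascii")

    # Find the first delimiter, then the blank line that ends the part headers.
    part_start = body.find(delimiter)
    if part_start == -1:
        raise ValueError("boundary not found in body")
    headers_end = body.find(b"\r\n\r\n", part_start)
    if headers_end == -1:
        raise ValueError("part headers not terminated")
    content_start = headers_end + 4

    # The content runs up to the CRLF that precedes the next delimiter.
    content_end = body.find(b"\r\n" + delimiter, content_start)
    if content_end == -1:
        raise ValueError("closing boundary not found")
    return body[content_start:content_end]


# Build a sample upload; the payload is stand-in bytes, not a real pprof.
boundary = "---------abcd1234"
content_type = f"multipart/form-data; boundary={boundary}"
profile = gzip.compress(b"stand-in for a pprof message")
body = (
    b"--" + boundary.encode("ascii") + b"\r\n"
    b'Content-Disposition: form-data; name="profile"; filename="profile.pprof"\r\n'
    b"Content-Type: application/octet-stream\r\n\r\n"
    + profile
    + b"\r\n--" + boundary.encode("ascii") + b"--\r\n"
)

extracted = extract_first_part(body, content_type)

# A lossy UTF-8 round trip replaces invalid bytes (such as the 0x8b in the
# gzip header) with U+FFFD, so the result no longer matches the original.
lossy = body.decode("utf-8", errors="replace").encode("utf-8")
```

Working on byte offsets throughout means the payload is never interpreted as text, which is the property the find() and slice() workaround relies on.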

gRPC sink

I couldn't get a gRPC sink working at all. I can successfully re-encode a gRPC Push message using encode_proto(), but a type: http sink sends it over HTTP/1.1, which the Pyroscope gRPC server doesn't accept.
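
For context on what a gRPC sink would need beyond the encoded protobuf: in the gRPC over HTTP/2 protocol, each message is prefixed with a 1-byte compressed flag and a 4-byte big-endian length, and the request is sent over HTTP/2 with content-type application/grpc. The Python sketch below covers only that message framing. `grpc_frame` and `grpc_unframe` are hypothetical helper names, and nothing here is a claim about how Vector would implement such a sink.

```python
import struct


def grpc_frame(message: bytes, compressed: bool = False) -> bytes:
    """Wrap one serialized protobuf message in gRPC's length-prefixed framing."""
    # 1-byte compressed flag, 4-byte big-endian length, then the message bytes.
    return struct.pack(">BI", 1 if compressed else 0, len(message)) + message


def grpc_unframe(frame: bytes) -> bytes:
    """Strip the 5-byte prefix from a single, complete gRPC frame."""
    if len(frame) < 5:
        raise ValueError("frame shorter than the 5-byte prefix")
    flag, length = struct.unpack(">BI", frame[:5])
    if flag not in (0, 1):
        raise ValueError("invalid compressed flag")
    if len(frame) - 5 != length:
        raise ValueError("length prefix does not match payload size")
    return frame[5:]


# Stand-in bytes for the output of an encode_proto()-style step.
framed = grpc_frame(b"abc")
```

The HTTP/2 transport, the request path, and the trailers carrying grpc-status are out of scope for this sketch.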

HTTP upload sink

The Pyroscope HTTP Ingest API will accept either a multipart/form-data upload, like the nodejs SDK sends, or just a simple POST with the pprof profile as the request body. However, in both cases, it expects metadata including service name and labels in the form of URL query params, which means those have to be dynamically generated per Log event from Vector's perspective. Vector currently doesn't support dynamic values in the uri: field of the HTTP sink, and there's no way to specify query params separately (like headers:).
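
To make the per-event requirement concrete, here is a minimal Python sketch of the URL such a sink would have to build for each event. The query parameter names are the ones captured by the http_server source config earlier in this issue. The /ingest path, the name{label=value} syntax, the event field names, and the host are assumptions made for the example and should be checked against the Pyroscope documentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Query parameters captured from the SDK requests on the source side.
PASSTHROUGH = ("from", "until", "spyName", "sampleRate", "format", "units", "aggregationType")


def ingest_url(base: str, event: dict) -> str:
    """Build a per-event ingest URL (hypothetical event layout)."""
    name = event["service_name"]
    labels = event.get("labels", {})
    if labels:
        # Assumed syntax: service name followed by {key=value,...}.
        pairs = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
        name = f"{name}{{{pairs}}}"

    params = {"name": name}
    for key in PASSTHROUGH:
        if key in event:
            params[key] = event[key]

    # urlencode percent-encodes the braces, commas, and equals signs in name.
    return f"{base.rstrip('/')}/ingest?{urlencode(params)}"


url = ingest_url(
    "http://pyroscope.example:4040",  # hypothetical host
    {
        "service_name": "checkout",
        "labels": {"env": "prod", "cluster": "source-a"},
        "from": 1700000000,
        "until": 1700000010,
        "format": "pprof",
    },
)
query = parse_qs(urlsplit(url).query)
```

Because the name and time range differ for every event, a static uri: value cannot express this, which is why templating or a separate query parameter option would be needed.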

Proposal

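On the source side:

  • a decode_multipart_form_data() VRL function would be hugely helpful. It's not a hard blocker, as I was able to roll my own crude parser in VRL, but I'd love to be able to delete that code and use something built into VRL.
  • a specific type: pyroscope_grpc source might have been nice, but not a huge deal as a type: http source with a protobuf encoding seems to work

On the sink side:

  • Adding a type: grpc sink would be ideal. If Vector had a generic gRPC sink, I could use that for both source types and just restructure the payloads to fit the schema.
  • If a gRPC sink can't be added or would take longer, supporting dynamic uri: field and/or query_parameters: field with dynamic values in the HTTP sink would suffice.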

References

No response

Version

0.40.0

jszwedko commented 1 month ago

Thanks for this detailed feature request, @ryanartecona!

Given you say that you'd be happy if Vector treated the incoming data as opaque, I'm wondering what you plan to use Vector to do with the data? Are you intending to just "proxy" the requests?

> On the source side:
>
>   • a decode_multipart_form_data() VRL function would be hugely helpful. It's not a hard blocker, as I was able to roll my own crude parser in VRL, but I'd love to be able to delete that code and use something built into VRL.

This seems like a reasonable addition. I could also see enhancing the http_server source to be able to handle multi-part data as a first-class concept (though I'm not sure exactly what this would look like).

>   • a specific type: pyroscope_grpc source might have been nice, but not a huge deal as a type: http source with a protobuf encoding seems to work

Agreed. I could see it being useful to add for discoverability, but it seemingly could be a simple wrapper around the http_server source.

> On the sink side:
>
>   • Adding a type: grpc sink would be ideal. If Vector had a generic gRPC sink, I could use that for both source types and just restructure the payloads to fit the schema.

Agreed. I'm not sure if it is possible to create a dynamic gRPC sink in Rust though. The existing sinks that use gRPC use code generation. It seems like something should be doable using prost_reflect though.

>   • If a gRPC sink can't be added or would take longer, supporting dynamic uri: field and/or query_parameters: field with dynamic values in the HTTP sink would suffice.

Agreed, these would be useful in their own right. Related issues:

ryanartecona commented 1 month ago

> Given you say that you'd be happy if Vector treated the incoming data as opaque, I'm wondering what you plan to use Vector to do with the data? Are you intending to just "proxy" the requests?

Mostly, yes. We also use Vector for a few extras, such as tag insertion, which would be convenient to do in VRL for these profiles too.

> I could also see enhancing the http_server source to be able to handle multi-part data as a first-class concept (though I'm not sure exactly what this would look like).

Even better! I like it.

> Agreed. I'm not sure if it is possible to create a dynamic gRPC sink in Rust though. The existing sinks that use gRPC use code generation. It seems like something should be doable using prost_reflect though.

Ohh, that's unfortunate. I was hoping it would be an easier addition from existing pieces, since I knew the vector source/sink components existed, but I forgot about the gRPC codegen part.

Thanks for linking those other issues. I had seen #201 but not #6759. Upvoted.

Should I file other issues for any of those specific pieces?

jszwedko commented 1 month ago

> Should I file other issues for any of those specific pieces?

I think it'd be reasonable to open separate issues for: