Support optional query param "spanKind" when get operations for given service

jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform

https://www.jaegertracing.io/

Apache License 2.0

20.36k stars 2.43k forks source link

Support optional query param "spanKind" when get operations for given service #1920

Open guo0693 opened 4 years ago

guo0693 commented 4 years ago

Requirement - what kind of business use case are you trying to solve?

We want to filter the operations of a service by span kind, those filtered operations can be used for features e.g. get dependencies graph for service:operation pair from server side (spanKind == "server")

Problem - what in Jaeger blocks you from solving the requirement?

Currently we only save operations indexed by service name and we can only query operations by service name. We will need to add extra info: spanKind when saving operations for a service.

Proposal - what do you suggest to solve the problem or improve the existing situation?

API changes: Restful:

GET /api/services/{serviceName}/operations : We will keep this endpoint unchanged for backward compatibility, it will still return all operations as a list of string for a given service.
GET /api/operations?service={service}&spanKind={spanKind} We will update this endpoint to return list of operation metadata for a given service whose spanKind matched the query parameter. If spanKind is empty, then return all operations of a service. Sample response:
```
{
"data": [
{
  "name": "A",
  "spanKind": "client"
},
{
  "name": "B",
  "spanKind": "server"
}
],
"total": 2,
"limit": 0,
"offset": 0,
"errors": null
}
```

GRPC: rpc GetOperations(GetOperationsRequest) returns (GetOperationsResponse) Add span_kind to the rpc request and response

message GetOperationsRequest {
  string service = 1;
  string span_kind = 2;
}

message Operation{
    string name = 1;
    string span_kind = 2;
}

message GetOperationsResponse {
  repeated OperationMeta operations = 1;
}

Interfaces: spanstore:

...
// Get operations by service and spanKind, empty spanKind means get operations for all kinds of span
GetOperations(ctx context.Context, query *spanstore.OperationQueryParameters) ([]*storage_v1.OperationMeta, error)
..

Span reader will need to take extra parameter to query operations by service name and span kind. All the storage plugins needs to be updated so that:

write operation index with service name and span kind
query operation index by service name and span kind, return all kinds of operations when span kind is empty.

Task List

[x] #1921 Update query service using the new request & response format (restful & grpc)
[x] #1921 Update cassandra & memory storage to support spanKind for read/write operation index
[ ] #1922 Update badger to support spanKind for read/write operation index
[ ] #1923 Update elasticsearch to support spanKind for read/write operation index
[ ] https://github.com/jaegertracing/jaeger-ui/issues/492 Update UI to use new endpoint to get a list of operations
Any open questions to address

burmanm commented 4 years ago

Couple of questions:

What happens to old data? It can't be queried this way if the information doesn't exist.
What defines the kind? The writing client? I don't see any links to the client changes, so is this something coming from for example OpenTelemetry?
Why not just a tag?

And of course tasteful thing, but "spanKind" inside Process doesn't really feel descriptive to me. Also, what is a server and what is a client? Is a request made by the first server to next server a client or a server?

objectiser commented 4 years ago

@guo0693 Agree with @burmanm - why just restrict to spanKind, when can support a list of tags, of which span.kind could be one that is queried?

guo0693 commented 4 years ago

Couple of questions:

What happens to old data? It can't be queried this way if the information doesn't exist.

What defines the kind? The writing client? I don't see any links to the client changes, so is this something coming from for example OpenTelemetry?

Why not just a tag?

And of course tasteful thing, but "spanKind" inside Process doesn't really feel descriptive to me. Also, what is a server and what is a client? Is a request made by the first server to next server a client or a server?

For existing data, spanKind will be empty and should be queried by "GET /api/operations?service={service}". If necessary, we can create migration script for specific storage plugin to manually add "" spanKind for each operation. We use cassandra at our company and when we update table schema to add a new column, no extra action needs to be taken.
span writer will get spanKind from "span.kind" tag, if the tag is missing, then just set to empty string. And yes, users who instrument their code using jaeger-client can define spanKind when create a span.
It's value comes from span.kind tag. Thinking from schema point of view:
```
service_name | operation_name | span_kind
--------------------------------------------
"serviceA" | "operationA"|"server"
```
Are you proposing we should have below schema?
```
service_name | operation_name | tag
--------------------------------------------
"serviceA" | "operationA"|"span.kind=server"
```
@objectiser currently the UI only need span.kind to get a list of server operations to enrich dropdown list in the UI. Not sure if you also suggest a tag column with possible value to be "{tagkey}={tagvalue}", and in the span writer side maybe define a list of indexableOperationTags, for each of those indexable tag in a span, insert a row in the table. "serviceA" | "operationA"|"{tag1=val1}" "serviceA" | "operationA"|"{tag2=val2}"
span.kind doesn't have a clearly defined restrictions. User can define their kind too, e.g. "local", "test" or whatever. But we have some instrumented frameworks e.g. grpc/tchannel/http which will generate spans for example: client1 --tchannel--> server1:operation1 --tchannel-> server2:oepration2
tchannel clientSpan from client1
tchannel serverSpan from server1
tchannel clientSpan from server1
tchannel serverSpan from server2

objectiser commented 4 years ago

@guo0693 Yes I think it would be better (more general) if was a tag at least (maybe not list for now) - rather than just implementing something for this specific use case.

yurishkuro commented 4 years ago

There are certain arguments in favor of doing this as a general extension with arbitrary tags, but there are also a lot of arguments against that:

Do we know of any real use cases when tags other than spanKind would be useful? Remember that this specific API is primarily used to populate the list of operations for a given service. We don't have any plans to support any search capabilities over that data set
The operationName corresponds to a class of spans (which is why we always insist on it being low cardinality). The span.kind tag belongs to the same category, but many other tags do not, instead they describe a specific instance of that class of spans (e.g. error=true tag).
span.kind is prescribed by the OpenTracing specification to be always recorded, as a way of classifying the spans. Most other tags are not, or they are specific to certain types of spans, like http.* tags. OpenTelemetry goes even further and requires span kind as a span constructor argument (same as span name).
Because span kind is so standardized, it can be included in the index automatically. For any other tag, we would need user-controlled configuration to tell us which tags to include (error=true would never make sense in that index).
Implementing the storage of this data with a list of tags vs. a top-level spanKind field is a lot more complicated. For example, the writers have a dedicated cache used to avoid writing operation names all the time, since it's a fairly static data set. With a list of tags this cache gets more complicated.

Finally, the way the PR itself is shaping, we're extending the operation name in the index from a simple string to a struct. This will allow adding arbitrary tags functionality in the future if the compelling use case is found. My preference would be to wait for such use case to materialize before paying the cost of higher complexity, and only implement support for span kind now as a dedicated field, because of a special role it plays.

burmanm commented 4 years ago

Based on the support tickets for badger backend, a lot of them have been about tags searching. Sadly most of them have included censored tags, so the actual names are not known. But they have included multi-tags searching. Also, why wouldn't error=true not make sense? I think it sounds like a great filtering when you're interested in traces that end up in failure.

To me this ticket sounds like a limitation of Cassandra backend - and that leaking to the generic API feels weird. All the tags are currently indexed in Badger backend and I would imagine in the ES also. If the idea is to limit the search space to only configured list of tags, why not make that instead?

guo0693 commented 4 years ago

Based on the support tickets for badger backend, a lot of them have been about tags searching. Sadly most of them have included censored tags, so the actual names are not known. But they have included multi-tags searching. Also, why wouldn't error=true not make sense? I think it sounds like a great filtering when you're interested in traces that end up in failure.

To me this ticket sounds like a limitation of Cassandra backend - and that leaking to the generic API feels weird. All the tags are currently indexed in Badger backend and I would imagine in the ES also. If the idea is to limit the search space to only configured list of tags, why not make that instead?

To query traces that end up in failure is a valid query, which I believe currently searching traces by tags is already supported.

But this issue is searching operations based on service and spanKind. We have an ask from UI to only return service-operations from server spans so that when user select a service from the DDG page the operations dropdown list only shows server operations.

burmanm commented 4 years ago

Is this more related to the dependency structure then? Or what's the point of searching only the operations without traces?

On Wed, Nov 13, 2019, 23:10 Jun Guo notifications@github.com wrote:

Based on the support tickets for badger backend, a lot of them have been about tags searching. Sadly most of them have included censored tags, so the actual names are not known. But they have included multi-tags searching. Also, why wouldn't error=true not make sense? I think it sounds like a great filtering when you're interested in traces that end up in failure.

To me this ticket sounds like a limitation of Cassandra backend - and that leaking to the generic API feels weird. All the tags are currently indexed in Badger backend and I would imagine in the ES also. If the idea is to limit the search space to only configured list of tags, why not make that instead?

To query traces that end up in failure is a valid query, which I believe currently searching traces by tags is already supported.

But this issue is searching operations based on service and spanKind. We have an ask from UI to only return service-operations from server spans so that when user select a service from the DDG page the operations dropdown list only shows server operations.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jaegertracing/jaeger/issues/1920?email_source=notifications&email_token=AAEJLDVE6IEZUOLA24LJ7D3QTRULPA5CNFSM4JMLOU22YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOED7VEOA#issuecomment-553603640, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEJLDWDGGDYQKHNBDBSIJLQTRULPANCNFSM4JMLOU2Q .

yurishkuro commented 4 years ago

@burmanm I cannot find this ticket now, but it's somewhere in this repo or the UI where people complained

"why do you I see all those weird span names in the Operations dropdown on the Search page? I only want to see the endpoints of my service, not all the internal / client span names".

That's what this feature is fixing - to be able to separate operations into "endpoints" vs. "the rest".

objectiser commented 4 years ago

My main objection was related to API design/consistency and mapping onto the model - service, operation and tags are first class concepts, whereas span.kind is not (i.e. it is a specific type of tag).

However given the transition to OpenTelemetry, and that span kind is a top level span field now (instead of an attribute/tag), I am ok with the approach.

yurishkuro commented 4 years ago

@guo0693 I don't think we addressed one design point, I thought I raised it somewhere in the comments.

Do we really need to support query by kind in the storage? Yes, storage needs to return the spanKind, but why can't it just return ALL operations for a service? If we did not have this requirement, then the gRPC API change could be 100% backwards compatible, the existing storage plugins would return a list of op names in field 1, and after they are upgraded they would start returning the list of Operation structs in a new field 2. And there would be a lot lest implementation changes in the storage engines, only to insert and retrieve an extra field, but not build any additional filtering. The filtering can be done by the QuerySvc.

guo0693 commented 4 years ago

Yes, storage needs to return the spanKind, but why can't it just return ALL operations for a service? If we did no

As of now, cassandra implementation did the filtering in storage layer using cql . If badger & ES does not support filtering in db query, we can put the filtering logic in a util class and retrieve operations of all span.kind from db and apply the filtering

yurishkuro commented 4 years ago

Please add a ticket & coordinate making the UI change to use the new endpoint on the Search screen and display the span kind. I recommend adding it as a prefix, e.g. (server) getUser, and ordering the entries in the dropdown so that server and consumer spans are shown first.

Also add a ticket to remove the old endpoint once the UI is upgraded.

jaegertracing / jaeger

Support optional query param "spanKind" when get operations for given service #1920

Requirement - what kind of business use case are you trying to solve?

Problem - what in Jaeger blocks you from solving the requirement?

Proposal - what do you suggest to solve the problem or improve the existing situation?

Any open questions to address