openzipkin / zipkin-support

repository for support questions raised as issues

Search for multiple tags with values stored in multiple docs loses results with Elasticsearch #32

Closed edwardxia closed 2 years ago

edwardxia commented 4 years ago

Describe the Bug

I found that Brave sometimes seems to flush a single span in multiple chunks, and when using the Elasticsearch backend it is stored in Elasticsearch as multiple docs (example below). The issue is searching with annotationQuery on two different tags: if I search "AAA=foo and BBB=bar", this works and returns the trace, because AAA and BBB are in the same doc. However, if I search "AAA=foo and DDD=bar", results are lost when tags AAA and DDD are stored in two separate docs, since neither doc matches both conditions.

curl -s 'localhost:9200/zipkin*span-2020-07-22/_search?q=id:191f6b2c6e5288df' | python -m json.tool
{
    "_shards": {
        "failed": 0,
        "skipped": 0,
        "successful": 5,
        "total": 5
    },
    "hits": {
        "hits": [
            {
                "_id": "191f6b2c6e5288df-3f36cc71ab5cce63ac01423a1284d5c8",
                "_index": "zipkin-span-2020-07-22",
                "_score": 15.733055,
                "_source": {
                    "id": "191f6b2c6e5288df",
                    "localEndpoint": {
                        "serviceName": "XXX"
                    },
                    "tags": {
                        "AAA": "foo",
                        "BBB": "bar",
                        "CCC": ""
                    },
                    "traceId": "191f6b2c6e5288df"
                },
                "_type": "_doc"
            },
            {
                "_id": "191f6b2c6e5288df-813619df916f997ac09d3d9da59d14e9",
                "_index": "zipkin-span-2020-07-22",
                "_score": 15.731844,
                "_source": {
                    "id": "191f6b2c6e5288df",
                    "localEndpoint": {
                        "serviceName": "XXX"
                    },
                    "tags": {
                        "DDD": "bar",
                        "EEE": "",
                        "FFF": "",
                        "GGG": "",
                        "HHH": ""
                    },
                    "traceId": "191f6b2c6e5288df"
                },
                "_type": "_doc"
            },
            {
                "_id": "191f6b2c6e5288df-5dc9b8953ac453d865adbdb435b1bce4",
                "_index": "zipkin-span-2020-07-22",
                "_score": 15.221283,
                "_source": {
                    "duration": 15951,
                    "id": "191f6b2c6e5288df",
                    "kind": "SERVER",
                    "localEndpoint": {
                        "serviceName": "XXX"
                    },
                    "name": "post",
                    "remoteEndpoint": {
                        "ipv4": "100.67.8.23"
                    },
                    "tags": {
                        "III": "",
                        "JJJ": "",
                        "KKK": "",
                        "LLL": "",
                        "http.method": "POST",
                        "http.path": ""
                    },
                    "timestamp": 1595401643738406,
                    "timestamp_millis": 1595401643738,
                    "traceId": "191f6b2c6e5288df"
                },
                "_type": "_doc"
            },
            {
                "_id": "191f6b2c6e5288df-1d6988e567d2a7abf8c508c958708e63",
                "_index": "zipkin-span-2020-07-22",
                "_score": 15.221283,
                "_source": {
                    "id": "191f6b2c6e5288df",
                    "localEndpoint": {
                        "serviceName": "XXX"
                    },
                    "tags": {
                        "MMM": "",
                        "OOO": "",
                        "PPP": ""
                    },
                    "traceId": "191f6b2c6e5288df"
                },
                "_type": "_doc"
            }
        ],
        "max_score": 15.733055,
        "total": {
            "relation": "eq",
            "value": 4
        }
    },
    "timed_out": false,
    "took": 1
}

Steps to Reproduce

Find a span that's stored as multiple docs in Elasticsearch, then use annotationQuery to search for two tags whose values are stored in separate docs; the trace will not show up in the results.
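
For concreteness, a search along these lines misses the trace (a sketch only, assuming a Zipkin server on the default localhost:9411 and the dummy tag names from the example above; the class name is hypothetical), while the same query with AAA and BBB finds it:

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class AnnotationQueryRepro {
  public static void main(String[] args) throws Exception {
    // Tags AAA and DDD live in separate Elasticsearch docs of the same span,
    // so this query returns no traces even though the span carries both tags.
    String query = URLEncoder.encode("AAA=foo and DDD=bar", StandardCharsets.UTF_8);
    URI uri = URI.create(
        "http://localhost:9411/api/v2/traces?serviceName=xxx&annotationQuery=" + query);

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(HttpRequest.newBuilder(uri).GET().build(), HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body()); // expected: the trace; actual: an empty list
  }
}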

Expected Behaviour

Searching for tags that are stored in different docs of the same span should still return the span.

codefromthecrypt commented 4 years ago

Hmm, this one is pretty nuanced. For example, service name constrains tags: the tag must be present on the same span as that service, IIRC. Following that, the current behavior could make sense, right?

edwardxia commented 4 years ago

@adriancole The issue is that all the docs you see in the example are actually one span: all the docs have the same spanId. If I look up this spanId in the API, what I get is a single span with the tags from all of those docs combined. I expect the search for "A and B" to work on a span (which could be multiple docs underneath in Elasticsearch), whereas currently "A and B" fails because A matches one doc, B matches another doc, and thus "A and B" matches nothing.

The part I'm not sure about is how it ended up as multiple docs in Elasticsearch with the same spanId. Other than the tags being modified to hide sensitive content, the example you see is what I got from a real system.

The instrumentation library is Brave 5.12 and I'm using the HTTP sender. Some of the tags are added using a request parser in HttpTracing, and some tags are added by the servlet using currentSpanCustomizer.
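
For reference, here is a minimal sketch of that kind of setup, assuming the standard Brave 5.12 HttpTracing builder; the class name, tag names and values are illustrative rather than the real ones:

import brave.SpanCustomizer;
import brave.Tracing;
import brave.http.HttpRequestParser;
import brave.http.HttpTracing;

public class TracingSetupSketch {
  // Tags added by the request parser are written while the server span is being started.
  static HttpTracing httpTracing(Tracing tracing) {
    return HttpTracing.newBuilder(tracing)
        .serverRequestParser((request, context, span) -> {
          HttpRequestParser.DEFAULT.parse(request, context, span); // keep the default http tags
          span.tag("AAA", "foo"); // in the real setup this value comes from a request header
        })
        .build();
  }

  // Tags added later by the servlet go through the current span customizer.
  static void inServlet(SpanCustomizer currentSpanCustomizer) {
    currentSpanCustomizer.tag("DDD", "bar");
  }
}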

edwardxia commented 4 years ago

In other words, maybe spans should not be flushed in multiple chunks? I imagine the flush behavior was designed for performance, among other reasons, so it's unlikely to be changed after all.

However, having multiple docs for a single span in ElasticSearch really breaks the annotationQuery.

I think one way to fix this issue is to "upsert" the docs in Elasticsearch. I'm not an expert in Elasticsearch and I don't know how hard it is, but regardless this will take a performance hit during data ingestion.

Another way is to change the query implementation to do something equivalent to a self-join in SQL. Again, I'm not sure how difficult that is or what the performance impact would be.
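
To illustrate the upsert idea only (this is not how zipkin-storage-elasticsearch writes today; the client, index name and document id below are assumptions taken from the example output above): writing one document per span id with a partial-doc upsert would let Elasticsearch merge the tag maps from the separate chunks.

import java.io.IOException;
import java.util.Map;

import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class SpanUpsertSketch {
  // Every chunk of the same span targets the same document id, so the "tags"
  // maps from separate chunks end up merged into one searchable document.
  static void writeChunk(RestHighLevelClient client, Map<String, Object> spanChunk)
      throws IOException {
    UpdateRequest request = new UpdateRequest("zipkin-span-2020-07-22", "191f6b2c6e5288df")
        .doc(spanChunk)     // partial update: object fields such as "tags" are merged
        .docAsUpsert(true); // the first chunk creates the doc, later chunks merge into it
    client.update(request, RequestOptions.DEFAULT);
  }
}

As the next comment notes, this trades write throughput for query correctness and would still have to account for client and server spans that legitimately share an id.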

codefromthecrypt commented 4 years ago

Upsert per span is the same as MySQL. Since chunks are rare this could make sense; we just have to watch out for the ID sharing of client+server spans, e.g. by naming the docs in such a way that upserting doesn't combine them.

In general upsert is a little risky though, so I'd default to waiting for others to chime in. I wanted first to give you some things to think about, which I think you are.. umm, thinking about :D

edwardxia commented 4 years ago

@adriancole You said chunks are rare, but my example is exactly chunks of a single server span, not a server + client span sharing an ID, and such chunking appears not to be so rare for us.

I'm not worried about searching across the server + client span case, which you think is more common, as in our implementation we only add tags to server spans. The problem I'm facing is exactly the "rare" (or not so rare) chunking.

codefromthecrypt commented 4 years ago

When I say chunk, I mean, for the same operation (e.g. the server side), one document containing most of the information (such as localEndpoint, timestamp and duration) and another document containing only tags (e.g. sent after the command finished).

I don't mean one document for the client and another document for the server; indeed, that's normal.

As this is becoming a wide-open question about behaviours, I'm switching it to the support forum to avoid spamming people. In general, let's use chat to elaborate, and of course there are people besides me who can help discuss this: https://gitter.im/openzipkin/zipkin

To make this concrete, you'll want to get a specific trace JSON and say literally which parts of it (real data) you expect to act differently.

codefromthecrypt commented 4 years ago

For example, the data in your trace looks like buggy use of a tracer, as the tags are sent in a separate doc from the rest of the span. By showing the actual client used and real data, not dummy tags, this can come to a fix. It is better not to chunk data for no reason.

codefromthecrypt commented 4 years ago

To put it a different way, it is better to fix the client than to try to accommodate merge nuance in ES. Not only will this surely make the load on ES higher, it will be ambiguous.

For example, even in our UI, identification is muddy: you can't always tell which side is the client or the server when tags are sent in separate chunks (e.g. are these tags with this spanId supposed to be attached to the client or the server?). Here's the code; as you can see it involves work and some guessing, which can easily be wrong: https://github.com/openzipkin/zipkin/blob/master/zipkin-lens/src/zipkin/span-cleaner.js#L220

The best way is to avoid the problem: do not send tags separately from the rest of the span.

edwardxia commented 4 years ago

I'm using Brave 5.12 and zipkin-reporter-brave with the HTTP sender (OkHttp), with the default tuning options.

The server span is created by DelegatingTracingFilter.

I cannot show the original tags, but I checked and saw the following: some tags (a few HTTP headers) are added via a RequestParser in HttpTracing used by the filter, and these end up in one doc; some tags are added with currentSpanCustomizer, and these somehow ended up in the other doc, the one that had the duration.

I guess the doc with the duration looks more like the flush of a finished span. The question then becomes: why did tagging with the RequestParser end up in a doc that was flushed early?

IMO this should be within normal usage of out-of-the-box functionality. If you think there is a bug in client usage, what might it be? I haven't looked into the details of how spans are buffered and sent; could it be a bug there?

codefromthecrypt commented 4 years ago

Yeah, try to get an exact trace and we can figure it out, especially knowing it is Brave. Our integration tests for the server filter show a single span. Also check whether it is Brave's filter or a separate impl like Sleuth, as that can help solve the mystery.

jorgheymans commented 4 years ago

Does this seem to be what https://github.com/openzipkin/zipkin/issues/1207#issue-168226015 was trying to get at?

edwardxia commented 4 years ago

It took a long time, but the root cause was the default inheritable ThreadContext getting messed up in a thread pool, causing all sorts of weird behavior.

In other words, it's not Elasticsearch being weird. It's all Brave.

I was able to find it only after finding a comment in the code: https://github.com/openzipkin/brave/blob/a6b5301e7ffc2132be0791771241a4e670998aac/brave/src/main/java/brave/propagation/CurrentTraceContext.java#L227-L243

@adriancole This definitely should be documented in a more obvious place.
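
For anyone else hitting this, here is a minimal sketch of the non-inheritable pattern that the linked comment recommends, assuming standard Brave APIs; the class name, pool and tag are stand-ins for the application's own:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import brave.Tracing;
import brave.propagation.CurrentTraceContext;
import brave.propagation.ThreadLocalCurrentTraceContext;

public class NonInheritableContextSketch {
  public static void main(String[] args) throws Exception {
    // Plain thread-local context: pooled threads never silently inherit a parent's span.
    CurrentTraceContext currentTraceContext = ThreadLocalCurrentTraceContext.create();

    Tracing tracing = Tracing.newBuilder()
        .localServiceName("xxx")
        .currentTraceContext(currentTraceContext)
        .build();

    // Propagate the context explicitly instead of relying on InheritableThreadLocal.
    ExecutorService pool =
        currentTraceContext.executorService(Executors.newFixedThreadPool(4));

    pool.submit(() -> {
      // This tag lands on the span (if any) that was current when the task was submitted.
      tracing.tracer().currentSpanCustomizer().tag("DDD", "bar");
    });

    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
    tracing.close();
  }
}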

codefromthecrypt commented 4 years ago

Sorry about that. Where would be more obvious? Let's fix it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openzipkin/zipkin-support/issues/32#issuecomment-675001340, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAPVV2DNAJMRXOZPMQRENTSBFP4ZANCNFSM4PGHXHTA .