jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0
20.52k stars 2.44k forks source link

Certain Jaeger tag searches on badger backed storage result in "No trace results. Try another query." despite the entry existing in badger. #2451

Closed Heraclitus closed 4 years ago

Heraclitus commented 4 years ago

VERSION: v1.14.0 EXPECTATION: single trace response in UI for tag search guid:x-request-id=628d36f4-9072-c587-3b00-804c10a63000 image ACTUAL: No trace results. Try another query. SETUP:

  1. To easily get data populated I modified this file plugin/storage/badger/spanstore/writer.go to add tags to any span write that occurred. I added the code below to method func (w *SpanWriter) WriteSpan(span *model.Span) error insert at line 73 (with one import for uuid) based on commit commit fb5505005a21f007792dedbcd2ad49484d1d587e (HEAD, tag: v1.14.0)
    genId,_ := uuid.GenerateUUID()
    runes := []rune(genId)
    span.Tags = append(span.Tags,
        model.KeyValue{Key: "guid:x-request-id", VStr: genId },
        model.KeyValue{Key: "xrequestid", VStr: genId },
        model.KeyValue{Key: "request", VStr: string(runes[1:7]) },
    )

NOTE in some rare cases this change alone was enough to allow search to work, however it was highly inconsistent

  1. the variations xrequestid and request were just experimentation, not essential.
  2. I then ran github.com/jaegertracing/jaeger/cmd/all-in-one with the following env vars to activate badger
    export BADGER_EPHEMERAL="false"
    export SPAN_STORAGE_TYPE="badger"
    export BADGER_DIRECTORY_VALUE="/depot/github/src/github.com/jaegertracing/jaeger/cmd/all-in-one/badger/data"
    export BADGER_DIRECTORY_KEY="/depot/github/src/github.com/jaegertracing/jaeger/cmd/all-in-one/badger/key"
  3. I ran this test cmd/all-in-one/all_in_one_test.go method func TestAllInOne(t *testing.T) (as is; without modification) in order to prime the jaeger instance with data.
  4. I was able to browse and see the traces with forced populated span tags as the code above indicates and the image here shows > PROOF ENTRY EXISTS: image
  5. I then confirmed that searching for that tag did not return results.

CODE CHANGE THAT RETURNS THE TRACE

  1. I did some debugging and found that the results of mergeJoinIds(left, right [][]byte) [][]byte { from file plugin/storage/badger/spanstore/reader.go was an empty array despite both arrays left and right having content.
  2. I found that this switch statement never entered the case 0:
    switch bytes.Compare(left[l], right[r]) {
        case 0:
            // Left matches right - merge
            merged = append(merged, left[l])
            // Advance both
            l++
            r++
        case 1:
            // left > right, increase right one
            r++
        case -1:
            // left < right, increase left one
            l++
        }

    and as a result the merged array was empty.

  3. Inspecting the arrays showed that one of the arrays contained my trace.
  4. I crafted this code (probably not code you want to use) to prove something.
  5. file plugin/storage/badger/spanstore/reader.go and method func sortMergeIds(query *spanstore.TraceQueryParameters, ids [][][]byte) []model.TraceID {
        if len(merged) == 0 {
            if len(ids[0]) == 1{
                merged = append(merged, ids[0][0])
            } else if len(ids[1]) == 1 {
                merged = append(merged, ids[1][0])
            }
        }

    to line 388 Which basically just checks to see if the merged result is empty, if so check for either of the input arrays having nested array length of 1 and append that entry to the merged result.

  6. I can now search for any guid:x-request-id=... value I want and I get back the correct trace.
  7. Example image image

SPECULATION Since I don't understand how those two id arrays are populated nor do I understand why one of them doesn't contain the entry i'm interested in I can't say what the actual fix is, but my guess is that it was expected that the second array would have contained the entry that was in the 1st array and the merge would have collapsed it down and returned a valid entry. Maybe something about the search is failing?

git diff below

diff --git a/plugin/storage/badger/spanstore/reader.go b/plugin/storage/badger/spanstore/reader.go
index 5d0e156..542839e 100644
--- a/plugin/storage/badger/spanstore/reader.go
+++ b/plugin/storage/badger/spanstore/reader.go
@@ -386,6 +386,13 @@ func sortMergeIds(query *spanstore.TraceQueryParameters, ids [][][]byte) []model
                for i := 2; i < len(ids); i++ {
                        merged = mergeJoinIds(merged, ids[i])
                }
+               if len(merged) == 0 {
+                       if len(ids[0]) == 1{
+                               merged = append(merged, ids[0][0])
+                       } else if len(ids[1]) == 1 {
+                               merged = append(merged, ids[1][0])
+                       }
+               }
        } else {
                merged = ids[0]
        }
diff --git a/plugin/storage/badger/spanstore/writer.go b/plugin/storage/badger/spanstore/writer.go
index af204d2..2d63f0a 100644
--- a/plugin/storage/badger/spanstore/writer.go
+++ b/plugin/storage/badger/spanstore/writer.go
@@ -19,6 +19,7 @@ import (
        "encoding/binary"
        "encoding/json"
        "fmt"
+       "github.com/hashicorp/go-uuid"
        "io"
        "time"

@@ -69,6 +70,13 @@ func NewSpanWriter(db *badger.DB, c *CacheStore, ttl time.Duration, storageClose

 // WriteSpan writes the encoded span as well as creates indexes with defined TTL
 func (w *SpanWriter) WriteSpan(span *model.Span) error {
+       genId,_ := uuid.GenerateUUID()
+       runes := []rune(genId)
+       span.Tags = append(span.Tags,
+               model.KeyValue{Key: "guid:x-request-id", VStr: genId },
+               model.KeyValue{Key: "xrequestid", VStr: genId },
+               model.KeyValue{Key: "request", VStr: string(runes[1:7]) },
+       )

        // Avoid doing as much as possible inside the transaction boundary, create entries here
        entriesToStore := make([]*badger.Entry, 0, len(span.Tags)+4+len(span.Process.Tags)+len(span.Logs)*4)
jpkrohling commented 4 years ago

@Heraclitus nice detective work on this one! I'm not quite sure about the fix, as I'd have to dig deeper into the code to better understand it, but perhaps @burmanm still remembers about this and would be interested in taking a look?

burmanm commented 4 years ago

I can take a look. It shouldn't go to the merge if there's a single index request (there's no results to merge).

burmanm commented 4 years ago

@Heraclitus One thing that I didn't ask is that are you sure you didn't filter by serviceName? Because:

3. Inspecting the arrays showed that one of the arrays contained my trace.

If your trace did not match, it should have been in the both arrays. If it's only in the one array, that means one search query never matched your trace. Sadly I can't tell from the screenshots since those parts are cut.

Heraclitus commented 4 years ago

@burmanm on your question about serviceName. The "Find Traces" button is grey'd out unless the service drop down is populated and selecting a service is required. image

Heraclitus commented 4 years ago

@burmanm here is more of the screenshots, three examples, first without tag search, I am required to select a service to show anything and the second with tag search populated. Third, with service name removed and tags, but "Find Traces" is grey'd out so I can't search. This is still running on my modified version for 1.14.0.

image image image

jpkrohling commented 4 years ago

@Heraclitus The service name is the only required field for the search. Based on the screenshots above, I don't see an evidence of a problem: your second screenshot shows that the filter did work as expected, from what I understand.

burmanm commented 4 years ago

Isn't this test pretty close to what's trying to be achieved here? https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/badger/spanstore/read_write_test.go#L669

If you can replicate the issue by modifying that test, then that would be very helpful. Obviously that wouldn't catch if there's a mistake from some refactoring between UI <-> badger-storage-plugin, but at least it would help narrow the search scope.

Heraclitus commented 4 years ago

@jpkrohling those screenshots were taken from my altered version so the search works only because of the changes I made.

Heraclitus commented 4 years ago

@burmanm I'll give that a try today

Heraclitus commented 4 years ago

@burmanm my day job interfered yesterday but what I'm first seeing is that my tag level 1.14.0 of that file doesn't have that test, func TestRandomTraceID(t *testing.T) { Since I'm on 1.14.0 I see https://github.com/jaegertracing/jaeger/blob/v1.14.0/plugin/storage/badger/spanstore/read_write_test.go However, I'm going to try porting that test from master -> v1.14.0. Now this could be a fools errand if the api changed. I'll see if I can manage. Maybe this is something got fixed in a later version past 1.14.0. Will do some investigation.

Heraclitus commented 4 years ago

And, my concern materialized quickly... for file plugin/storage/badger/spanstore/writer.go master func (w *SpanWriter) WriteSpan(ctx context.Context, span *model.Span) error { v1.14.0 func (w *SpanWriter) WriteSpan(span *model.Span) error { https://github.com/jaegertracing/jaeger/blob/v1.14.0/plugin/storage/badger/spanstore/writer.go#L71

... hrm ... I guess what I'll try next is moving to a new version of jeager and see if I can reproduce the error. May take some time.

Heraclitus commented 4 years ago

Turned out; not so hard after all. Stash, then sync to release branch release-1.17 and get sub modules in alignment.

the test in question does in fact exist on that branch and I was able to add the following code

            Tags: model.KeyValues{
                model.KeyValue{
                    Key:   "http.request_id",
                    VStr:  "second",
                    VType: model.StringType,
                },
// made up tags for this bug
                model.KeyValue{Key: "guid:x-request-id", VStr: genId },
                model.KeyValue{Key: "xrequestid", VStr: genId },
                model.KeyValue{Key: "request", VStr: string(runes[1:7])},
            },

then to the assertion portion of the test

               // Try guid search
        params = &spanstore.TraceQueryParameters{
            StartTimeMin: time.Now().Add(-2 * time.Minute),
            StartTimeMax: time.Now(),
            ServiceName:  "nginx",
            Tags: map[string]string{
                "guid:x-request-id": genId,
            },
        }
        traces, err = sr.FindTraces(context.Background(), params)
        assert.NoError(t, err)
        assert.Equal(t, 1, len(traces))

and the result is that the unit tests passes.

Heraclitus commented 4 years ago

The next test was to see if a local running instance from release branch v.1-17 of all_in_one with the obligatory tags added like I did for 1.14.0

func (w *SpanWriter) WriteSpan(span *model.Span) error {
   genId,_ := uuid.GenerateUUID()
   runes := []rune(genId)
   span.Tags = append(span.Tags,
               model.KeyValue{Key: "guid:x-request-id", VStr: genId },
               model.KeyValue{Key: "xrequestid", VStr: genId },
               model.KeyValue{Key: "request", VStr: string(runes[1:7]) },
       )

that ensured that any span write gets the offending tags written.

Then use the local UI to browse to find a generated guid

image

then a search for that guid

image

and confirmation that the trace found matches

image

I'll do some more testing to verify i'm not false-positiving this, but initially it looks like whatever bug existed was fixed after 1.14.0

Heraclitus commented 4 years ago

k, I stashed again, checked out tag v1.15.0, submodule aligned and was able to use the existing badger state files to see previously populated traces both with and without my obligatory tags and I can confirm that v1.15.0 works without any changes to the source.

redacted@redacted-lu:/depot/github/src/github.com/jaegertracing/jaeger$ git status
HEAD detached at v1.15.0
Untracked files:
  (use "git add <file>..." to include in what will be committed)

    cmd/REDACTED... ... a bunch of files I created to help me think about this problem ...
nothing added to commit but untracked files present (use "git add" to track)
Heraclitus commented 4 years ago

Yep, I'm good. This was probably fixed in 1.15.0 here is my test loop in essence...


git checkout v1.14.0
git submodule update --init --recursive
git status 
# stop all_in_one, restart all_in_one
# force browser refresh at root path
# search for guid:x-request-id=86cddff7-2a6a-b17b-5d25-eeb1ff720499 
# NO RESULTS FOUND
git checkout v1.15.0
git submodule update --init --recursive
git status 
# stop all_in_one, restart all_in_one
# force browser refresh at root path
# search for guid:x-request-id=86cddff7-2a6a-b17b-5d25-eeb1ff720499 
# RESULT FOUND
burmanm commented 4 years ago

Good to hear, thanks for the investigation.