Jaeger Query can OOM when retrieving large traces

vprithvi commented 6 years ago

Requirement - what kind of business use case are you trying to solve?

Retrieve large traces
Be resilient against bad instrumentation using same traceID for all traces

Problem - what in Jaeger blocks you from solving the requirement?

Jaeger Query OOMs on retrieval of large traces on Cassandra. If someone is crafty, they can easily create a trace with millions of spans, and attempt to retrieve it to systematically bring down all jaeger-query instances.

Proposed Solution - Cassandra

We might do some combination of the following:

Trace retrieval limits: Test that the number of spans per trace is less than a user defined threshold before retrieving spans.
Protect against large spans submitted on the HTTP POST endpoints by setting a user defined span size limit.
Limit number of concurrent requests served by the HTTP GET handler so that we can accurately predict and bound worst case memory utilization.

Any open questions to address

Does this affect ES as well?

pavolloffay commented 6 years ago

Related issue to ES, although its related to overloading ES not query https://github.com/jaegertracing/jaeger/issues/960

jpkrohling commented 6 years ago

And this is why it's a good idea to have the collector in a different instance for production environments :)

zigmund commented 5 years ago

Same here with ES. One failed microservice caused retry loop and we got giant trace with thousands spans. During request jaeger query eats all avaliable memory and get killed by OOM.

jpkrohling commented 5 years ago

We probably want to add this scenario to our test cases

cc @jkandasa

yurishkuro commented 5 years ago

We may consider adding LIMIT to Cassandra queries by trace ID.

pavolloffay commented 5 years ago

Limit on the number of spans in a trace?

yurishkuro commented 5 years ago

Well there are two separate things. We're seeing some traces in production with many millions of spans so we want to implement a limit on ingestion. But separately, there can be a limit on reading side to avoid loading too much even if that many spans were saved.

annanay25 commented 5 years ago

The olivere/elastic library supports sending a range query using From(0) and Size(100) to fetch a predefined number of documents - for ex: https://github.com/olivere/elastic/blob/release-branch.v6/example_test.go#L172

Would it make sense to pass this range as a param in MultiSearch here - https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/es/spanstore/reader.go#L256 ?

pavolloffay commented 5 years ago

Perhaps yes, I am not 100% percent sure how the current search works. Would you like to submit a patch?

annanay25 commented 5 years ago

Sure, I'll give it a shot soon.

annanay25 commented 5 years ago

@pavolloffay could you review this? https://github.com/annanay25/jaeger/commit/2d167b86a14f1a88a6e7ed7cee84399f8403ba88 Reduced the default fetch limit from 10,000 to 1,000. And added a FetchNext function to implement scroll.

yurishkuro commented 5 years ago

I think fetching in batches is not the issue here. If a trace has 1mm spans, whether you fetch in batches or all at once, the query service still needs to store all of them in memory before returning to the client, since we do not support streaming of the results (something to think about when designing the protobuf API for query).

I am not sure what impact using fetching has on the ES itself. At least with Cassandra we're getting a result set that we iterate through, so the driver & storage have the opportunity to load data in a streaming fashion, without Cassandra node holding everything in memory.

annanay25 commented 5 years ago

@yurishkuro the idea was that the query nodes would hold data of only one batch at a time in-memory, and would send this batch to the client (assuming the UI). The client could choose to read and then discard the spans in batches.

But as you have mentioned that jaeger does not support streaming of results, what approach do you recommend?

yurishkuro commented 5 years ago

I think the simplest fix is what is proposed in the description of this ticket. It's not ideal since it will truncate the data, but we can revisit that later when/if we support a streaming API from the query service. At least this fix is fairly straightforward and will protect the query svc from OOM. Most of the analytics will choke on a trace with >100k spans anyway. The limits will be an optional configuration.

zigmund commented 5 years ago

@yurishkuro I agree. Maybe should just mark trace in web interface that it is truncated due span count limit.

annanay25 commented 5 years ago

@yurishkuro truncated search results (truncated span-count as well, for ES) implemented for ES and Cassandra - https://github.com/annanay25/jaeger/commit/9eb7590176383f4f6e95c7194a761b1952f8d024

We were waiting on a test case for this? Is someone working on that?

yurishkuro commented 5 years ago

Not afaik

mohit-chawla commented 4 years ago

Not 100% related but i have a question:

Edge case: For a use case, i need to use in mem storage, was wondering about edge case where one huge trace takes up whole memory, will this cause any OOM errors or will the trace be discard?

do we have a max trace size available ?
how can i increase the mem allocated ?

cc: @annanay25

Sreevani871 commented 3 years ago

Encountering OOM issues while fetching traces from jaeger-UI. Details : Backend Storage: Elasticsearch Instance RAM: 4 GB(~3.7 GB Free Memory) Instance vCPU:2

We are encountering OOM kill upon fetching 20 traces(each trace consists of ~20K spans of 1kb size each) from UI. This is causing frequent process restarts. While going through code, observed jaeger-query service FindTraces API is fetching all spans of a resulted traces in this call itself. Would it be more efficient if it is feasible to fetch only top traceID's and related stats info to display in UI by using ES count queries and all spans of a trace can be queried on click on particular trace. In this case also OOM issue will come if a single trace has large set of spans. To avoid this can we just load default no of spans for a trace initially and providing an option like load more in UI to scroll further set of spans till the trace has no more spans left.

yurishkuro commented 3 years ago

@Sreevani871 yes, we even had tickets somewhere for that, but no volunteers to implement it.

Sreevani871 commented 3 years ago

@yurishkuro Any specific reason for fetching all info in FindTraces call itself from UI?

yurishkuro commented 3 years ago

If you mean the internal implementation of FindTraces, then yes, the reason is because there is no summary information about the full trace available anywhere else, since collectors receive spans one at a time. Storage is the only place that can provide a full trace, but you need to read that full trace in order to extract summary. It's possible to build a post-processing, but it will require additional infrastructure/coordination.

Sreevani871 commented 3 years ago

@Sreevani871 yes, we even had tickets somewhere for that, but no volunteers to implement it.

Can you share the tickets will have a look?

yurishkuro commented 3 years ago

There is context here: https://github.com/jaegertracing/jaeger-ui/issues/247

Sreevani871 commented 3 years ago

@yurishkuro Any progress on this?

Sreevani871 commented 3 years ago

To overcome frequent OOM kills of jaeger-query service (Due to traces having a large number of spans), we need to avoid fetching of all spans data of respective TopN traces. Here listing out the approaches we thought of to solve the issue. we are using elasticsearch as span storage. Please review and share your feedback

Approach - I: Fetch summary of TopN traces instead of fetching all raw span data of respective traces from the backend Returning the summary instead of all raw span data to UI avoids the summary generation step at the UI side which intern saves UI page load time. Changes Required at Backend:

Introduce a new API: /summary to fetch a summary of traces.
Need to add code flow for /summary API which involves changes at querier and span reader interface which will enforce all span storage types to implement summary method at the reader side.
The API returns response as map[TraceID]TraceSummary{}

type TraceSummary  {
    ServiceName string.        //  service name of the old span of trace
    OperationName string    // operation name of the old span of trace
    StartTime string             // start time of the old span of a trace.
    Duration time.Duration // duration of the old span of a trace.
    SpanCount int               // total span count of a trace
    SpanCountByService map[string]int
    ErrorSpanCount int // error span count of a trace
    SpanID string
    }

Here old span of trace refers to parent span ideally. To avoid cases like parent span missing or parent span not indexed at the time fo search, considering fetching old span(sort by startTime) of a trace available at the time of the search.

Code Changes Required at UI:

UI has to call a new API to get the summary of traces data.
Make use of summary data returned from the backend to generate the FindTraces page.
Since UI does not have full span data of a trace, On click on trace, UI should call the backed for full span details.

Pros:

The memory usage for this is approach is very minimal and no.of the objects returned from the backend to UI is equal to no.of traces we fetch and each trace object carries the only summary.
Memory usage is better than Approach-2

Cons:

Requires code changes at both backend and frontend. Compare to Approach-2 UI requires considerable code changes to adopt this approach.

Note: FindTraces page has a Deep Dependency Graph option, Since we are avoiding fetching all span data in the FindTraces API call, Need to have a separate API which will fetch the required fields of spans for traces to construct DDG which will overlap with our idea of Approach-2.

Approach - II: Analyze the required fieldset of span to compute the FindTraces UI page without modifying the existing API response data model. Observed following fieldset of the span are being used to generate the FindTraces list view and DDG.

ServiceName
OperationName
References
Tags.Span.Kind
StartTime
Duration

Changes Required At Backend:

Change the ES query to fetch only selective fields from the source.
The following specifies data flow from ES to UI back.
FindTraces() method returns []jaeger.model.Trace ES -> []byte -> MultiSearchResults{} -> es.dbmodel.Span -> jaeger.model.Span -> jaeger.model.Trace
- ConvertToUI model []jager.model.Trace -> []json.Trace
In the above flow data transformation(serialization/deserialization) steps are consuming 2 to 3X extra memory of actual data. Confirmed from heap profile alloc_space data.
To avoid this extra memory usage we can reduce these steps and will have the following code flow ES ->[]byte -> MultiSearchResults{} -> []json.Trace.

Changes Required At Frontend:

No considerable code changes required at the UI side expect the API call if we introduce a new API for this approach. Pros:
- Only changes required at the backend side. No changes required at the frontend side.
- It will reduce the memory consumption by x percentage(since we are avoiding fetching unnecessary fieldset of a span).

Cons:

It will reduce the memory by x percentage compared to existing code, but no the objects returned from backed to UI not reduced here. Ex: Fetching 20 traces with each trace has 20K spans resulting in 400K objects. Approach-1 is comparatively better in terms of memory usage reduction and the UI page load time since we reduced UI operations of building a summary from data returned from the backend.

yurishkuro commented 3 years ago

Note: FindTraces page has a Deep Dependency Graph option...

That's a very astute observation, since DDG is built entirely by the UI and requires loading all spans of the trace (but not all data in the spans).

A more flexible variation of approach 2 is to build support for GraphQL. It's going to be difficult to support partial field retrieval at the storage level (some storage implementations may only allow retrieving the whole trace as a blob), so the whole data is still going to be loaded in memory from the database. The memory savings can come from refactoring the storage API to be more like stream, i.e. instead of synchronous:

FindTraces(query) ([]*model.Span, error)

implement the streaming version (incidentally, the grpc-plugin storage API already uses streaming under the hood):

FindTracesStreaming(query, handler func(*model.Span) error) error

The handler function can immediately convert model.Span to JSON output object, AND trim it down by filtering only the fields required by GraphQL query.

This will not reduce the number of memory allocations, but will allow query service not to hold onto every single span of the whole result set.

Later this can be further extended into approach 1, which will reduce memory requirements even further (but DDG from search will likely require a second query, which is probably ok since not every search is used to display DDG).

Sreevani871 commented 3 years ago

Using json.Unmarshal() func instead of json.Decode() func here https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/es/spanstore/reader.go#L233 shows improvement on memory usage because json.Decode() func is buffering entire json value in memory again. Since here the data passed to unmarshalJSON() method is already in memory so using json.Unmarshal() is improving performance in terms of memory usage. Benchmarking Results:


goos: darwin
goarch: amd64
BenchmarkUnmarshal-8     2251400               527 ns/op             352 B/op          5 allocs/op
BenchmarkDecoder-8       1470145               812 ns/op            1096 B/op          8 allocs/op


package test

import (
    "bytes"
    "testing"
    "fmt"
    "encoding/json"
)

type JSON struct {
    a []int
    b string
    c float64
    d []byte
}

var j = JSON{
    a: []int{1,2,4,5,566,9},
    b: "Testing Data",
    c: 145.0,
    d: []byte("Testing Data"),
}
func BenchmarkUnmarshal(b *testing.B){
    for n := 0; n < b.N; n++ {
        unmarshalJson(marshalJson(j))
    }
}

func BenchmarkDecoder(b *testing.B){
    for n := 0; n < b.N; n++ {
        decodeJson(marshalJson(j))
    }
}

func marshalJson(j JSON) []byte{
    v,err:=json.Marshal(j)
    if err !=nil {
        fmt.Println(err)
        return nil
    }
    return v
}
func unmarshalJson(data []byte) {
    var j =JSON{}
    err:=json.Unmarshal(data, &j)
    if err!=nil {
        fmt.Println(err)
    }
}

func decodeJson(data []byte) {
    var j = JSON{}
    d := json.NewDecoder(bytes.NewReader(data))
    d.UseNumber()
    if err := d.Decode(&j); err != nil {
        fmt.Println(err)
    }
}

yurishkuro commented 1 year ago

Interesting query benchmark numbers from a post on Slack:

1 trace with approximately 1700 spans returns in 1.5 seconds. (5 unique ES queries (+ 2 ES queries added per any additional trace/limit) all complete in under a second)
10 traces, each with approximately 1700 spans, return in 5 seconds. Memory: 150MB
100 traces, each with approximately 1700 spans, return in 30 seconds. Memory: 700MB - 1.5GB (203 unique ES queries, all of which complete in under 3 seconds)

jkowall commented 3 months ago

One possible solution here would be to limit the number of spans returned when pulling up a trace to 50k and allow this to be configurable.

jaegertracing / jaeger