grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Loglines disappear after 5m (when ingester flushes to DB) #1955

Closed: holmbaeck closed this issue 4 years ago

holmbaeck commented 4 years ago

Describe the bug I have this issue with both 1.4.0 and 1.4.1. For me it seems like an issue with either a too-long log line or too-large parsed fields.

I can see the lines for 5 minutes. I guess this is while the data is in the ingester, before it is flushed to the DB. But after 5m the lines disappear. There are no errors in the Loki pod logs.

Is this a problem because the "large" json field gets stored in a parsed field, and if so, can we remedy it somehow?

To Reproduce Steps to reproduce the behavior:

  1. Deploy the Loki stack
  2. Deploy a pod
  3. Create a log line with a large JSON field
  4. Query Loki immediately - look at the nice results (see the API sketch after these steps)
  5. Wait 5m
  6. Query again - wail in despair
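
For reference, the queries in steps 4 and 6 go against Loki's HTTP API; the JSON responses below are the raw API responses. A minimal sketch of such a query (host, port, and label selector are assumptions; the selector mirrors the query visible in the Loki log further down):

curl -G -s "http://<loki-host>:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={app="address-service"}' \
  --data-urlencode 'limit=100'

Without explicit start/end parameters the endpoint defaults to roughly the last hour.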

Expected behavior Get all log lines

{
    "status": "success",
    "data": {
        "resultType": "streams",
        "result": ["three lines with lots of data"],
        "stats": {
            "summary": {
                "bytesProcessedPerSeconds": 33994148,
                "linesProcessedPerSeconds": 2737,
                "totalBytesProcessed": 37248,
                "totalLinesProcessed": 3,
                "execTime": 0.001095718
            },
            "store": {
                "totalChunksRef": 0,
                "totalChunksDownloaded": 0,
                "chunksDownloadTime": 0,
                "headChunkBytes": 0,
                "headChunkLines": 0,
                "decompressedBytes": 0,
                "decompressedLines": 0,
                "compressedBytes": 0,
                "totalDuplicates": 0
            },
            "ingester": {
                "totalReached": 1,
                "totalChunksMatched": 3,
                "totalBatches": 1,
                "totalLinesSent": 3,
                "headChunkBytes": 37248,
                "headChunkLines": 3,
                "decompressedBytes": 0,
                "decompressedLines": 0,
                "compressedBytes": 0,
                "totalDuplicates": 0
            }
        }
    }
}

Calling again after 5 minutes should still return all log lines -- but instead we get this:

{
    "status": "success",
    "data": {
        "resultType": "streams",
        "result": [],
        "stats": {
            "summary": {
                "bytesProcessedPerSeconds": 0,
                "linesProcessedPerSeconds": 0,
                "totalBytesProcessed": 0,
                "totalLinesProcessed": 0,
                "execTime": 0.001571336
            },
            "store": {
                "totalChunksRef": 3,
                "totalChunksDownloaded": 3,
                "chunksDownloadTime": 0.000347704,
                "headChunkBytes": 0,
                "headChunkLines": 0,
                "decompressedBytes": 0,
                "decompressedLines": 0,
                "compressedBytes": 2865,
                "totalDuplicates": 0
            },
            "ingester": {
                "totalReached": 1,
                "totalChunksMatched": 0,
                "totalBatches": 0,
                "totalLinesSent": 0,
                "headChunkBytes": 0,
                "headChunkLines": 0,
                "decompressedBytes": 0,
                "decompressedLines": 0,
                "compressedBytes": 0,
                "totalDuplicates": 0
            }
        }
    }
}

Environment: Kubernetes, deployed via the loki-stack Helm chart

Screenshots, Promtail config, or terminal output

Loki-stack values:

loki-stack:
  loki:
    enabled: true
    persistence:
      enabled: true
    config:
      ingester:
        max_transfer_retries: 3
      table_manager:
        retention_deletes_enabled: true
        retention_period: 168h

  grafana:
    enabled: true
    admin:
      existingSecret: "grafana"

  prometheus:
    enabled: true
    server:
      statefulSet:
        enabled: true
      persistentVolume:
        enabled: true
    alertmanager:
      persistentVolume:
        enabled: true
    nodeExporter:
      enabled: false
    pushgateway:
      enabled: false
    kubeStateMetrics:
      enabled: false    

  fluent-bit:
    enabled: false

  promtail:
    enabled: true
    ## Remove ability to deploy on master
    tolerations: []
    pipelineStages:
    - docker: {}
    - json:
        expressions:
          app:
          message:
          level:
          logger_name:
          response_code:
          request_body:
          request_path:
          response_body:
          response_time:
          request_method:
          application_name:
          trace_id:
          span_id:
    - template:
        source: jaeger_link
        template: 'https://jaeger.{DOMAIN_TOKEN}/trace/{{ .trace_id }}'
    - labels:
          app:
          level:
          logger_name:
          response_code:
          request_path:
          response_time:
          request_method:
          application_name:     
          trace_id:
          span_id:
          jaeger_link:
    - match:
        selector: '{namespace!="{NAMESPACE_TOKEN}"}'
        action: drop
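
For context, a single application log line feeding this pipeline would look roughly like the following (a purely hypothetical example; the field names only mirror the json expressions above, and response_body stands in for the large ~12k debug payload mentioned below):

{"app":"address-service","level":"INFO","logger_name":"c.e.AddressController","message":"address lookup ok","response_code":200,"request_path":"/api/v1/addresses/42","request_method":"GET","response_time":12,"application_name":"address-service","trace_id":"abc123","span_id":"def456","request_body":"{...}","response_body":"<~12k of debug output>"}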

Logs are picked up immediately (screenshots attached). The log line has a 12k JSON field (debug output); if I reduce the data in that field, the log line stays after the flush. After 5 minutes - same request - no data. No data :(

Loki log:

level=info ts=2020-04-17T06:59:48.652157744Z caller=metrics.go:83 org_id=fake trace_id=76825d1e93cab114 latency=fast query="{app=\"address-service\"}" query_type=limited range_type=range length=30m1s step=1s duration=1.345012ms status=200 throughput_mb=2.7693433 total_bytes_mb=0.0037248
level=info ts=2020-04-17T07:00:13.056745392Z caller=table_manager.go:247 msg="synching tables" expected_tables=2
level=info ts=2020-04-17T07:02:13.056732943Z caller=table_manager.go:247 msg="synching tables" expected_tables=2
level=info ts=2020-04-17T07:04:13.056702243Z caller=table_manager.go:247 msg="synching tables" expected_tables=2
level=info ts=2020-04-17T07:04:57.82886167Z caller=metrics.go:83 org_id=fake trace_id=49fab37dbcfc3e7c latency=fast query="{app=\"address-service\"}" query_type=limited range_type=range length=30m1s step=1s duration=1.571336ms status=200 throughput_mb=0 total_bytes_mb=0
level=info ts=2020-04-17T07:06:13.056714063Z caller=table_manager.go:247 msg="synching tables" expected_tables=2

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

brandonstark commented 4 years ago

Did you ever figure this out? I have the same issue in 1.5

holmbaeck commented 4 years ago

@brandonstark Unfortunately not, we haven't had time to really dig into it. Hopefully we can do something within the next couple of weeks. We are still a few months from production, but if this is not fixed by then, the fallback will be an EFK stack.

cyriltovena commented 4 years ago

Have you tried looking at what the volume (PVC) contains? To see if it's getting erased somehow?

holmbaeck commented 4 years ago

@cyriltovena, since the other log lines stay and only log lines with parsed fields larger than 3-4k get erased, I do not see how it can be a PVC issue. My guess is that it is a limitation on parsed fields on the receiving end of the ingester's flush.

cyriltovena commented 4 years ago

Have you tried logcli? It may be related to Grafana not handling very long lines correctly.
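
For what it's worth, a minimal logcli invocation for this case could look like the sketch below (the endpoint is a placeholder; the selector mirrors the query in the Loki log above):

export LOKI_ADDR=http://<loki-host>:3100
logcli query '{app="address-service"}' --since=30m --limit=100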

cyriltovena commented 4 years ago

Oh sorry, I didn't see the response in the network tab.

holmbaeck commented 4 years ago

@cyriltovena Same result with logcli. The log line is shown for 5 minutes while it is in the ingester, then disappears. If I change the ingester config to retain chunks for longer, the lines remain searchable in Grafana for the chunk_idle_period:

      ingester:
        chunk_idle_period: 1h
        max_chunk_age: 1h
        chunk_block_size: 1048576
        chunk_retain_period: 30s
        max_transfer_retries: 0        

The above makes the lines available for 1 hour. I have also bumped to Loki v1.5.0, with the same result.

Logcli stats

Ingester.TotalReached            1
Ingester.TotalChunksMatched      3
Ingester.TotalBatches            0
Ingester.TotalLinesSent          0
Ingester.HeadChunkBytes          18 kB
Ingester.HeadChunkLines          3
Ingester.DecompressedBytes       0 B
Ingester.DecompressedLines       0
Ingester.CompressedBytes         0 B
Ingester.TotalDuplicates         0
Store.TotalChunksRef             7
Store.TotalChunksDownloaded      7
Store.ChunksDownloadTime         736.857µs
Store.HeadChunkBytes             0 B
Store.HeadChunkLines             0
Store.DecompressedBytes          14 kB
Store.DecompressedLines          6
Store.CompressedBytes            7.4 kB
Store.TotalDuplicates            0
Summary.BytesProcessedPerSecond          13 MB
Summary.LinesProcessedPerSecond          3679
Summary.TotalBytesProcessed              31 kB
Summary.TotalLinesProcessed              9
Summary.ExecTime                         2.446084ms

--- After 5 min
Ingester.TotalReached            1
Ingester.TotalChunksMatched      3
Ingester.TotalBatches            0
Ingester.TotalLinesSent          0
Ingester.HeadChunkBytes          0 B
Ingester.HeadChunkLines          0
Ingester.DecompressedBytes       1.6 kB
Ingester.DecompressedLines       2
Ingester.CompressedBytes         3.8 kB
Ingester.TotalDuplicates         0
Store.TotalChunksRef             10
Store.TotalChunksDownloaded      10
Store.ChunksDownloadTime         746.251µs
Store.HeadChunkBytes             0 B
Store.HeadChunkLines             0
Store.DecompressedBytes          15 kB
Store.DecompressedLines          8
Store.CompressedBytes            11 kB
Store.TotalDuplicates            2
Summary.BytesProcessedPerSecond          5.7 MB
Summary.LinesProcessedPerSecond          3429
Summary.TotalBytesProcessed              17 kB
Summary.TotalLinesProcessed              10
Summary.ExecTime                         2.915532ms

cyriltovena commented 4 years ago

OK, that's good. Can you activate debug logging and check whether there are any interesting logs when chunks are flushed? Btw, a SIGTERM will force a flush.
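
A sketch of how debug logging could be enabled with the values layout used earlier in this issue (the nesting mirrors the config block above; log_level sits under the server section of the Loki config):

loki-stack:
  loki:
    config:
      server:
        log_level: debug

Deleting the Loki pod (e.g. kubectl delete pod) sends it a SIGTERM, which triggers the graceful shutdown and forced flush mentioned above.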

cyriltovena commented 4 years ago

Also, can you keep only level, app, and logger_name as labels? The rest is easily searchable with LogQL, and using those values as labels forces a chunk per request, which we don't want.

holmbaeck commented 4 years ago

@cyriltovena Unfortunately there are no interesting logs with debug enabled. Attached: loki-debug.log

cyriltovena commented 4 years ago

OK. Could you send me a chunk where a line is supposed to be but doesn't show up, and tell me which line it is? I'll check whether it has been flushed, just to narrow down the issue.

There is also a chunk-inspection tool in ./cmd that you could use if you want to push this further; see https://github.com/grafana/loki/tree/master/cmd/chunks-inspect
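
A rough sketch of building and running that tool from a checkout of the repo (the chunk path is a placeholder; see the tool's README for the exact flags to dump blocks and individual lines):

cd cmd/chunks-inspect
go build .
./chunks-inspect <path-to-chunk-file>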

holmbaeck commented 4 years ago

@cyriltovena Nice tool. I found the chunk, so the line is getting stored, but it doesn't show up in search. Chunk attached: the-missing-chunk.zip

holmbaeck commented 4 years ago

For reference, these two chunks were created at the same time (during the same request) by the same service and are searchable: small-1.zip small-2.zip

cyriltovena commented 4 years ago

Thanks for this. Do you know how big the line is? I'll take a look at this chunk soon, try to reproduce, and hopefully get a fix for it.

holmbaeck commented 4 years ago

@cyriltovena FYI, as per your suggestion I have reduced the number of labels, and now the "missing" log line can be found.

Promtail config:

pipelineStages:
    - labels:
          app:
          application_name:     
          level:
          logger_name:

Removed labels:
          response_code:
          request_path:
          response_time:
          request_method:
          application_name:     
          trace_id:
          span_id:
          jaeger_link:

cyriltovena commented 4 years ago

Possibly you were hitting a limit on the number of files on the filesystem. What's weird is that no logs or errors were showing up; that's annoying.

Those labels are bad because they cause each request to create a file with a single line inside, so you end up with millions of tiny chunks. Loki is designed to hold at least 10k to 100k lines per chunk.

holmbaeck commented 4 years ago

Since I can find the line in the chunks, the problem must be in the query path of Loki, not in storage. Maybe, because we create so many tiny chunks, the line is being skipped somehow (lines in several closely related chunks being seen as duplicates and therefore discarded?). After removing the unique labels, we have not noticed any lines disappearing. Maybe we can just call this an abnormal use of labels that is discouraged.

@cyriltovena, thank you for the effort. Until I see another "disappearing" line, I will consider this resolved. Hopefully you can find something helpful in the attached chunks.

You mentioned that the "rest is easily searchable with logql". I can only see that it can be done via regex; or is it possible to reference "parsed fields" directly?

cyriltovena commented 4 years ago

Regex and string contains (|=); Loki excels at those. For parsed fields, we have plans to add this to the language; there is a GrafanaCon talk giving a glimpse of it. See https://grafana.com/go/grafanaconline/loki-future/ near the end.
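
A couple of illustrative line-filter queries in that spirit (the label selector and filtered strings are hypothetical; the exact spelling depends on the raw log format):

{app="address-service"} |= "response_code"
{app="address-service", level="ERROR"} |~ "request_path.*addresses"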