grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Failed to Flush user, file too large #4497

Open · Breee opened this issue 3 years ago

Breee commented 3 years ago

Describe the bug

We recently migrated from version 1.6 to 2.0.1, and then to version 2.3.

Loki logs errors like:

level=error ts=2021-10-19T14:31:00.379752655Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="open /data/loki/chunks/ZmFrZS82YWQ3ODRlMWU4N2VlMjMzOjE3Yzk4ZTcwYTdiOjE3Yzk4ZTcwYTdjOmY4YzViZTgw: file too large"
level=error ts=2021-10-19T14:31:00.380491403Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="open /data/loki/chunks/ZmFrZS80YjIxZTdhYmI5ZjY4ZjM1OjE3Yzk4ZGYzZmFmOjE3Yzk4ZGY2N2U5OjVhZGNlOThm: file too large"
level=error ts=2021-10-19T14:31:00.381217484Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="open /data/loki/chunks/ZmFrZS8yNzgyZTk3Y2RiOTczYWEwOjE3Yzk4ZGFiZGY5OjE3Yzk4ZGFiZGZhOjg4NGE0YTJk: file too large"
level=error ts=2021-10-19T14:31:00.382077728Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="open /data/loki/chunks/ZmFrZS9kZmVkYTQwYzk3MjY1YWRhOjE3Yzk4ZWE1NjgwOjE3Yzk4ZWI2YTJkOmUwZmNiMTk2: file too large"
level=error ts=2021-10-19T14:31:00.382935236Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="open /data/loki/chunks/ZmFrZS85ZTlmNjM2OGIyYmJmMTQwOjE3Yzk4ZGNlMWFjOjE3Yzk4ZGNlMWFkOjgzYzIwZWY2: file too large"
level=error ts=2021-10-19T14:31:00.384008079Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="open /data/loki/chunks/ZmFrZS85Yzg3YmVlMWNiYWRhNjQwOjE3Yzk4ZTU4MmYyOjE3Yzk4ZTY5NGUxOjVjZTI2ZDlm: file too large"

I wanted to take a look at how big these files are, but they do not even exist on disk.
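(As an aside, the file names under /data/loki/chunks/ appear to be base64-encoded chunk keys: the first one above decodes to fake/6ad784e1e87ee233:17c98e70a7b:17c98e70a7c:f8c5be80, i.e. tenant/fingerprint:from:through:checksum, which at least identifies the stream and time range the failing flush belongs to.)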

Environment:

Config looks as follows:

    auth_enabled: false
    server:
      http_listen_port: 3100

    chunk_store_config:
      max_look_back_period: 0s

    ingester:
      chunk_block_size: 262144
      chunk_idle_period: 3m
      chunk_retain_period: 1m
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1

    limits_config:
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 168h

    schema_config:
      configs:
      - from: "2018-04-15"
        store: boltdb
        object_store: filesystem
        schema: v9
        index:
          period: 168h
          prefix: index_
      - from: "2021-10-18"
        store: boltdb-shipper
        object_store: filesystem
        schema: v11
        index:
          prefix: index_loki_
          period: 24h

    storage_config:
      boltdb:
        directory: /data/loki/index
      boltdb_shipper:
        active_index_directory: /data/loki/boltdb-shipper-active
        cache_location: /data/loki/boltdb-shipper-cache
        cache_ttl: 24h
        shared_store: filesystem
      filesystem:
        directory: /data/loki/chunks
    table_manager:
      retention_deletes_enabled: false
      retention_period: 0s

    compactor:
      working_directory: /data/loki/boltdb-shipper-compactor
      shared_store: filesystem

    ruler:
      storage:
        type: local
        local:
          directory: /data/loki/rules
      rule_path: /data/loki/rules-temp
      alertmanager_url: http://localhost:9093
      ring:
        kvstore:
          store: inmemory
      enable_api: true

NgHuuAn commented 2 years ago

I got the same issue here!

Instead of "file too large", I got the message "too many open files", and the Loki service died.

Have you solved the above error yet?
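If you are hitting "too many open files", one common mitigation is to raise the file-descriptor limit of the Loki process. A minimal docker-compose sketch, assuming Loki runs as a container (the service name, image tag, and limit values below are assumptions, not taken from this thread):

    # Hypothetical compose override; service name, image tag, and limits are assumptions.
    services:
      loki:
        image: grafana/loki:2.4.1
        ulimits:
          nofile:          # max open file descriptors for the container
            soft: 65536
            hard: 65536

On a systemd-managed host the equivalent would be raising LimitNOFILE for the Loki unit.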

Reabaln commented 2 years ago

Same issue migrating from 1.6 > 2.0.1 > 2.4.1. @Breee @NgHuuAn, have you solved it?

Breee commented 2 years ago

I don't even know if it has a negative impact. The error is thrown all day.

Maybe the devs can give insight.

mr-karan commented 2 years ago

+1 same error on 2.4.1

ArshamTeymouri commented 2 years ago

+1 same error on 2.4.1

icesri commented 2 years ago

I get the following in Loki 2.2.0:

msg="failed to flush user" err="RequestError: send request failed\ncaused by: Put \"https://s3.region.amazonaws.com/loki-sample-await/test/@#@#@@\": EOF"

Please advise. Thanks.
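For the S3 variant of this error, it may be worth double-checking the bucket, region, and endpoint the chunk client is configured with. A minimal sketch of the relevant storage_config section, with placeholder values that are not taken from this thread:

    # Sketch only; bucket name and region are placeholders.
    storage_config:
      aws:
        region: eu-west-1
        bucketnames: my-loki-chunks
        s3forcepathstyle: false
      boltdb_shipper:
        active_index_directory: /data/loki/boltdb-shipper-active
        cache_location: /data/loki/boltdb-shipper-cache
        shared_store: s3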

Breee commented 2 years ago

For us this error disappeared after setting a retention policy, e.g.:

    table_manager:
      retention_deletes_enabled: true
      retention_period: 672h

which probably makes sense, because the chunks were deleted.

There still might be a bug or issue here; we don't know if this really fixes anything. But the devs have not responded here yet, so it may well not be a major issue.
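Worth noting: table_manager retention only applies to the older index/chunk stores; with boltdb-shipper (the second schema period in the config above), Loki 2.3+ normally applies retention through the compactor instead. A minimal sketch, reusing the directories from the config above and an illustrative 672h window:

    # Sketch for Loki >= 2.3 with boltdb-shipper; values are illustrative.
    compactor:
      working_directory: /data/loki/boltdb-shipper-compactor
      shared_store: filesystem
      retention_enabled: true     # the compactor, not the table manager, applies retention
    limits_config:
      retention_period: 672h      # global retention window

As with the table_manager setting above, enabling this deletes chunks once they pass the retention window.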

stale[bot] commented 2 years ago

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

P6rguVyrst commented 2 years ago

msg="failed to flush user" err=RequestError: send request failed\ncaused by: Put \"https://s3.region.amazonaws.com/loki-sample-await/test/@#@#@@\": EOF"

I seem to be running into the same issue. ☝️

[screenshot: 2022-03-04 14:29]

Am I running into some limit that's causing ingestion to fail?

When this happens it brings indexing to a stop, which isn't great.

[screenshot: 2022-03-04 14:39]
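If ingestion is actually being throttled, the per-tenant limits in limits_config are the usual place to look. A minimal sketch with purely illustrative values (not taken from this thread):

    # Illustrative values; the actual defaults differ and these are not from this thread.
    limits_config:
      ingestion_rate_mb: 8
      ingestion_burst_size_mb: 16
      max_global_streams_per_user: 10000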

stale[bot] commented 2 years ago

(Same automated stale notice as above.)

josh-ross-ai commented 2 years ago

Is anyone able to resolve or understand the error? I am hitting the same issue.

LinTechSo commented 2 years ago

Hi, any updates? Same issue for me.