grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Loki «loses» the tables in Cassandra after a while #6652

Closed strafer closed 2 years ago

strafer commented 2 years ago

Describe the bug I am preparing a Loki installation for production, using Cassandra as an index store and S3 (non-AWS) as a chunk store.

Loki Config

```yaml
auth_enabled: false
common:
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
limits_config:
  max_query_lookback: 24h
schema_config:
  configs:
  - chunks:
      period: 24h
    from: '2022-06-04'
    index:
      period: 24h
      prefix: index_
    object_store: aws
    schema: v11
    store: cassandra
server:
  grpc_listen_port: 9095
  http_listen_port: 3100
storage_config:
  aws:
    access_key_id: <…>
    bucketnames: Logs
    endpoint: <…>
    region: <…>
    s3forcepathstyle: true
    secret_access_key: <…>
  cassandra:
    addresses: 127.0.0.1
    auth: true
    keyspace: loki_index
    password: <…>
    username: loki
table_manager:
  retention_deletes_enabled: true
  retention_period: 24h
target: all,table-manager
```
Cassandra Config
For the most part, these are default values.

```yaml
cluster_name: 'loki'
num_tokens: 16
allocate_tokens_for_local_replication_factor: 3
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000 # 3 hours
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
hints_flush_period_in_ms: 10000
max_hints_file_size_in_mb: 128
batchlog_replay_throttle_in_kb: 1024
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
role_manager: CassandraRoleManager
network_authorizer: AllowAllNetworkAuthorizer
roles_validity_in_ms: 2000
permissions_validity_in_ms: 2000
credentials_validity_in_ms: 2000
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
cdc_enabled: false
disk_failure_policy: stop
commit_failure_policy: stop
prepared_statements_cache_size_mb:
key_cache_size_in_mb:
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
counter_cache_size_in_mb:
counter_cache_save_period: 7200
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "<…>"
concurrent_reads: 32
concurrent_writes: 32
concurrent_counter_writes: 32
concurrent_materialized_view_writes: 32
memtable_allocation_type: heap_buffers
index_summary_capacity_in_mb:
index_summary_resize_interval_in_minutes: 60
trickle_fsync: false
trickle_fsync_interval_in_kb: 10240
storage_port: 7000
ssl_storage_port: 7001
listen_address: 10.10.80.10
broadcast_address: 10.10.80.10
start_native_transport: true
native_transport_port: 9042
native_transport_allow_older_protocols: true
rpc_address: 0.0.0.0
broadcast_rpc_address: 10.10.80.10
rpc_keepalive: true
incremental_backups: false
snapshot_before_compaction: false
auto_snapshot: true
snapshot_links_per_second: 0
column_index_size_in_kb: 64
column_index_cache_size_in_kb: 2
concurrent_materialized_view_builders: 1
compaction_throughput_mb_per_sec: 64
sstable_preemptive_open_interval_in_mb: 50
read_request_timeout_in_ms: 5000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 2000
counter_write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
slow_query_log_timeout_in_ms: 500
endpoint_snitch: SimpleSnitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 1.0
server_encryption_options:
  internode_encryption: none
  enable_legacy_ssl_storage_port: false
  keystore: conf/.keystore
  keystore_password: cassandra
  require_client_auth: false
  truststore: conf/.truststore
  truststore_password: cassandra
  require_endpoint_verification: false
client_encryption_options:
  enabled: false
  keystore: conf/.keystore
  keystore_password: cassandra
  require_client_auth: false
internode_compression: dc
inter_dc_tcp_nodelay: false
tracetype_query_ttl: 86400
tracetype_repair_ttl: 604800
enable_user_defined_functions: false
enable_scripted_user_defined_functions: false
windows_timer_interval: 1
transparent_data_encryption_options:
  enabled: false
  chunk_length_kb: 64
  cipher: AES/CBC/PKCS5Padding
  key_alias: testing:1
  key_provider:
    - class_name: org.apache.cassandra.security.JKSKeyProvider
      parameters:
        - keystore: conf/.keystore
          keystore_password: cassandra
          store_type: JCEKS
          key_password: cassandra
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000
replica_filtering_protection:
  cached_rows_warn_threshold: 2000
  cached_rows_fail_threshold: 32000
batch_size_warn_threshold_in_kb: 5
batch_size_fail_threshold_in_kb: 50
unlogged_batch_across_partitions_warn_threshold: 10
compaction_large_partition_warning_threshold_mb: 100
audit_logging_options:
  enabled: false
  logger:
    - class_name: BinAuditLogger
diagnostic_events_enabled: false
repaired_data_tracking_for_range_reads_enabled: false
repaired_data_tracking_for_partition_reads_enabled: false
report_unconfirmed_repaired_data_mismatches: false
enable_materialized_views: false
enable_sasi_indexes: false
enable_transient_replication: false
enable_drop_compact_storage: false
```

For a while everything works without complaints, but then I notice the following errors in Loki's own logs:

```
level=error ts=2022-07-11T04:49:54.329267527Z caller=flush.go:146 org_id=fake msg="failed to flush user" err="store put chunk: table index_19172 does not exist"
level=error ts=2022-07-11T04:49:54.339117798Z caller=flush.go:146 org_id=fake msg="failed to flush user" err="store put chunk: table index_19173 does not exist"
level=error ts=2022-07-11T04:49:54.350659343Z caller=flush.go:146 org_id=fake msg="failed to flush user" err="store put chunk: table index_19172 does not exist"
level=error ts=2022-07-11T04:49:54.359466492Z caller=flush.go:146 org_id=fake msg="failed to flush user" err="store put chunk: table index_19172 does not exist"
level=error ts=2022-07-11T04:49:54.375465481Z caller=flush.go:146 org_id=fake msg="failed to flush user" err="store put chunk: table index_19173 does not exist"
level=error ts=2022-07-11T04:49:54.380060507Z caller=flush.go:146 org_id=fake msg="failed to flush user" err="store put chunk: table index_19172 does not exist"
level=error ts=2022-07-11T04:49:54.381992288Z caller=flush.go:146 org_id=fake msg="failed to flush user" err="store put chunk: table index_19172 does not exist"
level=error ts=2022-07-11T04:49:54.382001979Z caller=flush.go:146 org_id=fake msg="failed to flush user" err="store put chunk: table index_19172 does not exist"
level=error ts=2022-07-11T04:49:54.385022939Z caller=flush.go:146 org_id=fake msg="failed to flush user" err="store put chunk: table index_19172 does not exist"
level=error ts=2022-07-11T04:49:54.399914795Z caller=flush.go:146 org_id=fake msg="failed to flush user" err="store put chunk: table index_19172 does not exist"
```

These errors pour in as a continuous stream, many times a second. I looked in Cassandra, and the tables named in the errors really do not exist: at the time of diagnosis only the index_19183 and index_19184 tables are present. I stop Loki, delete all the tables and all the chunks, and start from scratch. Everything looks fine again, but after a while the errors come back to the Loki log.
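For anyone decoding these table names: with a 24h index period, Loki's periodic index tables are named with the prefix plus the unix day number, i.e. floor(unix_seconds / 86400). Below is a minimal Go sketch of that arithmetic (an illustration of the naming scheme, not Loki's actual code):

```go
package main

import (
	"fmt"
	"time"
)

// With a 24h index period, periodic table names are "<prefix><n>" where
// n = floor(unix_seconds / 86400), i.e. the unix day number.
func tableName(prefix string, t time.Time) string {
	return fmt.Sprintf("%s%d", prefix, t.Unix()/86400)
}

func main() {
	// Timestamp of the failing flushes from the log above.
	flush := time.Date(2022, 7, 11, 4, 49, 54, 0, time.UTC)
	fmt.Println(tableName("index_", flush)) // index_19184, one of the two tables still present

	// The table the failing flush targets, decoded back to a date:
	missing := time.Unix(19172*86400, 0).UTC()
	fmt.Println(missing.Format("2006-01-02")) // 2022-06-29, far past the 24h retention
}
```

So index_19172 corresponds to data from 2022-06-29, roughly twelve days before the failing flushes on 2022-07-11: the table-manager, honoring retention_period: 24h, had long since dropped that table, while the ingesters were still accepting and flushing samples that old.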

To Reproduce I do not know exactly how to reproduce this situation, other than deploying Loki with my config.

Expected behavior No errors in the Loki log related to «lost» tables.

Environment: Loki is running in a podman-managed container using an unmodified official image from Docker Hub.

Screenshots, Promtail config, or terminal output See above in the description.

liguozhong commented 2 years ago

You need to add a `reject_old_samples_max_age` setting so that Loki stops accepting samples old enough to belong to a table that your `retention_period: 24h` has already deleted:

```yaml
limits_config:
  reject_old_samples_max_age: 24h
```
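Put in the context of the original config, the relevant sections would then look roughly like this (a sketch; `reject_old_samples` is spelled out because some Loki releases default it to false, so verify the default for your version):

```yaml
limits_config:
  max_query_lookback: 24h
  reject_old_samples: true          # assumption: may already be the default on your version
  reject_old_samples_max_age: 24h   # keep in lockstep with retention_period below
table_manager:
  retention_deletes_enabled: true
  retention_period: 24h             # table-manager drops index tables older than this
```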
strafer commented 2 years ago

@liguozhong thank you! I'll try it.

strafer commented 2 years ago

@liguozhong you were right: since I started Loki from scratch with this parameter, the error has not returned. Thank you very much!

patsevanton commented 1 year ago

@strafer Hello! Could you answer: which Loki version were you running, and what was your deployment, helm-chart or docker-compose? Thanks a lot!

strafer commented 1 year ago

@patsevanton hello, namesake! At the time the issue was created it was version 2.6.0, run under podman as described in the first message; the container was created by the Ansible podman_container module, and the operating system was Debian 11.