Open ceevaaa opened 1 week ago
Pinging code owners:
receiver/filelog: @djaglowski
See Adding Labels via Comments if you do not have permissions to add labels yourself.
There are some circumstances where the receiver would not be able to keep up with log rotation. Specifically, if the files are being rotated to a location where they are not found in subsequent polling intervals, or if the log files are deleted before the receiver has had a chance to consume them (due to a rotation configuration that defines the max number of backups).
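For illustration (a minimal sketch with hypothetical paths, not taken from this issue): the first failure mode can often be avoided by letting the include pattern also match the rotated location, as long as the rotated files stay readable. The receiver identifies files by fingerprint, so a file that is moved (rather than copied and truncated) can still be finished from its new location.
receivers:
  filelog:
    include:
      - /var/log/app/current.log      # hypothetical active log file
      - /var/log/app/archive/*.log    # hypothetical rotated location, so moved files are still found
    start_at: beginning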
It sounds like you have a high throughput scenario so it seems possible. Are the rotated files being deleted eventually, and if so, can we establish in some way the frequency? If you can provide an estimate of the max number of backups and the average log size in bytes, then I think we can work out whether or not such a limitation is in play.
The other factor I'm wondering about is whether the debug exporter is too slow to keep up, and may in fact be causing backpressure on the receiver which unnecessarily delays reading the logs. You could consider using the count connector to assess this, because it should be a much more efficient way to count the logs. e.g.
receivers:
  filelog:
    include: ...
connectors:
  count:
processors:
  deltatocumulative:
exporters:
  debug:
service:
  pipelines:
    logs/in:
      receivers: [ filelog ]
      exporters: [ count ]
    metrics/out:
      receivers: [ count ]
      processors: [ deltatocumulative ] # aggregate "+1"s into cumulative count
      exporters: [ debug ] # only print the counts, rather than the logs themselves
Let me share a little more about the directory and how the log rotation is being handled.
Below is the directory structure.
├── Application
│ ├── login.log
│ ├── main.java
│ ├── archive
│ │ ├── 2024-09-11.login.01.log
│ │ ├── 2024-09-11.login.02.log
│ │ ├── 2024-09-11.login.03.log
│ │ ├── .....
The logs are continuously written to login.log by main.java. Once the file exceeds 250MB in size, it is moved to the archive folder and a new file is created, replacing the old one.
Here is the snapshot that describes how quickly login.log files are archived. The intervals vary from 1m, 2m, and so on.
The config you shared with deltatocumulative didn't work. Maybe I am missing something here.
Sep 12 12:43:19 ip-10-23-180-96.ap-south-1.compute.internal systemd[1]: Started OpenTelemetry Collector Contrib.
Sep 12 12:43:19 ip-10-23-180-96.ap-south-1.compute.internal otelcol-contrib[28494]: 2024/09/12 12:43:19 collector server run finished with error: failed to get config: cannot unmarshal the configuration: 1 error(s) decoding:
Sep 12 12:43:19 ip-10-23-180-96.ap-south-1.compute.internal otelcol-contrib[28494]: * error decoding 'processors': unknown type: "deltatocumulative" for id: "deltatocumulative" (valid values: [resourcedetection batch cumulativetodelta k8sattributes groupbyattrs experimental_metricsgeneration sumologic tail_sampling memory_limiter deltatorate filter redaction span transform attributes groupbytrace probabilistic_sampler routing metricstransform remotetap resource])
Sep 12 12:43:19 ip-10-23-180-96.ap-south-1.compute.internal systemd[1]: otelcol-contrib.service: main process exited, code=exited, status=1/FAILURE
Sep 12 12:43:19 ip-10-23-180-96.ap-south-1.compute.internal systemd[1]: Unit otelcol-contrib.service entered failed state.
Sep 12 12:43:19 ip-10-23-180-96.ap-south-1.compute.internal systemd[1]: otelcol-contrib.service failed.
Sep 12 12:43:20 ip-10-23-180-96.ap-south-1.compute.internal systemd[1]: otelcol-contrib.service holdoff time over, scheduling restart.
Sep 12 12:43:20 ip-10-23-180-96.ap-south-1.compute.internal systemd[1]: Stopped OpenTelemetry Collector Contrib.
I used it without the deltatocumulative processor and this is a snippet from the result. The count never went above 100 (is this intentional?).
Sep 12 14:01:43 : 2024-09-12T14:01:43.114+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.146+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.177+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.183+0530 info log.record.count{} 9
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.241+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.284+0530 info log.record.count{} 65
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.335+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.365+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.383+0530 info log.record.count{} 27
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.469+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.483+0530 info log.record.count{} 37
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.510+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.520+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.561+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.584+0530 info log.record.count{} 53
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.683+0530 info log.record.count{} 13
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.783+0530 info log.record.count{} 17
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.883+0530 info log.record.count{} 41
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.983+0530 info log.record.count{} 69
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.083+0530 info log.record.count{} 81
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.183+0530 info log.record.count{} 45
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.283+0530 info log.record.count{} 55
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.383+0530 info log.record.count{} 55
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.484+0530 info log.record.count{} 6
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.583+0530 info log.record.count{} 88
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.683+0530 info log.record.count{} 3
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.783+0530 info log.record.count{} 88
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.883+0530 info log.record.count{} 57
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.983+0530 info log.record.count{} 25
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.084+0530 info log.record.count{} 87
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.183+0530 info log.record.count{} 20
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.283+0530 info log.record.count{} 6
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.383+0530 info log.record.count{} 4
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.484+0530 info log.record.count{} 85
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.583+0530 info log.record.count{} 66
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.683+0530 info log.record.count{} 42
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.783+0530 info log.record.count{} 55
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.883+0530 info log.record.count{} 45
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.983+0530 info log.record.count{} 36
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.084+0530 info log.record.count{} 15
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.183+0530 info log.record.count{} 25
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.284+0530 info log.record.count{} 71
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.383+0530 info log.record.count{} 9
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.484+0530 info log.record.count{} 13
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.583+0530 info MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
Sep 12 14:01:46 : 2024-09-12T14:01:46.583+0530 info log.record.count{} 91
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.683+0530 info log.record.count{} 3
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.783+0530 info log.record.count{} 29
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.883+0530 info log.record.count{} 63
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.983+0530 info log.record.count{} 33
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.083+0530 info log.record.count{} 22
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.183+0530 info log.record.count{} 70
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.283+0530 info log.record.count{} 3
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.383+0530 info log.record.count{} 38
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.483+0530 info log.record.count{} 35
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.583+0530 info log.record.count{} 79
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.683+0530 info log.record.count{} 35
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.783+0530 info log.record.count{} 72
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.884+0530 info log.record.count{} 94
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.983+0530 info log.record.count{} 96
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.083+0530 info log.record.count{} 47
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.183+0530 info log.record.count{} 5
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.284+0530 info log.record.count{} 78
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.383+0530 info log.record.count{} 52
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.484+0530 info log.record.count{} 39
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.583+0530 info log.record.count{} 72
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.783+0530 info log.record.count{} 57
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.883+0530 info log.record.count{} 13
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.983+0530 info log.record.count{} 41
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:49 : 2024-09-12T14:01:49.083+0530 info log.record.count{} 43
Sep 12 14:01:49 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:49 : 2024-09-12T14:01:49.155+0530 info log.record.count{} 100
Sep 12 14:01:49 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:49 : 2024-09-12T14:01:49.183+0530 info log.record.count{} 25
Sep 12 14:01:49 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:49 : 2024-09-12T14:01:49.283+0530 info log.record.count{} 96
Sep 12 14:01:49 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:49 : 2024-09-12T14:01:49.383+0530 info log.record.count{} 19
Sep 12 14:01:49 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:49 : 2024-09-12T14:01:49.483+0530 info log.record.count{} 28
Sep 12 14:01:49 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:49 : 2024-09-12T14:01:49.583+0530 info log.record.count{} 57
Since the deltatocumulative didn't work, I currently don't have an answer to whether the logs are dropped or not.
Open questions/ideas @djaglowski:
Will decreasing the poll_interval help? We tried decreasing it, but still lost logs. I checked this by summing the output of count from the debug exporter and verifying it against the actual log count for a time range (I did this for a range of just one second).
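For reference, lowering the interval is just a matter of setting poll_interval on the receiver; a minimal sketch (10ms is an arbitrary example value, the default is 200ms):
receivers:
  filelog:
    include: [ /home/Application/login.log ]
    poll_interval: 10ms   # example value; default is 200ms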
@djaglowski It seems like this issue might be related to https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/17846. What do you say? Reducing the polling interval can be beneficial, but only up to a certain point.
@ceevaaa I think you can utilize the fileexporter in this case to gather all the logs and verify the count.
Once the file exceeds 250MB in size, it is moved to the archive folder. And a new file is created replacing the old one.
Just to confirm, the file is definitely moved and /home/Application/login.log is recreated, as opposed to being copied to the archive and then truncated? I'm asking because if it's the latter then I would not be surprised by the loss.
I used it without the deltatocumulative processor and this is the snippet from the result. The count never went more than 100 (is this intentional ?).
I'm not sure of the status of the deltatocumulative processor and only meant to provide the general idea, so the config may not be valid. There may be a way to make it work, but it sounds like you've found a reasonable way to get the total.
The count maxing out at 100 is a consequence of batches being emitted by the filelog receiver once they reach 100 logs.
@ceevaaa I think you can utilize the fileexporter in this case to gather all the logs and verify the count.
I agree with this. If you're not using the deltatocumulative, then dumping counts into a file like this is a better way to be sure your count is accurate.
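A minimal sketch of that verification setup, assuming the file exporter is available in your build (the output path is an arbitrary example):
receivers:
  filelog:
    include: [ /home/Application/login.log ]
exporters:
  file:
    path: /tmp/filelog-verify.json   # arbitrary example path
service:
  pipelines:
    logs:
      receivers: [ filelog ]
      exporters: [ file ]
As far as I know, each line the file exporter writes is a JSON-encoded batch rather than a single record, so the record count has to be extracted from the JSON instead of taken from a plain line count.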
Will decreasing the poll_interval help ?
Only if the receiver is able to keep up with ingestion. For example, if you set the poll interval to 10ns but it takes 100ms to read each time, then you're essentially just telling it to start a new poll as soon as the previous one finishes.
All that said, it sounds like there may be a real issue here. I'd like to reiterate my request for a decent estimate of the average log size. This would help us establish what the behavior should look like.
We know:
- the file is rotated once it reaches 250MB
- rotated files are being archived roughly every 1-2 minutes
If we know the average log size, we can compute how often the file should be rotating. This may or may not lead to an explanation but I'd appreciate if we can start there so I can understand the scenario better.
@djaglowski It seems like this issue might be related to https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/17846. What do you say?
@VihasMakwana I don't see the connection. That issue requires there to be multiple separate files. This issue is about a single file written to by multiple processes.
Here are the updated details about the logs (average taken over 5 different log files) @djaglowski:
- size of log file: ~250MB
- avg no. of log lines: ~800,987
- avg size of each log line: ~325 bytes
- ~250,000 log lines written per minute
- rotated after size >= 250MB
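(A rough back-of-the-envelope check on those numbers, my own arithmetic rather than a measurement: 250,000 lines/min × ~325 bytes/line ≈ 81 MB/min, so a 250MB file should fill in roughly 3 minutes, and at the default 200ms poll_interval that is on the order of 800 new lines per poll.)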
Just to confirm, the file is definitely moved and /home/Application/login.log is recreated, as opposed to being copied to the archive and then truncated? I'm asking because if it's the latter then I would not be surprised by the loss.
I have asked the team to confirm this. For reference, we are using log4j2.
@ceevaaa I think you can utilize the fileexporter in this case to gather all the logs and verify the count.
I used the fileexporter and it worked much better than debug. Thanks @VihasMakwana.
We set the poll_interval to 1us and used the fileexporter to verify the logs. AND IT WORKED.
At least in the limited testing we could do today (manually picking a random client_id and cross-verifying, randomly picking other attributes as well, grepping the log.file.record_number to see if the line numbers match without skipping any lines).
I will continue to test and update you guys here.
PS: Keeping it open for some more time, till we get everything stabilized.
@VihasMakwana I don't see the connection. That issue requires there to be multiple separate files. This issue is about a single file written to by multiple processes.
I see. What I meant to say is that reducing the poll interval will only help to a certain extent; if the file is large, it might not respect the poll_interval and could still cause issues.
Just to confirm, the file is definitely moved and /home/Application/login.log is recreated, as opposed to being copied to the archive and then truncated? I'm asking because if it's the latter then I would not be surprised by the loss.
""Once the file reaches 250MB in size, it is renamed and archived (rotated), and a new log file begins."" - team
Therefore no truncation takes place @djaglowski.
Also, what do you guys think is better, especially in high-throughput scenarios?
A. Reading logs from a file using the filelog receiver. [OR]
B. Instrumenting the Java application so that the logs can be exported via OTLP directly.
I know the answer might depend on multiple factors, but in general, which is better?
In theory it is more efficient to bypass the files, since you then don't have to write, read, and reinterpret the logs again.
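If you do go the OTLP route, the collector side is just an OTLP receiver feeding a logs pipeline; a minimal sketch (endpoint and exporter are placeholders):
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  debug:
service:
  pipelines:
    logs:
      receivers: [ otlp ]
      exporters: [ debug ]   # placeholder; replace with your real exporter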
Component(s)
receiver/filelog
What happened?
Description
I am trying to read logs from a log file.
A bit about the log files
Is there a limit on how quickly the filelog receiver can read logs from a file?
A snippet from the log file
Expected Result
All the log lines are read, processed and sent to next hop.
Actual Result
The filelog receiver misses reading a lot of log lines. In fact, almost 70% of the log lines are missed while reading.
Collector version
v0.104.0
Environment information
Environment
NAME="CentOS Linux" VERSION="7 (Core)" ID="centos" ID_LIKE="rhel fedora" VERSION_ID="7" PRETTY_NAME="CentOS Linux 7 (Core)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:centos:centos:7" HOME_URL="https://www.centos.org/" BUG_REPORT_URL="https://bugs.centos.org/" CENTOS_MANTISBT_PROJECT="CentOS-7" CENTOS_MANTISBT_PROJECT_VERSION="7" REDHAT_SUPPORT_PRODUCT="centos" REDHAT_SUPPORT_PRODUCT_VERSION="7" CPU: 4vCPU RAM: 8GB
OpenTelemetry Collector configuration
Internal Telemetry
In the below image the log receiver skips tons of lines.