Open ceevaaa opened 1 week ago
Pinging code owners:
receiver/filelog: @djaglowski
See Adding Labels via Comments if you do not have permissions to add labels yourself.
There are some circumstances where the receiver would not be able to keep up with log rotation. Specifically, if the files are being rotated to a location where they are not found in subsequent polling intervals, or if the log files are deleted before the receiver has had a chance to consume them (due to a rotation configuration that defines the max number of backups).
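For illustration (a minimal sketch with hypothetical paths, not taken from this issue): the first failure mode can often be avoided by letting the include pattern also match the rotated location, as long as the rotated files stay readable. The receiver identifies files by fingerprint, so a file that is moved (rather than copied and truncated) can still be finished from its new location.
receivers:
  filelog:
    include:
      - /var/log/app/current.log      # hypothetical active log file
      - /var/log/app/archive/*.log    # hypothetical rotated location, so moved files are still found
    start_at: beginning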
It sounds like you have a high throughput scenario so it seems possible. Are the rotated files being deleted eventually, and if so, can we establish in some way the frequency? If you can provide an estimate of the max number of backups and the average log size in bytes, then I think we can work out whether or not such a limitation is in play.
The other factor I'm wondering about is whether the debug exporter is too slow to keep up, and may in fact be causing backpressure on the receiver which unnecessarily delays reading the logs. You could consider using the count connector to assess this, because it should be a much more efficient way to count the logs. e.g.
receivers:
  filelog:
    include: ...
connectors:
  count:
processors:
  deltatocumulative:
exporters:
  debug:
service:
  pipelines:
    logs/in:
      receivers: [ filelog ]
      exporters: [ count ]
    metrics/out:
      receivers: [ count ]
      processors: [ deltatocumulative ] # aggregate "+1"s into cumulative count
      exporters: [ debug ] # only print the counts, rather than the logs themselves
Let me share a little more about the directory and how the log rotation is being handled.
Below is the directory structure.
├── Application
│ ├── login.log
│ ├── main.java
│ ├── archive
│ │ ├── 2024-09-11.login.01.log
│ │ ├── 2024-09-11.login.02.log
│ │ ├── 2024-09-11.login.03.log
│ │ ├── .....
The logs are continuously written to login.log by main.java. Once the file exceeds 250MB in size, it is moved to the archive folder and a new file is created, replacing the old one.
Here is the snapshot that describes how quickly login.log files are archived. The intervals vary from 1m, 2m, and so on.
The config you shared with deltatocumulative didn't work. Maybe I am missing something here.
Sep 12 12:43:19 ip-10-23-180-96.ap-south-1.compute.internal systemd[1]: Started OpenTelemetry Collector Contrib.
Sep 12 12:43:19 ip-10-23-180-96.ap-south-1.compute.internal otelcol-contrib[28494]: 2024/09/12 12:43:19 collector server run finished with error: failed to get config: cannot unmarshal the configuration: 1 error(s) decoding:
Sep 12 12:43:19 ip-10-23-180-96.ap-south-1.compute.internal otelcol-contrib[28494]: * error decoding 'processors': unknown type: "deltatocumulative" for id: "deltatocumulative" (valid values: [resourcedetection batch cumulativetodelta k8sattributes groupbyattrs experimental_metricsgeneration sumologic tail_sampling memory_limiter deltatorate filter redaction span transform attributes groupbytrace probabilistic_sampler routing metricstransform remotetap resource])
Sep 12 12:43:19 ip-10-23-180-96.ap-south-1.compute.internal systemd[1]: otelcol-contrib.service: main process exited, code=exited, status=1/FAILURE
Sep 12 12:43:19 ip-10-23-180-96.ap-south-1.compute.internal systemd[1]: Unit otelcol-contrib.service entered failed state.
Sep 12 12:43:19 ip-10-23-180-96.ap-south-1.compute.internal systemd[1]: otelcol-contrib.service failed.
Sep 12 12:43:20 ip-10-23-180-96.ap-south-1.compute.internal systemd[1]: otelcol-contrib.service holdoff time over, scheduling restart.
Sep 12 12:43:20 ip-10-23-180-96.ap-south-1.compute.internal systemd[1]: Stopped OpenTelemetry Collector Contrib.
I used it without the deltatocumulative processor and this is a snippet from the result. The count never went above 100 (is this intentional?).
Sep 12 14:01:43 : 2024-09-12T14:01:43.114+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.146+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.177+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.183+0530 info log.record.count{} 9
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.241+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.284+0530 info log.record.count{} 65
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.335+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.365+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.383+0530 info log.record.count{} 27
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.469+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.483+0530 info log.record.count{} 37
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.510+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.520+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.561+0530 info log.record.count{} 100
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.584+0530 info log.record.count{} 53
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.683+0530 info log.record.count{} 13
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.783+0530 info log.record.count{} 17
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.883+0530 info log.record.count{} 41
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:43 : 2024-09-12T14:01:43.983+0530 info log.record.count{} 69
Sep 12 14:01:43 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.083+0530 info log.record.count{} 81
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.183+0530 info log.record.count{} 45
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.283+0530 info log.record.count{} 55
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.383+0530 info log.record.count{} 55
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.484+0530 info log.record.count{} 6
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.583+0530 info log.record.count{} 88
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.683+0530 info log.record.count{} 3
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.783+0530 info log.record.count{} 88
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.883+0530 info log.record.count{} 57
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:44 : 2024-09-12T14:01:44.983+0530 info log.record.count{} 25
Sep 12 14:01:44 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.084+0530 info log.record.count{} 87
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.183+0530 info log.record.count{} 20
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.283+0530 info log.record.count{} 6
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.383+0530 info log.record.count{} 4
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.484+0530 info log.record.count{} 85
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.583+0530 info log.record.count{} 66
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.683+0530 info log.record.count{} 42
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.783+0530 info log.record.count{} 55
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.883+0530 info log.record.count{} 45
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:45 : 2024-09-12T14:01:45.983+0530 info log.record.count{} 36
Sep 12 14:01:45 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.084+0530 info log.record.count{} 15
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.183+0530 info log.record.count{} 25
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.284+0530 info log.record.count{} 71
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.383+0530 info log.record.count{} 9
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.484+0530 info log.record.count{} 13
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.583+0530 info MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
Sep 12 14:01:46 : 2024-09-12T14:01:46.583+0530 info log.record.count{} 91
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.683+0530 info log.record.count{} 3
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.783+0530 info log.record.count{} 29
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.883+0530 info log.record.count{} 63
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:46 : 2024-09-12T14:01:46.983+0530 info log.record.count{} 33
Sep 12 14:01:46 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.083+0530 info log.record.count{} 22
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.183+0530 info log.record.count{} 70
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.283+0530 info log.record.count{} 3
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.383+0530 info log.record.count{} 38
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.483+0530 info log.record.count{} 35
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.583+0530 info log.record.count{} 79
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.683+0530 info log.record.count{} 35
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.783+0530 info log.record.count{} 72
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.884+0530 info log.record.count{} 94
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:47 : 2024-09-12T14:01:47.983+0530 info log.record.count{} 96
Sep 12 14:01:47 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.083+0530 info log.record.count{} 47
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.183+0530 info log.record.count{} 5
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.284+0530 info log.record.count{} 78
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.383+0530 info log.record.count{} 52
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.484+0530 info log.record.count{} 39
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.583+0530 info log.record.count{} 72
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.783+0530 info log.record.count{} 57
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.883+0530 info log.record.count{} 13
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:48 : 2024-09-12T14:01:48.983+0530 info log.record.count{} 41
Sep 12 14:01:48 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:49 : 2024-09-12T14:01:49.083+0530 info log.record.count{} 43
Sep 12 14:01:49 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:49 : 2024-09-12T14:01:49.155+0530 info log.record.count{} 100
Sep 12 14:01:49 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:49 : 2024-09-12T14:01:49.183+0530 info log.record.count{} 25
Sep 12 14:01:49 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:49 : 2024-09-12T14:01:49.283+0530 info log.record.count{} 96
Sep 12 14:01:49 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:49 : 2024-09-12T14:01:49.383+0530 info log.record.count{} 19
Sep 12 14:01:49 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:49 : 2024-09-12T14:01:49.483+0530 info log.record.count{} 28
Sep 12 14:01:49 : {"kind": "exporter", "data_type": "metrics", "name": "debug"}
Sep 12 14:01:49 : 2024-09-12T14:01:49.583+0530 info log.record.count{} 57
Since the deltatocumulative didn't work, I currently don't have an answer to whether the logs are dropped or not.
Open questions/ideas @djaglowski:
Will decreasing the poll_interval help? We tried decreasing it, but still lost logs. I checked this by summing the output of count from the debug exporter and verifying it against the actual log count for a time range (I did this for a range of just one second).
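For reference, lowering the interval is just a matter of setting poll_interval on the receiver; a minimal sketch (10ms is an arbitrary example value, the default is 200ms):
receivers:
  filelog:
    include: [ /home/Application/login.log ]
    poll_interval: 10ms   # example value; default is 200ms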
@djaglowski It seems like this issue might be related to https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/17846. What do you say? Reducing the polling interval can be beneficial, but only up to a certain point.
@ceevaaa I think you can utilize the fileexporter in this case to gather all the logs and verify the count.
Once the file exceeds 250MB in size, it is moved to the archive folder. And a new file is created replacing the old one.
Just to confirm, the file is definitely moved and /home/Application/login.log is recreated, as opposed to being copied to the archive and then truncated? I'm asking because if it's the latter then I would not be surprised by the loss.
I used it without the deltatocumulative processor and this is the snippet from the result. The count never went more than 100 (is this intentional ?).
I'm not sure of the status of the deltatocumulative processor and only meant to provide the general idea, so the config may not be valid. There may be a way to make it work, but it sounds like you've found a reasonable way to get the total.
The count maxing out at 100 is a consequence of batches being emitted by the filelog receiver once they reach 100 logs.
@ceevaaa I think you can utilize the fileexporter in this case to gather all the logs and verify the count.
I agree with this. If you're not using the deltatocumulative, then dumping counts into a file like this is a better way to be sure your count is accurate.
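A minimal sketch of that verification setup, assuming the file exporter is available in your build (the output path is an arbitrary example):
receivers:
  filelog:
    include: [ /home/Application/login.log ]
exporters:
  file:
    path: /tmp/filelog-verify.json   # arbitrary example path
service:
  pipelines:
    logs:
      receivers: [ filelog ]
      exporters: [ file ]
As far as I know, each line the file exporter writes is a JSON-encoded batch rather than a single record, so the record count has to be extracted from the JSON instead of taken from a plain line count.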
Will decreasing the poll_interval help ?
Only if the receiver is able to keep up with ingestion. For example, if you set the poll interval to 10ns but it takes 100ms to read each time, then you're essentially just telling it to start a new poll as soon as the previous one finishes.
All that said, it sounds like there may be a real issue here. I'd like to reiterate my request for a decent estimate of the average log size. This would help us establish what the behavior should look like.
We know:
- the file is rotated once it reaches 250MB
- rotated files are being archived roughly every 1-2 minutes
If we know the average log size, we can compute how often the file should be rotating. This may or may not lead to an explanation but I'd appreciate if we can start there so I can understand the scenario better.
@djaglowski It seems like this issue might be related to https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/17846. What do you say?
@VihasMakwana I don't see the connection. That issue requires there to be multiple separate files. This issue is about a single file written to by multiple processes.
Here are the updated details about the logs (average taken over 5 different log files) @djaglowski:
- size of log file: ~250MB
- avg no. of log lines: ~800,987
- avg size of each log line: ~325 bytes
- ~250,000 log lines written per minute
- rotated after size >= 250MB
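(A rough back-of-the-envelope check on those numbers, my own arithmetic rather than a measurement: 250,000 lines/min × ~325 bytes/line ≈ 81 MB/min, so a 250MB file should fill in roughly 3 minutes, and at the default 200ms poll_interval that is on the order of 800 new lines per poll.)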
Just to confirm, the file is definitely moved and /home/Application/login.log is recreated, as opposed to being copied to the archive and then truncated? I'm asking because if it's the latter then I would not be surprised by the loss.
I have asked the team to confirm this. For reference, we are using log4j2.
@ceevaaa I think you can utilize the fileexporter in this case to gather all the logs and verify the count.
I used the fileexporter and it worked much better than debug. Thanks @VihasMakwana.
We set the poll_interval to 1us and used the fileexporter to verify the logs. AND IT WORKED.
At least in the limited testing we could do today (manually picking a random client_id and cross-verifying, randomly picking other attributes as well, grepping the log.file.record_number to see if the line numbers match without skipping any lines).
I will continue to test and update you guys here.
PS: Keeping it open for some more time, till we get everything stabilized.
@VihasMakwana I don't see the connection. That issue requires there to be multiple separate files. This issue is about a single file written to by multiple processes.
I see. What I meant to say is that reducing the poll interval will only help to a certain extent; if the file is large, it might not respect the poll_interval and could still cause issues.
Just to confirm, the file is definitely moved and /home/Application/login.log is recreated, as opposed to being copied to the archive and then truncated? I'm asking because if it's the latter then I would not be surprised by the loss.
""Once the file reaches 250MB in size, it is renamed and archived (rotated), and a new log file begins."" - team
Therefore no truncation takes place @djaglowski.
Also, what do you guys think is better, especially in high-throughput scenarios?
A. Reading logs from a file using the filelog receiver. [OR]
B. Instrumenting the Java application so that the logs can be exported via OTLP directly.
I know the answer might depend on multiple factors, but in general, which is better?
In theory it is more efficient to bypass the files, since you then don't have to write, read, and reinterpret the logs again.
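If you do go the OTLP route, the collector side is just an OTLP receiver feeding a logs pipeline; a minimal sketch (endpoint and exporter are placeholders):
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  debug:
service:
  pipelines:
    logs:
      receivers: [ otlp ]
      exporters: [ debug ]   # placeholder; replace with your real exporter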
Component(s)
receiver/filelog
What happened?
Description
I am trying to read logs from a log file.
A bit about the log files
Is there a limit on how quickly the filelog receiver can read logs from a file?
A snippet from the log file
Expected Result
All the log lines are read, processed and sent to next hop.
Actual Result
The filelog receiver misses reading a lot of log lines. In fact, almost 70% of the log lines are missed while reading.
Collector version
v0.104.0
Environment information
Environment
NAME="CentOS Linux" VERSION="7 (Core)" ID="centos" ID_LIKE="rhel fedora" VERSION_ID="7" PRETTY_NAME="CentOS Linux 7 (Core)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:centos:centos:7" HOME_URL="https://www.centos.org/" BUG_REPORT_URL="https://bugs.centos.org/" CENTOS_MANTISBT_PROJECT="CentOS-7" CENTOS_MANTISBT_PROJECT_VERSION="7" REDHAT_SUPPORT_PRODUCT="centos" REDHAT_SUPPORT_PRODUCT_VERSION="7" CPU: 4vCPU RAM: 8GB
OpenTelemetry Collector configuration
Internal Telemetry
In the below image the log receiver skips tons of lines.