Closed captainfalcon23 closed 3 months ago
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
@captainfalcon23 I think the essence of the problem is exporting logs to different paths based on specific attributes, thus achieving log isolation. To solve this, can we use the existing routingprocessor? The routingprocessor reads specific resource attributes and exports the logs to different fileexporters based on the attribute values.
So I am starting to look at this, but still don't think it will meet the requirement. Let's say I get logs from a container with the following details:
- Container Name: zabbix-agent-1234567
- Pod Name: zabbix-agent
- Namespace: foo
- Cluster (static value): ABC
In this case, I want to save logs to a location like: /mnt/data/logs/$Cluster/$Namespace/$PodName/$ContainerName.log
If I use the routing processor, I think I still need to hard-code values, and I have to know exactly where my pods are running; if something changes, I need to manually update the configuration.
Another example is host logs (e.g. /var/log/messages). I want these to go to a file dynamically named "$hostname.log". But again, it seems impossible?
What are your thoughts @atingchen ?
If we need to implement dynamic file export paths, I think we need to follow these three steps:
1. Add a `subdirectory_attribute` setting to the configuration. It will be used to generate the export path:
```yaml
path: /data
subdirectory_attribute:
  - cluster
  - namespace
```
2. Maintain a mapping between each file export path and its `io.Writer`.
3. Before exporting telemetry data each time, obtain the `io.Writer` based on the file export path and export to it.
If we implement it this way, the existing code needs to be significantly modified. I'd like to hear your thoughts on this. @atoulme @djaglowski Thank you.
There does seem to be some valuable utility here but it would certainly add a lot of complexity and responsibility to the component. I'm not opposed to it in theory but don't think I can make it a priority to help with it.
No worries... I am actually about to start looking at some other solution as I can't get my use-case to work as it stands.
Likewise, I would use the routingprocessor and limit this to a set of known paths.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers
. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Happy for these to be closed if you guys like. I moved to another, more flexible solution (vector.dev)
Hi!
I also need this, and I would like to implement this.
I'm thinking of only supporting a single resource-level attribute for the subdirectory_attribute. Multiple attributes can be combined with the transform processor, and attributes can be moved to the resource level using the groupbyattrs processor.
My plan for the implementation (described for logs, but same goes for traces and metrics):
For the configuration, I would add:
```yaml
group_by_attribute:
  subdirectory_attribute: "my_attribute" # (or maybe subpath_attribute instead, and allow '/' in it)
  max_open_files: 1000 # (this would set the size of the LRU cache)
```
What do you think? @atingchen @atoulme @djaglowski
This feature would be interesting. We have a use case where we save application data to file and then send it to cold storage, in case we need the raw logs in the future. We save the files per application name, which is a field in the log. We achieve this at the moment with logstash, but we are making the effort to move the logging stack to otel.
It seems there is ongoing interest in this feature and someone willing to implement it with a detailed proposal. (Thanks for stepping forward @adam-kiss-sg.) I'm in favor of this and will be happy to review PRs.
The most immediate question in my mind is the config. Looking at the proposal, a few quick thoughts and a longer discussion:
```yaml
group_by_attribute:
  subdirectory_attribute: "my_attribute" # (or maybe subpath_attribute instead, and allow '/' in it)
  max_open_files: 1000 # (this would set the size of the LRU cache)
```
- I think `group_by` is sufficient.
- We should include an `enabled` flag. I can't seem to find the previous discussion on this, but if I recall correctly it is beneficial to helm charts and other tools which may have trouble distinguishing between nil and empty structs when rendering config.

Defining the dynamic element of the path is the trickiest part of this proposal. One challenge I have not seen addressed is that our current `path` setting is used to specify not only the directory but the actual file name, e.g. `path: my_dir/my_file.json`.
However, if we introduce a `subdirectory_attribute`, the user should still be able to specify a file name as well. The following doesn't quite get us there:
```yaml
# my_dir/${my.attr}/????
path: my_dir # but no file name
group_by:
  subdirectory_attribute: my.attr
```
One way to address this is to add another setting within `group_by` to define the file name:
```yaml
# my_dir/${my.attr}/my_file.json
path: my_dir # but no file name
group_by:
  subdirectory_attribute: my.attr
  filename: my_file.json
```
However, this seems awkward because it differs so much from the way `path` currently works. It's also not very flexible. Why not allow the file name to be the dynamic part?
To that end, when `group_by` is enabled, the path could be required to contain exactly one `*` (or equivalent indicator), which will be replaced by the attribute value. This keeps the file-name responsibility in the `path` setting and allows the user more flexibility. `subdirectory_attribute` would not be the right term in this case, so perhaps just `attribute` or `replace_attribute`.
```yaml
# always my_dir/my_file.json
file/static:
  path: my_dir/my_file.json

# my_dir/foo/my_file.json
# my_dir/bar/my_file.json
file/dynamic_subdir:
  path: my_dir/*/my_file.json
  group_by:
    enabled: true
    replace_attribute: my.attr

# my_dir/foo/my_nested_dir/my_file.json
# my_dir/bar/my_nested_dir/my_file.json
file/dynamic_subdir_with_nested_subdirs:
  path: my_dir/*/my_nested_dir/my_file.json
  group_by:
    enabled: true
    replace_attribute: my.attr

# my_dir/foo.json
# my_dir/bar.json
file/dynamic_filename:
  path: my_dir/*.json
  group_by:
    enabled: true
    replace_attribute: my.attr
```
WDYT?
Hi @djaglowski !
Didn't see your comment before I opened the PR.
Current config values in the PR:
- `sub_path_resource_attribute`: the resource attribute whose value is appended to the `path` config value. When this value is set, the rotation setting is ignored.
- `delete_sub_path_resource_attribute`: when enabled, the attribute referenced by `sub_path_resource_attribute` is removed from the telemetry data before writing it to a file.
- `discard_if_attribute_not_found`: when enabled, if the resource doesn't have the attribute referenced by `sub_path_resource_attribute`, the telemetry data is discarded.
- `default_sub_path`: sub path used when the resource doesn't have the attribute referenced by `sub_path_resource_attribute` and `discard_if_attribute_not_found` is set to false. If `discard_if_attribute_not_found` is set to true, this setting is ignored.
- `max_open_files`: sets the size of the LRU cache of open files.
- `auto_create_directories`: [default: true] when enabled, if the directory of the destination file does not exist, it will be created. If set to false and the directory does not exist, the write will fail and return an error.
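Putting these settings together, a config for this version of the PR might look like the following (values are illustrative and drawn only from the setting names above):

```yaml
exporters:
  file:
    path: /data/logs
    sub_path_resource_attribute: k8s.namespace.name
    delete_sub_path_resource_attribute: false
    discard_if_attribute_not_found: false
    default_sub_path: unknown
    auto_create_directories: true
    max_open_files: 1000
```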
I like your suggestion about the `*` in the path (instead of my original idea to use the path as a prefix).
I also have some questions about implementation, but I will ask them in the pr instead.
@adam-kiss-sg, let's keep the conversation here until we agree on the design. I appreciate the many edge cases you are thinking of but I suggest we should identify a smaller scope of changes which can be accepted first. We can still anticipate other changes may be made later.
sub_path_resource_attribute
It's certainly easier to restrict this to resource attributes, so let's start there. When discussing earlier it seemed likely we may want to group by log record attribute, but I think we can add a separate field later if necessary and just make them mutually exclusive. As far as naming, I think `resource_attribute` is sufficient.
delete_sub_path_resource_attribute
We should leave this out of the initial implementation, though I'm not opposed to adding it later. We can work out details on a later PR.
max_open_files
👍
discard_if_attribute_not_found & default_sub_path
These can be left out of the initial PR and instead we can just choose a reasonable default behavior. I suggest initially we just log a warning and drop the data. Assuming we eventually allow alternate behaviors, I think we can combine these settings, but again, would prefer to work those details out later.
auto_create_directories
We can consider this in a later PR as well. In the meantime, users who require strict security here should not enable `group_by` functionality.
Pulling in the other design questions from the PR:
Error handling
- when creating a file fails because of a permission issue
- when creating a file fails because the parent directory doesn't exist, and auto_create_directories is set to false
- when creating a directory fails
- when closing a file fails

I'm mostly concerned about cases where a batch of telemetry data contains multiple resources, and some of them are written successfully while others fail.
I think all four cases should log an error. In the first three, the failure means we can't write, so we should continue to the next resource. The last case isn't really actionable, though, so just logging an error is sufficient.
Generally speaking, these kinds of failures will typically result from the same cause and will require user action, so it should be enough to make noise when things aren't working. We should continue running and writing what we can.
Data modification
Should the exporter modify the telemetry data? I use ResourcesX.MoveTo() in my code, because it showed much better performance compared to CopyTo. There is also the delete_sub_path_resource_attribute config which would remove an attribute from the resource. If a pipeline contains multiple exporters, would this modify the data for the other exporters as well? What's the best practice here?
We have a way of handling this, by indicating in the factory setup that the component will mutate data. Example. It's fairly uncommon for exporters to do this but it is supported and reasonable if it improves performance. That said, there is a performance downside to consider as well, specifically that by setting this flag we're just pushing the copy upstream of the exporter. I'd lean towards setting the flag to keep things simple.
Rotation
I disabled rotation when group_by is active, mainly because I'm concerned about the interaction between rotation (using lumberjack) and the group_by functionality. You can also implement some kind of rotation with group_by by adding a date to the resource attribute, so this didn't seem so much of an issue (although it's time based, not size based, so not exactly the same functionality). What do you think about this?
We should be able to interoperate with the existing rotation configuration. The `buildFileWriter` function does most of what we need. It looks like we just need to pass in the file path instead of pulling it from the config. Rotation should then be handled automatically for each base file.
README.md
I noticed this line in the README.md, should I remove it?
> This intended for primarily for debugging Collector without setting up backends.
I think it's reasonable to remove. It's probably still the most common use case, but I don't see any reason we should constrain it for users.
> sub_path_resource_attribute
> It's certainly easier to restrict this to resource attributes, so let's start there. When discussing earlier it seemed likely we may want to group by log record attribute, but I think we can add a separate field later if necessary and just make them mutually exclusive. As far as naming, I think resource_attribute is sufficient.
:+1:
> delete_sub_path_resource_attribute
> We should leave this out of the initial implementation, though I'm not opposed to adding it later. We can work out details on a later PR.
:+1:
> discard_if_attribute_not_found & default_sub_path
> These can be left out of the initial PR and instead we can just choose a reasonable default behavior. I suggest initially we just log a warning and drop the data. Assuming we eventually allow alternate behaviors, I think we can combine these settings, but again, would prefer to work those details out later.
Again, I understand the benefit of keeping this out of the configuration at first, but I'm not sure logging a warning for every dropped resource is a good idea. It can easily fill up the collector's log and cause issues if the disk gets full (this can easily happen in a containerized environment). I would suggest either dropping these logs silently (or with debug level), restricting the warning to one per hour, or going with a different approach (e.g. handling this the same as if the attribute contained an empty string).
> auto_create_directories
> We can consider this in a later PR as well. In the meantime, users who require strict security here should not enable group_by functionality.
:+1:
> Data modification
> ...
> We have a way of handling this, by indicating in the factory setup that the component will mutate data. Example. It's fairly uncommon for exporters to do this but it is supported and reasonable if it improves performance. That said, there is a performance downside to consider as well, specifically that by setting this flag we're just pushing the copy upstream of the exporter. I'd lean towards setting the flag to keep things simple.
With delete_sub_path_resource_attribute gone, this is really just a question of using CopyTo vs. MoveTo, so I would stick to using CopyTo in the exporter and not set the flag (there is no performance gain if the upstream just does the copy instead). It's good to know about the option though :)
> Rotation
> ...
> We should be able to interoperate with the existing rotation configuration. The buildFileWriter does most of what we need. It looks like we just need to pass in the file path instead of pulling it from the config. Rotation should then be handled automatically for each base file.
Yes, on the coding side this would be simple. But I have concerns about the interaction between group_by (which handles multiple files) and rotation (which also handles multiple files). Originally this was just a feeling, but now I have looked into the lumberjack code and collected some risks:
On the other hand, I don't see much real use case for using group_by and rotation together. If you have many files with group_by, it's unlikely you need to rotate them. If you have just a handful of files, you don't really need group_by; you could simply use a routing connector with multiple fileexporters.
I simply think the cons outweigh the benefits on this one; it would add unnecessary runtime complexity.
> README.md
> I noticed this line in the README.md, should I remove it? This intended for primarily for debugging Collector without setting up backends.
> I think it's reasonable to remove. It's probably still the most common use case, but I don't see any reason we should constrain it for users.
:+1:
PS: I created the PR because I need this functionality asap in a project I'm working on, and I plan to use it from my fork until we figure out the quirks. I'm more than happy to take it step by step, creating multiple PRs if necessary; I will just keep using my fork in the meantime.
> Again, I understand the benefit of keeping this out of the configuration at first, but I'm not sure logging a warning for every dropped resource is a good idea. It can easily fill up the collector's log and cause issues if the disk gets full (this can easily happen in a containerized environment). I would suggest either dropping these logs silently (or with debug level), or restricting the warning to 1/hour, or going with a different approach (eg: handling this the same as if the attribute contained an empty string).
I'd rather we use debug log than get involved in rate limiting, etc. That is a problem for overall collector telemetry configuration.
> Lumberjack has a known issue: https://github.com/natefinch/lumberjack/issues/56 - this is not an issue without group_by because there is only one lumberjack.Logger instance, but group_by would potentially create and close many Logger instances.
This does appear to be a serious issue (though confusingly there are three PRs to fix it and no movement on it). In any case, I think this probably justifies holding off on the change for now. I'm including my thoughts on the other points anyways because if this is ever resolved I think we should have a clear path forward.
> The max_days rotation setting won't work reliably, because by the time lumberjack would remove the old files, there could be no lumberjack Logger running with that filename.
If there's no logger with that filename, then we're not writing to the log and there's no need for rotation or cleanup. This doesn't seem like a problem to me.
> When lumberjack checks files for rotation, it lists all files in the directory, which can be a real performance issue with tens of thousands of files in the same directory (which is easily possible with group_by accumulating files over weeks or months). This function is called when a new Logger is started, although in a separate goroutine.
The only case where this matters is if the user chooses to put all files in one directory, e.g. `my_dir/*.json` instead of `my_dir/*/my_file.json`. I think it's enough to document it as a performance optimization.
> On the other hand, I don't see much real use-case for using group_by and rotation together. If you have many files with group_by, it's unlikely you need to rotate them. If you have just a handful of files, you don't really need group_by, you could simply use a routing connector with multiple fileexporters.
I'm much less sure that this is such an edge case. Anyone currently relying on rotation is one simple step away from taking advantage of the new feature and still expecting rotation. e.g. If you need rotation for N logs/second, then choose to split the log stream into two, you would still want rotation for N/2 logs/second.
@djaglowski I added a first PR: it contains no changes to behavior, only the restructuring needed to add the group_by functionality. As a reference I will keep my original PR around.
Thanks for merging my first PR @djaglowski. I will close my original PR and create a new, clean one soon.
There is one last question that came up as I was starting to use my fork in our system: the fileexporter currently creates new files with 600 permissions, but I need at least 640 for my use case. What do you think would be the best approach?
644 seems reasonable to me but we may need to make it configurable at some point. Maybe best to treat it as a separate issue. That discussion could happen in parallel and we can pick it up in your PR if it's settled.
Given that https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/31396 is merged, ok to close this issue?
Component(s)
exporter/file
Is your feature request related to a problem? Please describe.
I want to send our Kubernetes pod logs to an OpenTelemetry server and then write them to log files on disk based on attributes within the logs. For example: /data/logs/NAMESPACE_NAME/POD_NAME/CONTAINER_NAME/out.log
Describe the solution you'd like
As above, allow the file exporter to be able to read attributes of the logs and dynamically generate the log location.
Describe alternatives you've considered
No response
Additional context
No response