open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

feature: add bytes limit for a complete log #17387

Closed yutingcaicyt closed 1 year ago

yutingcaicyt commented 1 year ago

Component(s)

pkg/stanza

Is your feature request related to a problem? Please describe.

Version: v0.61.0. When I use the recombine operator to aggregate log entries, I found some problems:

  1. There is a "max_batch_size" parameter limiting the number of entries in a complete log; however, a bytes limit is what users actually want to set. With "max_batch_size" it is hard to bound the maximum byte size of a complete log.
  2. When "max_batch_size" is reached, the entries in the batch are aggregated and flushed, leaving the batch empty. If further entries belonging to the same log then arrive, they are each flushed individually. For example, with a "max_batch_size" of 5, a 10-line stack trace would be emitted as one 5-line log followed by five single-line logs. The better behavior is to aggregate the remaining entries together, because they originally belong to one complete log.

Describe the solution you'd like

  1. Add a config parameter "max_log_size" to the recombine operator, specifying the maximum byte size of a log. If the byte size of the entries in a batch reaches this limit, all entries in the batch are aggregated and flushed. With this parameter, I think the old "max_batch_size" becomes useless and could be deleted. (See the sketch after this list.)
  2. Once a batch reaches the limit and all of its entries are flushed, the "truncated" state needs to be recorded, so that if further entries belonging to the previous log arrive, the recorded state tells us they should be aggregated together.
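
To make item 1 concrete, here is a minimal sketch of what the configuration struct could look like. The existing field names are modeled on the recombine operator's documented options, and "MaxLogSize" is the hypothetical addition; this is illustrative, not the operator's actual code.

```go
// Illustrative config sketch for the recombine operator with the proposed
// "max_log_size" field added; not the actual implementation.
package recombine

type Config struct {
	IsFirstEntry string `mapstructure:"is_first_entry"` // expression marking the first line of a log
	IsLastEntry  string `mapstructure:"is_last_entry"`  // expression marking the last line of a log
	CombineField string `mapstructure:"combine_field"`  // field whose values get concatenated
	MaxBatchSize int    `mapstructure:"max_batch_size"` // existing: flush after this many entries
	MaxLogSize   int    `mapstructure:"max_log_size"`   // proposed: flush once the batch reaches this many bytes
}
```

In YAML, this would surface as a max_log_size key alongside the existing max_batch_size.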

Describe alternatives you've considered

No response

Additional context

No response

yutingcaicyt commented 1 year ago

@djaglowski Any idea about this feature?

github-actions[bot] commented 1 year ago

Pinging code owners for pkg/stanza: @djaglowski. See Adding Labels via Comments if you do not have permissions to add labels yourself.

djaglowski commented 1 year ago

when the "max_batch_size" is reached... At this time, if some entries which belong to the previous log come in, they would be flushed individually.

This looks like a separate issue from the max_log_size proposal. Can it be addressed separately?

djaglowski commented 1 year ago

Add a config parameter "max_log_size" to the recombine operator, specifying the maximum byte size of a log. If the byte size of the entries in a batch reaches this limit, all entries in the batch are aggregated and flushed.

Sounds reasonable. The immediate question is whether or not it should be the concern of the recombine operator to truncate the final line before batching. It's not clear to me that it should, but your proposal indicates otherwise. I may need some help understanding the considerations here.

djaglowski commented 1 year ago

With this parameter, I think the old "max_batch_size" becomes useless and could be deleted.

I don't see why this would be useless. Perhaps max_log_size would be better, but some may still prefer this one. In any case, I don't see why a user couldn't choose one or the other, or even both options.

djaglowski commented 1 year ago

the "truncate" info needs to be recorded. And if some entries belonging to the previous log come in, then the "truncate" info can tell us these entries should be aggregated.

If I'm understanding correctly, you are suggesting that the truncated data should become the first line in the next batch. Is that right?

djaglowski commented 1 year ago

Regarding truncation, an alternative would be to drop truncated data and any additional lines until is_first_entry/is_last_entry is satisfied. I think your suggestion is preferable though.

yutingcaicyt commented 1 year ago

when the "max_batch_size" is reached... At this time, if some entries which belong to the previous log come in, they would be flushed individually.

This looks like a separate issue from the max_log_size proposal. Can it be addressed separately?

Yeah, these two issues can be addressed separately.

yutingcaicyt commented 1 year ago

Add a config parameter "max_log_size" to the recombine operator, specifying the maximum byte size of a log. If the byte size of the entries in a batch reaches this limit, all entries in the batch are aggregated and flushed.

Sounds reasonable. The immediate question is whether or not it should be the concern of the recombine operator to truncate the final line before batching. It's not clear to me that it should, but your proposal indicates otherwise. I may need some help understanding the considerations here.

Firstly, I think the best time to truncate the log is before the entries are aggregated, because doing it after aggregation would involve redundant work. Secondly, since the recombine operator is what assembles entries into a complete log, adding the truncation step to it is intuitive. In that case, users just need to add a config option to the recombine operator instead of a new operator.

yutingcaicyt commented 1 year ago

With this parameter, I think the old "max_batch_size" becomes useless and could be deleted.

I don't see why this would be useless. Perhaps max_log_size would be better, but some may still prefer this one. In any case, I don't see why a user couldn't choose one or the other, or even both options.

For this: if users want to limit the byte size of a log, 'max_log_size' is precise and simple, and 'max_batch_size' is fully replaced by it. Would users use 'max_batch_size' to limit anything else?

yutingcaicyt commented 1 year ago

the "truncate" info needs to be recorded. And if some entries belonging to the previous log come in, then the "truncate" info can tell us these entries should be aggregated.

If I'm understanding correctly, you are suggesting that the truncated data should become the first line in the next batch. Is that right?

Actually, this solution addresses the second problem I mentioned: once a log is truncated, an 'isTruncated' mark is recorded, and when the remaining entries belonging to the previous log arrive, the mark tells us that a truncation happened at the last flush, so we can aggregate them together.
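
A minimal sketch of how that mark could work, using illustrative types rather than the operator's real ones. A non-matching entry that arrives when the batch is empty is normally emitted on its own (the behavior described in problem 2); the mark makes it a continuation of the truncated log instead:

```go
// Illustrative sketch of the proposed "isTruncated" bookkeeping; all types
// and names here are hypothetical, not the stanza operator's actual API.
package sketch

import "strings"

type recombiner struct {
	batch       []string
	batchBytes  int
	maxLogSize  int
	isTruncated bool // set when the previous flush was forced by the byte limit
}

func (r *recombiner) process(line string, matchesFirst bool) {
	switch {
	case matchesFirst:
		r.flush(false) // a new log begins: emit whatever we were holding
	case len(r.batch) == 0 && !r.isTruncated:
		emit(line) // stray line with no open log: emitted individually (old behavior)
		return
	}
	// With the mark set, a non-matching line falls through to here and is
	// aggregated as a continuation of the truncated log.
	r.isTruncated = false
	r.batch = append(r.batch, line)
	r.batchBytes += len(line)
	if r.batchBytes >= r.maxLogSize {
		r.flush(true) // byte limit hit: flush and remember the truncation
	}
}

func (r *recombiner) flush(truncated bool) {
	if len(r.batch) > 0 {
		emit(strings.Join(r.batch, "\n"))
	}
	r.batch, r.batchBytes, r.isTruncated = nil, 0, truncated
}

func emit(log string) { /* hand the assembled log to the next operator */ }
```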

yutingcaicyt commented 1 year ago

Regarding truncation, an alternative would be to drop truncated data and any additional lines until is_first_entry/is_last_entry is satisfied. I think your suggestion is preferable though.

Yeah, I also considered this alternative, but I think dropping the data outright may not be a very good choice for an agent. Whether such a truncated log is useful should be left to the users to judge.

djaglowski commented 1 year ago

For this: if users want to limit the byte size of a log, 'max_log_size' is precise and simple, and 'max_batch_size' is fully replaced by it. Would users use 'max_batch_size' to limit anything else?

We do have to consider existing users. The filelog receiver is in beta, which means at a minimum that we will try very hard to maintain backwards compatibility in the configuration. (More info)

I agree max_log_size is a generally superior way to define a limitation. That said, I can imagine some scenarios where max_batch_size is still useful. For example, a custom log format may use a fixed number of lines but not have a well-defined pattern for is_first_entry or is_last_entry. In any case, we may be able to deprecate it eventually, but in the near term I think we'll need to retain support for it for existing users.

Edit: One additional consideration here is that max_log_size may be more compute- or memory-intensive. There may be users who value performance but are not constrained by a specific number of bytes.

djaglowski commented 1 year ago

once a log is truncated, an 'isTruncated' mark is recorded, and when the remaining entries belonging to the previous log arrive, the mark tells us that a truncation happened at the last flush, so we can aggregate them together

I agree with the overall functionality but I think we should explore in the implementation whether we really need the mark. It may be enough to just truncate, flush, and insert the remainder as the first line in the new batch.
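
For reference, the mark-free variant could look something like this standalone sketch (the name, signature, and helper are made up for illustration, not proposed API):

```go
// Mark-free alternative: cut the incoming line at the remaining byte budget,
// flush the full batch, and seed the next batch with the remainder.
package sketch

// carryOver assumes used < budget (guaranteed if the batch is flushed as soon
// as it reaches the limit) and, for brevity, that the remainder itself fits
// within the budget; a real implementation would loop.
func carryOver(batch []string, used, budget int, line string, flush func([]string)) []string {
	if room := budget - used; len(line) > room {
		flush(append(batch, line[:room])) // emit the truncated log
		return []string{line[room:]}      // remainder becomes the new batch's first line
	}
	return append(batch, line)
}
```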

djaglowski commented 1 year ago

dropping the data outright may not be a very good choice for an agent. Whether such a truncated log is useful should be left to the users to judge.

I'm good with this. If it proves necessary in the future, it should be easy enough to add a configuration flag for this.

Aneurysm9 commented 1 year ago

How does limiting to a maximum size in bytes work? In what representation is the size measured? Does this require serialization and length checking the serialized data and then throwing away the serialized data?

yutingcaicyt commented 1 year ago

For this: if users want to limit the byte size of a log, 'max_log_size' is precise and simple, and 'max_batch_size' is fully replaced by it. Would users use 'max_batch_size' to limit anything else?

We do have to consider existing users. The filelog receiver is in beta, which means at a minimum that we will try very hard to maintain backwards compatibility in the configuration. (More info)

I agree max_log_size is a generally superior way to define a limitation. That said, I can imagine some scenarios where max_batch_size is still useful. For example, a custom log format may use a fixed number of lines but not have a well-defined pattern for is_first_entry or is_last_entry. In any case, we may be able to deprecate it eventually, but in the near term I think we'll need to retain support for it for existing users.

Edit: One additional consideration here is that max_log_size may be more compute- or memory-intensive. There may be users who value performance but are not constrained by a specific number of bytes.

OK, I agree with that.

yutingcaicyt commented 1 year ago

once a log is truncated, an 'isTruncated' mark is recorded, and when the remaining entries belonging to the previous log arrive, the mark tells us that a truncation happened at the last flush, so we can aggregate them together

I agree with the overall functionality but I think we should explore in the implementation whether we really need the mark. It may be enough to just truncate, flush, and insert the remainder as the first line in the new batch.

There is another point that needs to be discussed: when the byte size of the entries in a batch reaches the limit, how should these entries be handled? I think there are two ways (see the sketch after this list):

  1. Limit the byte size strictly: split the last entry into two parts, flush the previous entries together with the first part, then insert the remaining part into the new batch.
  2. Guarantee the integrity of the log and limit the byte size softly: just flush all the entries in the batch. In this case, the byte size of a log may exceed max_log_size.

Both ways are common in situations with a size limit.
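
To make the difference concrete, a runnable toy comparing the two options on a 12-byte input with a 10-byte limit (all names and values are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	lines := []string{"abcdef", "ghijkl"} // 12 bytes total
	const maxLogSize = 10

	// Option 1, strict: split the last line so the flushed log is exactly
	// 10 bytes; the cut-off remainder seeds the next batch.
	room := maxLogSize - len(lines[0])
	fmt.Println(lines[0] + lines[1][:room])                    // "abcdefghij"
	fmt.Println(lines[1][room:], "-> carried into next batch") // "kl"

	// Option 2, soft: flush both lines intact; the emitted log exceeds the
	// limit by less than one line (12 > 10).
	fmt.Println(strings.Join(lines, "")) // "abcdefghijkl"
}
```
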
yutingcaicyt commented 1 year ago

How does limiting to a maximum size in bytes work? In what representation is the size measured? Does this require serialization and length checking the serialized data and then throwing away the serialized data?

I think if the byte size is limited in the recombine operator, there is no serialization requirement. When the data reaches the recombine operator from the file reader, the body of the entry is a string or a byte array whose byte size can be obtained directly. Are there any issues I missed in this process?
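
A minimal sketch of that point, with a hypothetical body type: when the body is already a string or []byte, the size is just len() and no serialization step is needed.

```go
package sketch

// bodySize returns the byte size of an entry body without serializing it,
// assuming the body is a string or []byte as it is for file-based input.
func bodySize(body interface{}) (int, bool) {
	switch b := body.(type) {
	case string:
		return len(b), true
	case []byte:
		return len(b), true
	default:
		return 0, false // other body shapes would need a defined measurement
	}
}
```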

yutingcaicyt commented 1 year ago

@djaglowski Hello, I would like to work on this feature. Is there anything I should know before starting (besides the contributing guidelines)? Thank you

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.