opensearch-project / data-prepper

OpenSearch Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

Add Syslog Source #4511

Open kclinden opened 6 months ago

kclinden commented 6 months ago

Is your feature request related to a problem? Please describe. I need to be able to collect raw syslog traffic from endpoints such as network devices. Today this would require some sort of log collector beyond data prepper such as Logstash.

Describe the solution you'd like Add a syslog source plugin similar to Logstash https://www.elastic.co/guide/en/logstash/current/plugins-inputs-syslog.html

Describe alternatives you've considered (Optional) Using logstash or fluentbit instead :(

Additional context Add any other context or screenshots about the feature request here.

KarstenSchnitter commented 6 months ago

Thanks for filing this issue. It is related to #2162. Let me share a few thoughts on your request.

The syslog protocol is specified in RFC5424. There are two transport mappings: TLS over TCP in RFC5425 and plain UDP in RFC5426. This separation gives an indication of the required implementations:

  1. There needs to be a TCP source with TLS support, as required by RFC5424. UDP support is only recommended as a "SHOULD", so it can be added later.
  2. There needs to be a syslog processor to parse the message format specified in RFC5424. This can be done with a grok configuration, though some performance optimisation will probably be needed, perhaps in a dedicated processor. A dedicated processor might also be necessary to correctly map the predefined values within certain fields.
  3. The syslog event contains a message (MSG) as payload. DataPrepper should at least support JSON parsing of this message out of the box. Again, this can be part of a pipeline; a full configuration example will probably have several steps.
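To make step 2 concrete, the RFC5424 header can be pulled apart with a single regular expression, which is essentially what a grok pattern does. The sketch below is purely illustrative; the pattern and field names are my own assumptions, not Data Prepper's actual grok configuration:

```python
import re

# Minimal sketch of RFC 5424 header parsing, roughly what a grok-based
# syslog processor would produce. Pattern and field names are illustrative
# assumptions, not Data Prepper's configuration.
RFC5424 = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<version>\d) "
    r"(?P<timestamp>\S+) (?P<hostname>\S+) (?P<appname>\S+) "
    r"(?P<procid>\S+) (?P<msgid>\S+) (?P<sd>-|\[.*?\])"
    r"(?: (?P<msg>.*))?$"
)

def parse_rfc5424(line: str) -> dict:
    m = RFC5424.match(line)
    if m is None:
        raise ValueError("not a well-formed RFC 5424 message")
    fields = m.groupdict()
    # PRI encodes facility * 8 + severity (RFC 5424, Section 6.2.1).
    pri = int(fields.pop("pri"))
    fields["facility"], fields["severity"] = divmod(pri, 8)
    return fields

# Example message adapted from RFC 5424, Section 6.5.
example = ("<34>1 2003-10-11T22:14:15.003Z mymachine.example.com "
           "su - ID47 - 'su root' failed for lonvick on /dev/pts/8")
parsed = parse_rfc5424(example)
```

A real processor would additionally validate the timestamp and decode the structured-data element instead of capturing it as an opaque string.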

From my experience with different logging systems within SAP, syslog has several challenges. Correctly handling TCP connections with regard to load-balancing and keep-alives is not easy, and there is likely a great variety of load scenarios. There might be single applications that are silent for a long period but keep the TCP connection open; DataPrepper needs to manage its resources well in that scenario. On the other hand, there might be a high-throughput rsyslog process firing hundreds of thousands of messages per second, where buffering and throughput are the challenge. Proper backpressure at the TCP ACK level would be a good idea. In summary, a TCP source for DataPrepper should expose the configuration options necessary to tune it for those situations. The TCP input plugin of Logstash is very basic and does not perform particularly well in either of them. If UDP transport is used, the issues of TCP connection state are absent, but so is any backpressure mechanism: an overloaded DataPrepper can only drop logs or crash entirely.
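One concrete piece of the TCP work: RFC5425 prescribes octet-counted framing ("MSG-LEN SP MSG"), which a TCP/TLS source must implement before any parsing, including handling frames that arrive split across reads. A minimal sketch, with the message bytes invented for illustration:

```python
def frame(message: bytes) -> bytes:
    # Octet-counted framing per RFC 5425, Section 4.3: "MSG-LEN SP MSG".
    return str(len(message)).encode("ascii") + b" " + message

def deframe(buffer: bytes) -> tuple[list[bytes], bytes]:
    """Split complete frames out of a receive buffer.

    Returns (complete_messages, leftover_bytes); the leftover is kept
    and prepended to the next read from the socket.
    """
    messages = []
    while True:
        length, sep, rest = buffer.partition(b" ")
        if not sep or not length.isdigit():
            break  # length prefix not fully received yet
        n = int(length)
        if len(rest) < n:
            break  # frame body not fully received yet
        messages.append(rest[:n])
        buffer = rest[n:]
    return messages, buffer

# One complete frame plus the first 5 bytes of a second one, as a
# stand-in for a partial TCP read.
stream = frame(b"<34>1 ... first") + frame(b"<34>1 ... second")[:5]
msgs, leftover = deframe(stream)
```

A production source would also cap the accepted frame length so a malicious or broken sender cannot make the receiver buffer unboundedly.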

RFC5424 is relatively strict about the format of most of its fields, but not all syslog generators follow it tightly. There may be deviations, for example in the date format, or in using quotation marks (") to delimit fields and allow for spaces; the latter is done by CloudFoundry, for example. Ideally, DataPrepper would be resilient against those particularities. As always, it is possible that parsing fails and messages need to be sent to a DLQ.
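Put together, a future pipeline might look something like the sketch below. The `syslog` source and its options are hypothetical (that plugin is exactly what this issue requests), and the grok pattern name follows Logstash's pattern library, which may differ in Data Prepper; the `grok` and `parse_json` processors and the sink-level DLQ do exist today:

```yaml
# Hypothetical pipeline sketch. The "syslog" source below does not
# exist yet; its options are illustrative only.
syslog-pipeline:
  source:
    syslog:
      port: 6514
      ssl: true
  processor:
    - grok:
        match:
          message: ['%{SYSLOG5424LINE}']   # step 2: parse the RFC5424 header
    - parse_json:
        source: syslog5424_msg             # step 3: parse the MSG payload as JSON
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: syslog-logs
        dlq_file: /tmp/syslog-dlq.json     # unparseable events go to a DLQ
```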

Zurkian commented 1 month ago


I don't really understand why these would be technical limitations. On the contrary, syslog seems to me to be a basic source for this type of application, for two reasons: the protocol is essential in the world of log management, and OpenSearch already supports syslog in its Security Analytics plugin (https://opensearch.org/docs/latest/security-analytics/log-types-reference/linux/). So why not integrate it here as well?

I started looking at this solution because I find OpenSearch very relevant for my needs, and Data Prepper is the recommended application for that. It is a pity that it does not yet support basic, standard inputs like syslog while it does support more AWS-specific sources.

Given these limitations, we cannot really call this solution "scalable".

KarstenSchnitter commented 1 month ago

Please do not feel discouraged by my comment; I just wanted to share my experience working with syslog. Syslog is a widely used protocol that enables a lot of integrations. Because it is such a simple protocol, it is easy to start an implementation, but making it reliable and scalable is much more difficult. That is what I wanted to outline in my earlier comment. If you want to start a syslog source for Data Prepper, I will gladly support you with it.