opensearch-project / data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0
238 stars 176 forks

Add Syslog Source #4511

Open kclinden opened 1 month ago

kclinden commented 1 month ago

Is your feature request related to a problem? Please describe. I need to be able to collect raw syslog traffic from endpoints such as network devices. Today this requires a log collector in addition to Data Prepper, such as Logstash.

Describe the solution you'd like Add a syslog source plugin similar to the Logstash syslog input plugin: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-syslog.html

Describe alternatives you've considered (Optional) Using Logstash or Fluent Bit instead :(


KarstenSchnitter commented 1 month ago

Thanks for opening this issue. It is related to #2162. Let me share a few thoughts on your request.

The syslog specification can be found in RFC5424. There are two transport mappings: TLS over TCP in RFC5425 and UDP in RFC5426. This separation gives an indication of the required implementations:

  1. There needs to be a TCP source with TLS support, as required by RFC5424. UDP support is only recommended ("SHOULD"), so it can be added later.
  2. There needs to be a syslog processor to parse the message format specified in RFC5424. This can be done with a grok configuration, but some performance optimisations will probably be needed, which might require a separate processor. A separate processor might also be necessary to correctly map the predefined values of certain fields.
  3. The syslog event contains a message (MSG) as payload. DataPrepper should at least support JSON parsing for this message out of the box. Again, this can be part of a pipeline. A full configuration example will probably have several steps.
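
To make points 2 and 3 concrete, here is a minimal Python sketch of the parsing steps, not a proposal for the actual processor. It splits an RFC5424 line into its header fields, maps PRI to facility and severity, and tries to parse MSG as JSON. The function name `parse_rfc5424` and the output keys (`msg_json` and so on) are made up for illustration, and the structured-data handling is deliberately simplistic.

```python
import json
import re

# PRI, VERSION, five header fields, STRUCTURED-DATA and an optional MSG.
RFC5424_LINE = re.compile(
    r"<(?P<pri>\d{1,3})>(?P<version>\d{1,2}) "
    r"(?P<timestamp>\S+) (?P<hostname>\S+) (?P<app_name>\S+) "
    r"(?P<procid>\S+) (?P<msgid>\S+) "
    r"(?P<structured_data>-|(?:\[.*?(?<!\\)\])+)"
    r"(?: (?P<msg>.*))?",
    re.DOTALL,
)

# Severity names in the order of their numeric codes 0..7.
SEVERITY_NAMES = (
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
)

# Fields that may carry the NILVALUE "-".
NILABLE = ("timestamp", "hostname", "app_name", "procid", "msgid", "structured_data")


def parse_rfc5424(line: str) -> dict:
    """Parse one syslog line into a flat dict (hypothetical event shape)."""
    match = RFC5424_LINE.fullmatch(line)
    if match is None:
        raise ValueError("not an RFC 5424 message")
    event = match.groupdict()
    for name in NILABLE:
        if event[name] == "-":
            event[name] = None
    # PRI = facility * 8 + severity
    facility, severity = divmod(int(event.pop("pri")), 8)
    event["facility"] = facility
    event["severity"] = SEVERITY_NAMES[severity]
    # MSG may start with a BOM; keep the raw text and add parsed JSON if any.
    msg = (event["msg"] or "").lstrip("\ufeff")
    try:
        payload = json.loads(msg)
    except json.JSONDecodeError:
        payload = None
    if isinstance(payload, dict):
        event["msg_json"] = payload
    return event
```

In a pipeline these would likely be separate steps (header parsing, value mapping, JSON parsing), which is why a full configuration example would have several processors.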

From my experience with different logging systems within SAP, syslog has several challenges. Correctly handling TCP connections with regard to load balancing and keep-alives is not easy, and the load scenarios are likely to vary widely. There might be single applications that are silent for a long period but keep the TCP connection open. DataPrepper needs to manage its resources well in that scenario. On the other hand, there might be a high-throughput rsyslog process firing hundreds of thousands of messages per second. Here, buffering and throughput are the challenge. Proper backpressure at the TCP ACK level would be a good idea. In summary, a TCP source for DataPrepper should expose the necessary configuration options to tune it for those situations. The TCP input plugin of Logstash is very basic and does not perform particularly well in either of them. If UDP transport is used, the issues of TCP connection state are absent, but so is any backpressure mechanism: an overloaded DataPrepper can only drop logs or crash entirely.
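
As a rough illustration of the backpressure idea, here is a small Python `asyncio` sketch, assuming newline-delimited messages for simplicity (RFC5425 itself uses octet counting). The names `pump_lines` and `serve` are hypothetical. The point is only that a bounded buffer makes the reader stop pulling data from the socket, so the kernel receive buffer fills and the TCP window throttles the sender instead of the receiver dropping messages.

```python
import asyncio


async def pump_lines(reader: asyncio.StreamReader, queue: asyncio.Queue) -> None:
    """Copy newline-delimited messages from one connection into a bounded queue."""
    while line := await reader.readline():
        # put() suspends while the queue is full. While suspended we do not
        # read, the StreamReader buffer fills up and asyncio pauses the
        # transport, so the sender is slowed down by TCP flow control.
        await queue.put(line.rstrip(b"\r\n"))


async def serve(host: str, port: int, queue: asyncio.Queue) -> None:
    """Accept TCP connections and feed all of them into the same queue."""

    async def on_connect(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
        try:
            await pump_lines(reader, queue)
        finally:
            writer.close()

    server = await asyncio.start_server(on_connect, host, port)
    async with server:
        await server.serve_forever()
```

The queue size (for example `asyncio.Queue(maxsize=10_000)`) is the kind of tuning knob mentioned above. Idle connections cost only a suspended coroutine here; a real source would additionally need idle timeouts and connection limits.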

RFC5424 is relatively strict on the format of most of its fields, but not all syslog generators follow it closely. There might be deviations, for example in the date format or in using quotation marks (") to delimit fields that contain spaces. The latter is done by CloudFoundry, for example. Ideally, DataPrepper would be resilient against those particularities. As always, it is possible that parsing fails and messages need to be sent to a DLQ.
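
A sketch of what such resilience could look like, in Python and under stated assumptions: header fields may be wrapped in double quotes, a few timestamp formats are tried in turn, and anything that still fails goes to a DLQ list together with the error. The accepted timestamp formats and the sample deviations are invented for illustration and do not describe what any particular generator actually emits.

```python
import re
from datetime import datetime

PRI_VERSION = re.compile(r"<(\d{1,3})>(\d{1,2}) ")
# One header field: either "quoted with spaces" or a bare token.
FIELD = re.compile(r'(?:"([^"]*)"|(\S+))(?: |$)')
HEADER_FIELDS = ("timestamp", "hostname", "app_name", "procid", "msgid")

# Strict RFC 3339 style first, then an assumed non-standard fallback.
TIMESTAMP_FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%f%z",
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%d %H:%M:%S",
)


def parse_timestamp(text: str):
    if text == "-":  # NILVALUE
        return None
    for fmt in TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised timestamp: {text!r}")


def parse_lenient(line: str) -> dict:
    head = PRI_VERSION.match(line)
    if head is None:
        raise ValueError("missing <PRI>VERSION prefix")
    event = {"pri": int(head.group(1))}
    pos = head.end()
    for name in HEADER_FIELDS:
        field = FIELD.match(line, pos)
        if field is None:
            raise ValueError(f"missing header field: {name}")
        quoted, bare = field.groups()
        event[name] = quoted if quoted is not None else bare
        pos = field.end()
    event["timestamp"] = parse_timestamp(event["timestamp"])
    event["rest"] = line[pos:]  # structured data and MSG, left unparsed here
    return event


def process(lines):
    """Return (parsed events, DLQ entries) for a batch of raw lines."""
    events, dlq = [], []
    for line in lines:
        try:
            events.append(parse_lenient(line))
        except ValueError as error:
            dlq.append({"raw": line, "error": str(error)})
    return events, dlq
```

Keeping the raw line and the failure reason in the DLQ entry makes it possible to replay messages once the parser has been taught a new deviation.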