opensearch-project / documentation-website

The documentation for OpenSearch, OpenSearch Dashboards, and their associated plugins.
https://opensearch.org/docs

[DOC] Clarification regarding data prepper sinks. #7762

[Open] nateynateynate opened this issue 1 month ago

nateynateynate commented 1 month ago

What do you want to do?

Tell us about your request. Provide a summary of the request.

Someone asked whether Data Prepper can "handle" Apache Avro data and found that the documentation wasn't entirely clear. Avro is listed as a codec for Data Prepper, but the documentation refers to it as "most efficiently being used" in an S3 sink. Could we add a paragraph or so about how it can be used outside of an S3 sink?

Also, the page has some odd formatting that makes it a little hard to skim. See the screenshots below.

Version: List the OpenSearch version to which this issue applies, e.g. 2.14, 2.12--2.14, or all.

2.15

What other resources are available? Provide links to related issues, POCs, steps for testing, etc.

[Screenshots: formatting oddities on the codec documentation page]
hdhalter commented 1 month ago

@dlvenable - Can you please comment on this? Here is the link: https://opensearch.org/docs/latest/data-prepper/common-use-cases/codec-processor-combinations/#avro

dlvenable commented 1 month ago

Regarding the original question, Data Prepper can read Avro from S3 and write Avro to S3.
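
For a concrete picture, a minimal pipeline sketch might look like the following. This is only a sketch, not a tested configuration: the queue URL, bucket, region, and schema are placeholders, and the exact S3 source/sink option names should be checked against the S3 source and sink documentation.

```yaml
# Hypothetical sketch: queue URL, bucket, region, and schema are placeholders.
avro-pipeline:
  source:
    s3:
      notification_type: sqs
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/avro-notifications"
      codec:
        avro:                      # parse incoming S3 objects as Avro records
      aws:
        region: us-east-1
  sink:
    - s3:
        bucket: avro-output-bucket
        object_key:
          path_prefix: events/
        threshold:
          event_count: 10000
        codec:
          avro:
            # Writing Avro requires a schema for the outgoing records.
            schema: >
              {
                "type": "record",
                "name": "Event",
                "fields": [
                  { "name": "message", "type": "string" }
                ]
              }
        aws:
          region: us-east-1
```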

Regarding the documentation, we should revisit this page. The original intention was to clarify when a user should use a codec versus a processor for parsing input data.

I might reword this as:

Apache Avro is an open-source serialization format for record data. When reading Avro data, you should use the avro codec.

dlvenable commented 1 month ago

I also noticed some questionable wording about Parquet. The page currently says:

Apache Parquet is a columnar storage format built for Hadoop. It is most efficient without the use of a codec. Positive results, however, can be achieved when it’s configured with S3 Select.

Perhaps this should say:

Apache Parquet is a columnar storage format built for Hadoop. Pipeline authors can use the parquet codec to read Parquet data directly from the S3 object; this retrieves all data from the Parquet file. Alternatively, you can use S3 Select instead of a codec. In that case, S3 Select parses the Parquet file directly (additional S3 charges apply), which can be more efficient if you are filtering or loading only a subset of the data.
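
To make the difference concrete, the two approaches could be sketched as follows. Again a hypothetical sketch: the queue URLs and the SQL filter are placeholders, and the s3_select option names are assumptions based on the S3 source documentation.

```yaml
# Option 1: parquet codec. Data Prepper downloads and reads the whole object.
parquet-codec-pipeline:
  source:
    s3:
      notification_type: sqs
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/parquet-notifications"
      codec:
        parquet:                   # retrieves all data from the Parquet file
      aws:
        region: us-east-1
  sink:
    - stdout:

# Option 2: S3 Select. S3 parses the Parquet file server side and returns
# only matching rows (additional S3 charges apply).
s3-select-pipeline:
  source:
    s3:
      notification_type: sqs
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/parquet-notifications"
      s3_select:
        expression: "SELECT * FROM s3object s WHERE s.status = 'error'"
        input_serialization: parquet
      aws:
        region: us-east-1
  sink:
    - stdout:
```

With S3 Select, the filtering happens inside S3 before any bytes reach the pipeline, which is where the efficiency gain comes from when you only need a subset of rows.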

hdhalter commented 1 month ago

@nateynateynate - Do you want to take a stab at pushing up the changes?