Open stockholmux opened 2 years ago
This is an interesting feature for Data Prepper. It can be a good starting point for supporting full-text search use-cases within Data Prepper.
I'd like to propose that Data Prepper include this functionality with some limited scope to begin. Here are some limitations:
`aggregate` processor and de-duplication capability to support multiple nodes. All the nodes would read the same items and then the `aggregate` processor de-duplicates them.

@stockholmux, would these limitations work?
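To make the multi-node idea concrete, here is a minimal sketch of de-duplicating feed items that several nodes have read independently. It is an illustration in Python, not how the `aggregate` processor is implemented. The helper names (`item_key`, `deduplicate`) and the choice of `guid` or `link` as the identity of an item are assumptions.

```python
from typing import Iterable, Iterator


def item_key(item: dict) -> str:
    """Pick a stable identity for a feed item: prefer guid, fall back to link."""
    return item.get("guid") or item.get("link") or ""


def deduplicate(items: Iterable[dict], seen: set) -> Iterator[dict]:
    """Yield items whose key is not yet in `seen`, recording new keys as we go.

    Items without any usable key are passed through unchanged, because there
    is nothing to compare them against.
    """
    for item in items:
        key = item_key(item)
        if key and key in seen:
            continue  # another node already delivered this item
        if key:
            seen.add(key)
        yield item


# Two hypothetical nodes read overlapping items from the same feed.
seen_keys = set()
from_node_a = [{"guid": "1", "title": "a"}, {"guid": "2", "title": "b"}]
from_node_b = [{"guid": "2", "title": "b"}, {"link": "https://example.com/3", "title": "c"}]
unique_items = list(deduplicate(from_node_a + from_node_b, seen_keys))
```

In a real deployment the set of seen keys would have to be shared between nodes or bounded in size; this sketch keeps it in memory only.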
Several customers use RSS (Really Simple Syndication, also expanded as Rich Site Summary) feeds, which publish the latest news and events from websites as they appear so that readers can collect them in one convenient dashboard. These customers want to read the RSS feeds and send them to OpenSearch as documents. Data Prepper can support this full-text search use case by providing an RSS source plugin.
This proposal provides details for supporting RSS as a source for Data Prepper.
Here's what a basic pipeline configuration with RSS Source plugin would look like:
source:
  rss:
    url: "https://forum.opensearch.org/latest.rss"
    polling_frequency: PT5M
| Option | Type | Default | Required | Description |
|---|---|---|---|---|
| url | String | empty | Yes | The RSS feed URL |
| polling_frequency | Duration | 5 mins | No | Frequency to retrieve the RSS feed data |
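As a rough illustration of how the two options could be validated, the sketch below requires `url` and falls back to 5 minutes when `polling_frequency` is absent. It is written in Python for brevity and is not the plugin's actual code. The function names are hypothetical, and the duration parser handles only the hours, minutes, and seconds subset of ISO 8601 (for example `PT5M`).

```python
import re
from datetime import timedelta

# Default taken from the options table: 5 minutes.
DEFAULT_POLLING_FREQUENCY = timedelta(minutes=5)

# Subset of ISO 8601 durations: PT[nH][nM][nS], e.g. PT5M or PT1H30M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(text: str) -> timedelta:
    """Parse durations such as 'PT5M'; reject anything outside the subset."""
    match = _DURATION.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def load_rss_options(options: dict) -> dict:
    """Validate the `rss` source options: url is required, frequency optional."""
    url = options.get("url")
    if not url:
        raise ValueError("url is required")
    raw = options.get("polling_frequency")
    frequency = parse_duration(raw) if raw else DEFAULT_POLLING_FREQUENCY
    return {"url": url, "polling_frequency": frequency}


explicit = load_rss_options(
    {"url": "https://forum.opensearch.org/latest.rss", "polling_frequency": "PT5M"}
)
defaulted = load_rss_options({"url": "https://forum.opensearch.org/latest.rss"})
```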
@sshivanii, thank you for this proposal. I think the default for `polling_frequency` should be larger. I'd say on the order of minutes. Maybe 5 minutes: `PT5M`. Also, I don't think we need to require it since we have a default value.
@dlvenable Considering how often the feeds will get refreshed, I agree with keeping it longer, at 5 minutes. Good catch, I'll make `polling_frequency` optional.
**Is your feature request related to a problem? Please describe.**
I have a source of data I'd like to ingest into OpenSearch via Data Prepper. This data source is provided as a service and my only machine-readable output is an RSS/Atom feed. Data Prepper has no way to natively ingest this type of data.

**Describe the solution you'd like**
I'd like an `entry-pipeline` `source` type for RSS. I should be able to supply a URL and a polling frequency, and Data Prepper will grab the data from the RSS URL every n seconds. Each `<item>` in the feed would be a document and the tags inside would be fields.

**Describe alternatives you've considered (Optional)**

**Additional context**
Aside from this specific context, RSS is a surprisingly common output format for a rich variety of different types of tools. This would really allow OpenSearch + Data Prepper to ingest a large variety of different data without any extra coding.
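The requested mapping, where each `<item>` becomes a document and its child tags become fields, might look roughly like the sketch below. It uses Python's standard `xml.etree.ElementTree` on an in-memory sample rather than fetching a URL. The function name and the sample feed are made up for illustration, and Atom feeds (which use `<entry>` instead of `<item>`) are not handled.

```python
import xml.etree.ElementTree as ET


def items_to_documents(rss_xml: str) -> list:
    """Turn every <item> in an RSS document into a flat dict of tag -> text."""
    root = ET.fromstring(rss_xml)
    documents = []
    for item in root.iter("item"):
        doc = {}
        for child in item:
            # Namespaced tags arrive as "{namespace-uri}name"; kept as-is here.
            doc[child.tag] = (child.text or "").strip()
        documents.append(doc)
    return documents


# Hypothetical sample feed, used only to exercise the function.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/1</link>
      <guid>1</guid>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/2</link>
      <guid>2</guid>
    </item>
  </channel>
</rss>"""

documents = items_to_documents(SAMPLE_FEED)
```

A polling source would call something like this once per `polling_frequency` interval and hand the resulting documents to the rest of the pipeline.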