opensearch-project / data-prepper

OpenSearch Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

RSS as a Source #972

Open stockholmux opened 2 years ago

stockholmux commented 2 years ago

Is your feature request related to a problem? Please describe.

I have a data source that I'd like to ingest into OpenSearch via Data Prepper. The source is provided as a service, and its only machine-readable output is an RSS/Atom feed. Data Prepper has no native way to ingest this type of data.

Describe the solution you'd like

I'd like an entry-pipeline source type for RSS. I should be able to supply a URL and a polling frequency, and Data Prepper would fetch the feed from that URL every n seconds. Each <item> in the feed would become a document, and the tags inside it would become fields.
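To illustrate the requested behavior, here is a minimal Python sketch of the item-to-document mapping and the polling loop. It is not Data Prepper code: `items_to_documents`, `poll`, the `fetch` and `handle` callables, and the sample feed are all hypothetical names chosen for illustration, and the sketch assumes a plain RSS 2.0 feed with flat, un-namespaced tags inside each <item>.

```python
import time
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

# Hypothetical sample feed used only for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/1</link>
      <description>Hello</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/2</link>
    </item>
  </channel>
</rss>
"""


def items_to_documents(feed_xml: str) -> List[Dict[str, str]]:
    """Turn each <item> into one document; each child tag becomes a field."""
    root = ET.fromstring(feed_xml)
    documents = []
    for item in root.iter("item"):
        # Assumes flat items: nested or repeated tags are not handled here.
        documents.append({child.tag: (child.text or "").strip() for child in item})
    return documents


def poll(fetch: Callable[[], str], interval_seconds: float,
         handle: Callable[[Dict[str, str]], None], max_polls: int) -> None:
    """Fetch the feed max_polls times, waiting interval_seconds between fetches.

    No de-duplication is done: an item still present in the feed is
    handed to `handle` again on every poll.
    """
    for attempt in range(max_polls):
        for document in items_to_documents(fetch()):
            handle(document)
        if attempt < max_polls - 1:
            time.sleep(interval_seconds)
```

In a real source, `fetch` would perform an HTTP GET against the configured URL and `handle` would write the document to the pipeline buffer. Tracking which items have already been emitted is left out of this sketch.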

Describe alternatives you've considered (Optional)

Additional context

Beyond this specific case, RSS is a surprisingly common output format across a wide variety of tools. Supporting it would let OpenSearch and Data Prepper ingest many kinds of data without any extra coding.

dlvenable commented 2 years ago

This is an interesting feature for Data Prepper. It can be a good starting point for supporting full-text search use-cases within Data Prepper.

I'd like to propose that Data Prepper include this functionality with a limited scope to begin with. Here are some limitations:

@stockholmux, would these limitations work?

sshivanii commented 2 years ago

Use Case

Several customers use RSS (Really Simple Syndication/Rich Site Summary) feeds, which deliver the latest news and events from websites in real time and collect them in one convenient dashboard. These customers want to read RSS feeds and send the items to OpenSearch as documents. Data Prepper can support this full-text search use case by providing an RSS source plugin.

This proposal provides details for supporting RSS as a source for Data Prepper.

Configuration

Here's what a basic pipeline configuration with the RSS source plugin would look like:

source:
  rss:
    url: "https://forum.opensearch.org/latest.rss"
    polling_frequency: PT5M

Configuration Options

| Option | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| url | String | empty | Yes | The RSS feed URL |
| polling_frequency | Duration | 5 mins | No | Frequency to retrieve the RSS feed data |
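The semantics of these two options (a required `url`, and an optional `polling_frequency` that falls back to five minutes) can be sketched in Python as follows. This is an illustrative sketch, not the plugin's implementation: `load_rss_config` and `parse_duration` are hypothetical helpers, and the duration parser assumes only the `PT..H..M..S` subset of ISO 8601 seen in the example (`PT5M`).

```python
import re
from datetime import timedelta

# Assumed default, taken from the options table (5 minutes).
DEFAULT_POLLING_FREQUENCY = "PT5M"

# Only the time part of ISO 8601 durations is handled: PT<h>H<m>M<s>S.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(value: str) -> timedelta:
    """Parse a duration such as 'PT5M' or 'PT1H30M' into a timedelta."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def load_rss_config(raw: dict) -> dict:
    """Validate the rss source options and apply the default frequency."""
    url = raw.get("url")
    if not url:
        raise ValueError("rss source requires a non-empty 'url'")
    frequency = parse_duration(raw.get("polling_frequency", DEFAULT_POLLING_FREQUENCY))
    return {"url": url, "polling_frequency": frequency}
```

For example, `load_rss_config({"url": "https://forum.opensearch.org/latest.rss"})` yields a polling frequency of five minutes, while a configuration without `url` is rejected.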

Out of Scope

Tasks

dlvenable commented 2 years ago

@sshivanii, thank you for this proposal. I think the default for polling_frequency should be larger, on the order of minutes. Maybe 5 minutes (PT5M). Also, I don't think we need to require it, since we have a default value.

sshivanii commented 2 years ago

@dlvenable, considering how often feeds are refreshed, I agree with a longer default of 5 minutes. Good catch; I'll make polling_frequency optional.