opensearch-project / data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

Ingest data from ODBC/JDBC datasources as Source #1995

Open ashoktelukuntla opened 1 year ago

ashoktelukuntla commented 1 year ago

Is your feature request related to a problem? Please describe.

A pipeline author wants to read data from a database of their choice. The plugin needs to provide an interface to read/ingest data over JDBC.

Describe the solution you'd like

The interface should also provide the ability to run queries periodically and read the results. Every row read will be converted to a Data Prepper event, with columns mapped to fields in the event. I would envision the JDBC driver libraries being supplied by the pipeline author in the YAML configuration, under a "jdbc_driver_lib" setting. Additionally, to schedule periodic runs, a cron-like expression should be accepted in the YAML.
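The row-to-event mapping described above can be sketched as follows. This is a minimal illustration, not Data Prepper code: `rows_to_events` is a hypothetical helper, and Python's built-in sqlite3 stands in for a JDBC connection.

```python
import sqlite3

def rows_to_events(connection, sql_query, params=()):
    """Run a query and turn each row into an event-like dict whose keys
    are the column names, mirroring the proposed columns-to-fields mapping.
    Hypothetical helper; sqlite3 stands in for JDBC here."""
    cursor = connection.execute(sql_query, params)
    columns = [desc[0] for desc in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

# Demo with an in-memory table standing in for the EMPLOYEES example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, last_name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(1, "Smith"), (2, "Jones")])
events = rows_to_events(
    conn,
    "SELECT employee_id FROM employees WHERE last_name = ?",
    ("Smith",),
)
print(events)  # [{'employee_id': 1}]
```

A periodic run would simply re-invoke this conversion on the schedule given by the cron-like expression.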

The plugin should be able to support SigV4 and accept an AWS credentials provider, region, and security parameters in the YAML configuration, including trust store and key store settings: trustStoreLocation, trustStoreType, trustStorePassword, keyStoreLocation, keyStoreType, keyStorePassword.
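Hypothetically, those security settings might appear in the pipeline YAML like this (key names follow the list above; the paths and placeholder values are illustrative, not a finalized schema):

```yaml
source:
    - jdbc:
          awsCredentialsProvider: "com.amazonaws.auth.DefaultAWSCredentialsProviderChain"
          region: "us-east-1"
          trustStoreLocation: "/etc/pki/truststore.jks"
          trustStoreType: "JKS"
          trustStorePassword: "${TRUSTSTORE_PASSWORD}"
          keyStoreLocation: "/etc/pki/keystore.jks"
          keyStoreType: "JKS"
          keyStorePassword: "${KEYSTORE_PASSWORD}"
```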

The plugin should include support for multi-node worker partitioning.
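One way to think about multi-node worker partitioning is deterministic hash-based ownership: every node hashes each partition key to exactly one worker, so all nodes agree on the assignment without coordination. The sketch below is illustrative only and not part of any Data Prepper API.

```python
import zlib

def assign_partitions(partitions, worker_count):
    """Map each partition key to one worker via a stable hash (CRC32),
    so every node running the same pipeline computes the same plan.
    Illustrative sketch; names are not a Data Prepper API."""
    assignment = {w: [] for w in range(worker_count)}
    for key in partitions:
        owner = zlib.crc32(key.encode("utf-8")) % worker_count
        assignment[owner].append(key)
    return assignment

tables = ["employees", "departments", "salaries", "locations"]
plan = assign_partitions(tables, worker_count=2)
# Every partition is owned by exactly one worker.
assert sum(len(keys) for keys in plan.values()) == len(tables)
```

A real implementation would likely also handle worker membership changes, which a plain modulo scheme does not.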

source:
    - jdbc:
          jdbc_driver_lib: "jdbc-oracle.jar"
          jdbc_driver: "oracle.jdbc.driver.OracleDriver"
          jdbc_connection_string: "jdbc:oracle://127.0.0.1:8080"
          jdbc_user: "user"
          jdbc_schedule: "* * * 3 *"
          sql_query: "SELECT EMPLOYEE_ID FROM EMPLOYEES WHERE LAST_NAME = :LAST_NAME"
          fetchSize: " "
          awsCredentialsProvider: "com.amazonaws.opensearch.sql.jdbc.shadow.com.amazonaws.auth.AWSCredentialsProvider"

Additional context

https://github.com/opensearch-project/sql-jdbc

sharraj commented 1 year ago

We should also add support for an elaborate partitioning strategy for multi-node parallel worker deployment.
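One concrete candidate for such a strategy is range-based partitioning: split a numeric primary-key range into fixed-size chunks, where each chunk becomes one bounded SQL query that a worker can claim independently. A minimal sketch, with illustrative names not tied to Data Prepper:

```python
def split_key_range(min_id, max_id, chunk_size):
    """Split an inclusive numeric key range into fixed-size chunks.
    Each (lo, hi) pair would back one bounded query, e.g.
    WHERE id BETWEEN lo AND hi. Illustrative sketch only."""
    chunks = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + chunk_size - 1, max_id)
        chunks.append((lo, hi))
        lo = hi + 1
    return chunks

print(split_key_range(1, 10, 4))  # [(1, 4), (5, 8), (9, 10)]
```

Chunks like these could then be distributed across the parallel workers, with smaller chunks giving finer load balancing at the cost of more queries.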