logstash-plugins / logstash-filter-jdbc_streaming

A Logstash filter that can enrich events with data from a database
Apache License 2.0

DB Lookup Requirements #2

Closed: acchen97 closed this issue 7 years ago

acchen97 commented 7 years ago

Outlining some requirements to ship v1. For reference purposes, the meta issue is here.

As-is, the plugin executes a SQL query per event and stores the full result set in a new field. This requires a query and a network round trip for each event, which isn't ideal. We should make this more dynamic.

Goals

Enable events to be looked up and enriched with data from relational databases, a common need for users running CMDBs or traditional data warehouses.

Features

When Logstash first starts up, it should automatically query the DB and load the result set into a local cache. For fast retrieval, each row can probably be hashed by its primary or composite key. Since we always know the key for every event, access should be O(1), as sketched below.
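A minimal sketch of the hashing idea in Ruby (hypothetical, not the plugin's actual code; rows, the column names, and the event field names are assumed from the example config below):

# Hypothetical sketch: hash each cached row by its (possibly composite) key.
cache = {}
rows.each do |row|                                  # rows: array of column-name => value hashes from the query
  key = [row["country_code"], row["organization"]]  # primary or composite key columns
  cache[key] ||= row                                 # keep the first row seen for a given key
end

# Per event, enrichment is then a single O(1) hash access instead of a DB round trip:
row = cache[[event.get("code"), event.get("org")]]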

Configuration

filter {
  jdbc {
    # Setup JDBC connection
    jdbc_driver_library => "/path/to/mysql-connector-java-5.1.34-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydatabase"
    jdbc_user => "me"
    jdbc_password => "secret"

    # JDBC query statement; dataset to be cached for enrichment purposes
    statement => "select * from WORLD.COUNTRY"

    # Define the lookup keys used to look up DB rows and enrich Logstash events. This maps the
    # field name(s) in the event to the DB column name(s), essentially performing a "join" on
    # this key (sketched in Ruby below the config).
    # Each lookup should return a single unique result; otherwise the first result found is
    # used for enrichment.
    # Both primary and composite keys are supported. Composite keys simply specify multiple
    # mappings, like below.
    # DB column names referenced must be available in the query result dataset.
    # (Required) - Syntax: "event field name" => "DB column name"
    lookup_keys => {
      "code" => "country_code"
      "org" => "organization"
    }

    # Define one or more DB columns to enrich the Logstash event with.
    # DB column names referenced must be available in the query result dataset.
    # The new Logstash event field name(s) will be the same as the respective column(s) they
    # came from.
    # (Optional) - if undefined, adds all fields (excluding the lookup_keys fields) to the
    # event top level.
    target_fields => ["country_name", "country_details"]

    # Cache settings
    refresh_schedule => "0 * * * *"  # Refresh cache every hour. Cron syntax like the JDBC input scheduler. (Optional) - default = never refresh
    max_cache_size => "1MB"  # (Optional) - need to find a good default for this, maybe 500MB?
  }
}
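To make the lookup_keys and target_fields semantics concrete, here is a hypothetical enrichment step for one event in Ruby (variable names assumed, not the plugin's actual code):

# Hypothetical sketch of one event's enrichment against the local cache.
lookup_keys   = { "code" => "country_code", "org" => "organization" }
target_fields = ["country_name", "country_details"]

key = lookup_keys.keys.map { |field| event.get(field) }  # build the cache key from event field values
if (row = cache[key])
  target_fields.each { |col| event.set(col, row[col]) }  # copy the selected columns onto the event
end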

Failure Handling

If a lookup fails, we can tag the event with "_jdbclookupfailure".
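In plugin terms that could be as simple as the following sketch (event.tag appends to the event's tags field):

event.tag("_jdbclookupfailure") if row.nil?  # mark the event when no cached row matches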

acchen97 commented 7 years ago

Added high-level requirements for this feature; would love your input @wiibaa @untergeek @jordansissel @suyograo

acchen97 commented 7 years ago

Closing in favor of https://github.com/elastic/logstash/issues/6502.

@wiibaa we discussed this today - we'll keep this plugin's current functionality and get it merged in. LRU caching could be a nice feature to add. The requirements specified in this particular issue have been moved to the issue above, which will result in a separate plugin that caches the full query result set locally in LS for faster lookups.