open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[pkg/ottl] Add ParseMySQL converter to handle parsing/normalizing MySQL logs #34610

Open TylerHelmuth opened 1 month ago

TylerHelmuth commented 1 month ago

Component(s)

pkg/ottl

Is your feature request related to a problem? Please describe.

I came across an issue recently that I could not easily solve using OTTL's existing ExtractPatterns and replace_pattern functions.

I was experimenting with ingesting MySQL slow query logs and found that normalization of the query (replacing the parts of the slow query that are unique such as input parameters) was very hard to do with regex.

Consider a slow query log such as

# Time: 140905  6:33:11
# User@Host: dbuser[dbname] @ hostname [1.2.3.4]
# Query_time: 0.116250  Lock_time: 0.000035 Rows_sent: 0  Rows_examined: 20878
SET timestamp=1723472310;
SELECT
    UGI.dept_id,
    UGI.dept_name,
    COUNT(UGI.id) AS totalInnovators,
    SUM(UGI.totalIdeas) AS totalIdeas,
    SUM(UGI.points) AS totalPoints
FROM
    (
        SELECT
            UPI.id,
            UPI.dept_id,
            DI.name as dept_name,
            UPI.points,
            COUNT(UI.idea_id) AS totalIdeas
        FROM
            user_info AS UPI
            INNER JOIN dept_info AS DI ON UPI.dept_id = DI.id
            LEFT JOIN user_idea AS UI ON UPI.id = UI.user_id
        GROUP BY
            UPI.id
    ) AS UGI
GROUP BY
    UGI.dept_id
HAVING
    (
        totalPoints > 0
        AND totalInnovators > 0
    )
ORDER BY
    SUM(UGI.points) / COUNT(UGI.id) DESC
LIMIT
    5

Assuming this string is in the body of a log, it would be nice to extract the time, user, host, query time, lock time, rows sent, rows examined, and a normalized version of the query.

Getting all these parts using the ExtractPatterns converter is pretty straightforward (although it is a long regex). But normalizing the query was very hard. We were able to come up with a regex that was close, but it isn't perfect and is quite complex.

  transform:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - 'merge_maps(attributes, ExtractPatterns(body, "^# Time: (?P<time>[^ ]+)Z\\n# User@Host: (?P<user>[^#]+) @ (?P<client>[^#]+?) (Id:[^#]+)?\\n# Query_time: (?P<query_time>[0-9.]+) *Lock_time: (?P<lock_time>[0-9.]+) *Rows_sent: (?P<rows_sent>[0-9]+) *Rows_examined: (?P<rows_examined>[0-9]+)( *Rows_affected: (?P<rows_affected>[0-9]+))?(?P<normalized_query>(.|\\s)*)"), "upsert")'
          - set(attributes["query"], attributes["normalized_query"])
          - replace_pattern(attributes["normalized_query"], "(=\\s*((\"[^\"]*\")|('[^']*')|([0-9][^,;)\\s]*)))", "= ?")
          - replace_pattern(attributes["normalized_query"], "(([rR][oO][wW])?\\((((\"[^\"]*\")|('[^']*')|([0-9][^,;)\\s]*))[,\\s]*)+)", "(?")

Describe the solution you'd like

Normalizing the query in an efficient, consistent, and guaranteed manner requires a parser that understands SQL.

With a converter that knows how to interpret these logs, we could simplify (and probably speed up) this process. Instead of the several complex regex statements above, we could do something like

  transform:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - merge_maps(attributes, ParseMySQL(body), "upsert")

and the ParseMySQL converter would return a map like:

{
    "time": "140905  6:33:11",
    "user": "dbuser[dbname]",
    "host": "hostname [1.2.3.4]",
    "query_time": "0.116250",
    "lock_time": "0.000035",
    "rows_sent": "0",
    "rows_examined": "20878",
    "normalized_query": "<see the formatted normalized query below>"
}
SET
    timestamp = ?;

SELECT
    UGI.dept_id,
    UGI.dept_name,
    COUNT(UGI.id) AS totalInnovators,
    SUM(UGI.totalIdeas) AS totalIdeas,
    SUM(UGI.points) AS totalPoints
FROM
    (
        SELECT
            UPI.id,
            UPI.dept_id,
            DI.name as dept_name,
            UPI.points,
            COUNT(UI.idea_id) AS totalIdeas
        FROM
            user_info AS UPI
            INNER JOIN dept_info AS DI ON UPI.dept_id = DI.id
            LEFT JOIN user_idea AS UI ON UPI.id = UI.user_id
        GROUP BY
            UPI.id
    ) AS UGI
GROUP BY
    UGI.dept_id
HAVING
    (
        totalPoints > ? # notice this is now a ? instead of a unique value
        AND totalInnovators > ? # notice this is now a ? instead of a unique value
    )
ORDER BY
    SUM(UGI.points) / COUNT(UGI.id) DESC
LIMIT
    ? # notice this is now a ? instead of a unique value

Describe alternatives you've considered

Regex, lots of regex.

Additional context

Honeycomb has a sql normalizer that we use today that we could use in OTTL to do the normalization: https://github.com/honeycombio/mysqltools/tree/main

github-actions[bot] commented 1 month ago

Pinging code owners for pkg/ottl: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley. See Adding Labels via Comments if you do not have permissions to add labels yourself.

kentquirk commented 1 month ago

This is not a problem that can be solved generally by regex, since SQL is both complex and recursively defined, so a tool like this would be helpful.

But we should recognize that MySQL is not the only flavor of SQL. We should name the function generically and probably give it a parameter that specifies the SQL variant (even if only one is supported initially).
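A minimal sketch of that shape, assuming a hypothetical generic name ParseSQL with a variant parameter (none of these identifiers are a final API):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizer abstracts a dialect-specific query normalizer.
type normalizer func(query string) string

// dialects maps a variant name to its normalizer; supporting a new dialect
// later means adding an entry here without changing the converter's signature.
var dialects = map[string]normalizer{
	"mysql": func(q string) string {
		// placeholder: a real implementation would call a MySQL-aware parser
		return strings.TrimSpace(q)
	},
}

// ParseSQL is a hypothetical generic converter: the variant argument selects
// the dialect, even while only one dialect is supported initially.
func ParseSQL(query, variant string) (string, error) {
	n, ok := dialects[strings.ToLower(variant)]
	if !ok {
		return "", fmt.Errorf("unsupported SQL variant %q", variant)
	}
	return n(query), nil
}

func main() {
	q, err := ParseSQL("  SELECT 1  ", "mysql")
	fmt.Println(q, err)
}
```

Keying the dispatch on a string variant keeps the OTTL surface stable while leaving room for "postgres", "sqlite", etc. down the line.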