logstash-plugins / logstash-input-jdbc

Logstash Plugin for JDBC Inputs
Apache License 2.0
449 stars 187 forks source link

jdbc-pagination mysql extremely slow #307

Open ozhyrenkov opened 5 years ago

ozhyrenkov commented 5 years ago

I have logstash runnig in docker: FROM docker.elastic.co/logstash/logstash-oss:6.2.2 And docker-compose:

version: '3'

services:
 logstash:
    container_name: my_logstash
    build:
      context: .
    restart: always
    volumes:
      - ./config/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
      - ./pipeline:/usr/share/logstash/pipeline:ro
      - ./data:/data
    ports:
      - "5009:5000"
    environment:
      LS_JAVA_OPTS: "-Xmx256m -Xms256m"

There is a pipeline.yml file:

input {
# companies
  jdbc {
    jdbc_driver_library => "/data/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://connection_string"
    jdbc_user => "user"
    jdbc_password => "password"
    statement_filepath  => "/data/query1.sql"
    jdbc_paging_enabled => "true"
    jdbc_page_size => "50000"

  }
# users
  jdbc {
    jdbc_driver_library => "/data/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://connection_string"
    jdbc_user => "user"
    jdbc_password => "password"
    statement_filepath  => "/data/query2.sql"
    last_run_metadata_path => "/data/.last_run"
    jdbc_paging_enabled => "true"
    jdbc_page_size => "50000"

  }

# users
}

output {
    amazon_es {
        hosts => [ "es_instance address.eu-west-1.es.amazonaws.com" ]
        region => "eu-west-1"
        index => "indexname"
        document_type => "user"
        document_id => "%{md5}"
    }
}

Users-input is just a simple query like this:

SELECT
        u.md5_hash                      as md5
        ,u.id                                  as id_users
        ,u.email                            as id_user_emails
FROM raw_users u
WHERE u.updated_at > :sql_last_value

Table raw_users has an indexes by id and updated_at

What jdbc_pagination is doing is, basically transforming query in dozen of such ones:

SELECT * FROM (SELECT
    u.md5_hash          as md5
    ,u.id               as id_users
    ,u.email                as id_user_emails
FROM raw_users u
WHERE u.updated_at > '1970-01-01 00:00:00'
) AS `t1` LIMIT 50000 OFFSET 500000 

But such queries are extremely slow and the higher offset the slower query exeution, and table has ~37 mln rows. Here the reference about slowing the speed, it's because of late row lookups: http://devoluk.com/mysql-limit-offset-performance.html https://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/

ozhyrenkov commented 5 years ago

Probably, solution is to change style of paging to such one, but this probably will work only for simple queries, so maybe additional parameter like _one_tablestatement or something will handle this performance problem?

SELECT * 
FROM   products 
JOIN (SELECT id 
      FROM   products 
      ORDER  BY date 
      LIMIT  0, 10) AS t ON t.id = products.id; 

With this approach it will be faster as it showed in research from first link in first post: chart

yiippee commented 4 years ago

Me too. Have you optimized the problem? Thanks a lot.