theangryangel / logstash-output-jdbc

JDBC output for Logstash
MIT License

How to increase the output throughput for inserting the records #158

Open rkhapre opened 5 years ago

rkhapre commented 5 years ago

Currently I am inserting 191,277 records into the database. My source is a CSV file, and I am using 12 workers and a batch size of 6000. The input completes in 2-3 minutes, then the filter stage gets stuck, and even after 1 hour the output count is as below (2158).

How can I increase my output throughput? Any suggestions? I have tried various options such as max_pool_size => 50 and connection_timeout => 100000, but saw no improvement.
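
For context, a minimal sketch of the kind of output configuration being discussed here; the connection string, statement, and field names are placeholders I have assumed, not values taken from this thread:

    output {
      jdbc {
        connection_string => "jdbc:mysql://db-host:3306/mydb"   # placeholder
        username          => "logstash"                         # placeholder
        password          => "secret"                           # placeholder
        statement         => [ "INSERT INTO events (message, received_at) VALUES (?, ?)", "message", "@timestamp" ]
        max_pool_size      => 50       # value tried above
        connection_timeout => 100000   # value tried above
      }
    }

The 12 workers and 6000 batch size are pipeline settings rather than plugin options, e.g. bin/logstash -w 12 -b 6000, or pipeline.workers / pipeline.batch.size in logstash.yml.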

This is what I get from the monitoring API:

"events" : {
    "in" : 191277,
    "filtered" : 68630,
    "out" : 2158,
    "duration_in_millis" : 5406886,
    "queue_push_duration_in_millis" : 79073
  }
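
To put those figures in perspective (my arithmetic based on the numbers above, not something stated in the thread):

    2,158 events out in roughly 1 hour   ≈ 2,158 / 3,600 s ≈ 0.6 events per second
    191,277 events at that rate          ≈ about 88 hours to drain the whole file
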
theangryangel commented 5 years ago

Check your logs to see if you’re getting backed up because of an SQL exception being retried.

Bear in mind your connection timeout should not exceed the SQL server’s timeout.
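
For a concrete point of reference (an assumption on my part, not something from this thread): a stock MySQL server keeps idle connections for wait_timeout = 28,800 seconds by default, but managed or hardened databases often enforce much shorter idle timeouts, and a pooled connection held open longer than that will be dropped by the server, which is exactly the kind of retried SQL exception mentioned above.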

rkhapre commented 5 years ago

Thanks for your reply. I tried this, but I didn't see any warnings, errors, or anything like an SQL exception in the log. It just gets choked at the output level.

For the connection timeout, I kept it lower than the SQL server timeout. That gave a slight improvement, but nothing drastic.

Earlier it was loading in batches of 11 every second; now it is loading in batches of 20, and sometimes 30-35. So that's the improvement I see.

But here I am trying to load in batches of 6000, which is not going through.
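
Two knobs are in play here (a sketch under my own assumptions, not settings confirmed in this thread): the pipeline batch size controls how many events each worker hands to the output per batch, while the plugin also buffers rows itself; if your plugin version exposes a flush_size option (some versions of this plugin document one), it may need raising to match:

    # logstash.yml (pipeline-level settings, matching the figures mentioned above)
    pipeline.workers: 12
    pipeline.batch.size: 6000

    # jdbc output block (assumption: flush_size is available in your plugin version)
    output {
      jdbc {
        flush_size => 6000   # plugin-side buffer; defaults to a smaller value if unset
        # ... connection settings as in the earlier sketch ...
      }
    }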

Do you have any benchmarks for uploading a huge dataset, or some numbers on how much time it takes to flush the output data?

If benchmarks on performance are available, that will give me the confidence to invest more time in coming up with the right configuration for this plugin.

Thanks

theangryangel commented 5 years ago

I’ve found performance varies between drivers, Java versions, the work being done, the Logstash version, etc.

I have no benchmark numbers available specifically for MySQL, but I can tell you anecdotally that I’ve had it running for months at a time, reading input from files and syslog and outputting to MSSQL, dealing with millions of rows with no backlog. I have also done large batch jobs of many tens of thousands of rows to backfill data, again with no backlog. But that is MSSQL, and I know my environment and have tweaked it accordingly. I have not tested it with Logstash 6.2 or newer, as I am in the process of retiring the system this plugin was originally written for.

rkhapre commented 5 years ago

I am using 6.8. Do you mean you will not be continuing to maintain this plugin going forward? That's very sad. I think there is a possibility that it is slow because of the 6.8 version.

theangryangel commented 5 years ago

The intention is to maintain the plugin as best I can, but I cannot keep up with Elastic's release schedule. I will have the system I'm using it for in place for at least another year, but it's marked as legacy. It's not something I can dedicate enough time to in the evenings or at weekends at the moment.

I'm happy for others to step in as co-maintainers. Plenty of people fork the repo, and some add features, but very few send any PRs. The majority of the work is helping people with their specific setups via email, and very often it's simply configuration mistakes (which could probably be solved by better documentation and tests, but again I'm finding it hard to make time for that).

oak1998 commented 5 years ago

I just imported 500,000 rows from one MySQL instance to another, which took around 55 minutes. That seems not bad compared with the result @rkhapre reported. I am using the default performance settings, my Logstash version is also 6.8, and the MySQL version is 8.0. I recommend you check the JDBC driver to make sure its version matches your MySQL instance.
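
If you do need to pin the driver rather than rely on autoloading, the plugin can be pointed at a specific jar; the path below is a placeholder and the class name assumes MySQL Connector/J 8.x:

    output {
      jdbc {
        driver_jar_path   => "/opt/logstash/drivers/mysql-connector-java-8.0.xx.jar"  # placeholder path
        driver_class      => "com.mysql.cj.jdbc.Driver"                               # Connector/J 8.x driver class
        connection_string => "jdbc:mysql://db-host:3306/mydb"
        # statement, credentials, pool settings as in the earlier sketches
      }
    }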

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.