theangryangel / logstash-output-jdbc

JDBC output for Logstash
MIT License

Batch inserts #110

Open rgr78 opened 6 years ago

rgr78 commented 6 years ago

Hi.

I noticed the plugin issues one insert statement per event.

Would it be possible for the plugin to implement batch inserts? It would increase insert throughput and reduce server load considerably for large workloads.

Maybe something like this: https://www.tutorialspoint.com/jdbc/jdbc-batch-processing.htm
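
For reference, the JDBC batching pattern from that link looks roughly like this (connection URL, table, and column names are placeholders for illustration):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class BatchInsert {
    public static void main(String[] args) throws Exception {
        String[] messages = {"event one", "event two", "event three"};
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/logs", "logstash", "secret")) {
            conn.setAutoCommit(false); // send the whole batch in one transaction
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO events (created_at, message) VALUES (?, ?)")) {
                for (String msg : messages) {
                    ps.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
                    ps.setString(2, msg);
                    ps.addBatch(); // queue the statement instead of executing it
                }
                int[] counts = ps.executeBatch(); // one round trip for all rows
                conn.commit();
                System.out.println("Inserted " + counts.length + " rows");
            }
        }
    }
}
```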

theangryangel commented 6 years ago

The original version of the plugin did exactly this (back in the Logstash 1.4 era).

The problem was that if one insert in the batch failed, the entire batch would fail, and there was no way to figure out which member of the batch was responsible. It turned out that a lot of people weren't capable of writing statements that won't fail, or they had inputs which weren't always providing the data they expected. This resulted in a loss of events.

Unless I'm mistaken, there's no way to deal with this other than the way the plugin now functions?

rgr78 commented 6 years ago

Yes, but that depends on the use case. In some use cases it may be important to know which event caused the batch to fail; in others, throughput may be more important. Or maybe errors are already ignored (with something like INSERT IGNORE).

Shouldn't that decision be made by the user? Maybe the batch insert feature can be optional and disabled by default.

Regarding dealing with batch failure and pinpointing the culprit, what about this approach: https://vladmihalcea.com/how-to-find-which-statement-failed-in-a-jdbc-batch-update/
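
For context, the core of the technique in that article is to catch BatchUpdateException and use the length of getUpdateCounts() to locate the failure. A minimal sketch (class and method names here are illustrative, and it assumes auto-commit is off):

```java
import java.sql.BatchUpdateException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchFailureLocator {
    // Executes a prepared batch and reports which statement broke it.
    // Assumes a driver that stops processing at the first failure.
    static void executeAndLocate(Connection conn, PreparedStatement ps) throws SQLException {
        try {
            ps.executeBatch();
            conn.commit();
        } catch (BatchUpdateException e) {
            // Update counts are returned only for the statements that ran
            // before the failure, so the array length is the zero-based
            // index of the offending statement.
            int failedIndex = e.getUpdateCounts().length;
            System.err.println("Batch statement #" + failedIndex
                    + " failed: " + e.getMessage());
            conn.rollback();
        }
    }
}
```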

theangryangel commented 6 years ago

Ah it's starting to come back to me now.

I did start to implement something, and I found that some of the JDBC drivers I was testing with behaved differently when it came to failures: some stopped processing immediately with an error, and some carried on. This made the retry logic a pain in the arse to deal with, and I think I just decided to give up at that point, as retrying was far more important at the time.
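
For reference, that divergence is visible in how a driver populates getUpdateCounts() after a failure: a shortened array means the driver aborted, while a full-length array flags failed statements in place. A hedged sketch of handling both behaviours (names are illustrative):

```java
import java.sql.BatchUpdateException;
import java.sql.Statement;

public class BatchFailureModes {
    // JDBC leaves batch failure behaviour driver-defined: some drivers abort
    // at the first error, others run every statement and mark the failures.
    static void diagnose(BatchUpdateException e, int batchSize) {
        int[] counts = e.getUpdateCounts();
        if (counts.length < batchSize) {
            // Driver stopped at the first failure; nothing after index
            // counts.length was attempted.
            System.err.println("Driver aborted at statement " + counts.length);
        } else {
            // Driver carried on; failed statements are flagged in place.
            for (int i = 0; i < counts.length; i++) {
                if (counts[i] == Statement.EXECUTE_FAILED) {
                    System.err.println("Statement " + i + " failed");
                }
            }
        }
    }
}
```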

In theory there's nothing wrong with implementing batch support as a switch/option. For support purposes I'd prefer to leave it off by default, based on my previous experience. It's unlikely I'll get around to implementing this any time soon, as it's not something I need at work (our rate of events in the JDBC output is really low) and my personal time is limited at the moment.

If you'd like to work on a patch I'll happily accept it, after review. Just let me know if you want to start on it. Otherwise it'll likely be at least a few weeks/months before I get around to looking at it.

rgr78 commented 6 years ago

I see what you mean.

For now, our use case is much slower than it should/could be (about 10M–20M records). But since the total time is still tolerable, we won't start working on it immediately either. If we do, we'll let you know.

Thanks!

theangryangel commented 6 years ago

No problem 👍 I'll keep the issue open, if you don't mind? That way I don't forget about this as a feature to look into in the future.

rgr78 commented 6 years ago

Sure. 👍

phr0gz commented 6 years ago

Hello, just to let you know that I'm also inserting a lot of data into MS SQL, and I have the same issue. :+1:

rkhapre commented 5 years ago

Need batch inserts for sure. +1

theangryangel commented 5 years ago

Pull requests are welcome. It's still unlikely I'm going to implement this any time soon. I have no need for this feature at work, and I cannot justify the time at home as I'm flat out on other things.

ankitgoel154 commented 5 years ago

Is there any way we can increase the pipeline worker count to achieve the same thing until the enhancement is done?

theangryangel commented 5 years ago

Increase the number of Logstash workers (and ensure your connection pool size is the worker count plus one) and it should increase the output rate, as long as your SQL server can keep up. Or at least that's what Logstash used to do; I have not needed to tweak this myself and I have not kept on top of it.
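
As an illustrative sketch of that tuning (the values are arbitrary; pipeline.workers is a standard Logstash setting, and max_pool_size is assumed here to be this plugin's connection pool option):

```
# logstash.yml (illustrative): raise the number of pipeline worker threads
pipeline.workers: 8
```

```
# Pipeline config (illustrative): pool sized to workers + 1, per the advice above
output {
  jdbc {
    connection_string => "jdbc:postgresql://localhost:5432/logs"
    statement => [ "INSERT INTO events (message) VALUES (?)", "message" ]
    max_pool_size => 9
  }
}
```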

ibspoof commented 5 years ago

I had a need for batching using both safe and unsafe statements, with retry support. Changes have been added to my fork at: https://github.com/ibspoof/logstash-output-jdbc/tree/add_batching. If my implementation looks good I can file an MR.

theangryangel commented 5 years ago

Edit: thank you for your work so far @ibspoof

Looks good on paper, but I've not tried it out (I'm away from home and the office today). The only thing that's probably missing is some specs, but if you wish I can probably add some. If you want to have a go, feel free.

Whenever you're ready, file the MR; regardless, I'll try to find some time Monday to have a quick play and test.

ibspoof commented 5 years ago

I can add specs this weekend and submit.

vector4wang commented 5 years ago

Looking forward to batch inserts ~~~

theangryangel commented 5 years ago

The main problem I see with the batch implementation, @ibspoof, is that you're always putting all the batch_events back on the retry list. I was going to merge it straight in until I noticed this :/

Depending on the errors, you're either going to end up duplicating a bunch of events or, if it's the first statement that fails, potentially not inserting any at all. I believe you need to account for this by inspecting getUpdateCounts and looping through getNextException. From what I recall, in the depths of my memory, trying to do this for 1.x I had real problems with JDBC drivers behaving differently, and I just gave up as I didn't need it myself.
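
A hedged sketch of the kind of split that retry logic needs, assuming statements that executed successfully persist (class and method names are illustrative, not the plugin's actual code):

```java
import java.sql.BatchUpdateException;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class BatchRetrySplitter {
    // Returns only the events whose statements actually failed (or were never
    // attempted), so successful rows are not re-queued and duplicated.
    static <E> List<E> eventsToRetry(BatchUpdateException ex, List<E> batch) {
        int[] counts = ex.getUpdateCounts();
        List<E> retry = new ArrayList<>();
        for (int i = 0; i < batch.size(); i++) {
            boolean failed = i >= counts.length               // never attempted (or the failing statement itself)
                    || counts[i] == Statement.EXECUTE_FAILED; // attempted, failed
            if (failed) {
                retry.add(batch.get(i));
            }
        }
        // Walk the exception chain for diagnostics; some drivers attach one
        // SQLException per failed statement.
        SQLException next = ex;
        while ((next = next.getNextException()) != null) {
            System.err.println("Batch member failed: " + next.getMessage());
        }
        return retry;
    }
}
```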

If you don't need/want to worry about the above, are you happy if I take your work and do something with it?