Closed SebScoFr closed 6 years ago
Hi @SebScoFr, would you mind listing the version of the plugin you're using?
I suspect it might have to do with this line: https://github.com/logstash-plugins/logstash-input-google_pubsub/blob/master/lib/logstash/inputs/google_pubsub.rb#L269. According to the Javadoc the value should default to the total number of processors, but it's being limited to 1 here.
I'm not entirely sure why they chose the number of processors, since I suspect the workload is I/O-bound to begin with. If I built you a gem with that parameter made configurable (or even just fixed at ~16), would you be willing to test it out?
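For context on the I/O-bound point, here's a minimal Python sketch (nothing to do with the plugin's actual Java-client internals; the `time.sleep` calls just stand in for Pub/Sub network round-trips). With an I/O-bound task, 16 threads overlap their waits instead of serializing them, which is why bumping the thread count above 1 should matter:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def pull_batch(i):
    # Simulated network wait for one pull request
    time.sleep(0.05)
    return i

start = time.monotonic()
with ThreadPoolExecutor(max_workers=16) as pool:
    # 16 "pulls" of 50ms each run concurrently
    results = list(pool.map(pull_batch, range(16)))
elapsed = time.monotonic() - start
print(f"16 simulated pulls finished in {elapsed:.2f}s")
```

With 16 workers this completes in roughly the time of a single pull (~0.05s) rather than the ~0.8s it would take serially, which is the behavior I'd hope to see from the consumer with a larger executor pool.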
Hi @josephlewis42,
Thanks. That would be great indeed.
Hi @SebScoFr,
I've built a gem with that value hard-coded to 16 here if you want to give it a go: https://github.com/josephlewis42/personal_codebase/releases/download/logstash-release/logstash-input-google_pubsub-1.2.1.gem
Ok great, I'll give it a go tomorrow morning (I'm in London).
I'm using 1.2.0 btw.
Hi @josephlewis42,
I've just tested your gem, and sadly this is what I see
Not much difference, unfortunately. Interestingly, we do see a spike or two during the process, but they never last. Could you do some testing on your side to see what kind of results you get? Maybe it's something to do with my config somehow.
Hi @josephlewis42
So after further investigation, I actually don't think Logstash or even this plugin is the issue. I configured my pipeline to send the output to stdout
```
output {
  stdout {}
}
```
And the monitoring now looks like this
The received rate is about 5 times higher, which makes it pretty clear that ES is actually my bottleneck here. I still need to figure out why.
Hi @SebScoFr,
Too bad it's not on the plugin side, I was hoping we could have a quick fix for you.
I seem to remember reading that upserts were slow on the ES side. Would it be possible to re-create the index every day, use an index alias, and flip the pointer to the most recent index once the job is done? E.g. create an index named `my-index-2000-01-02`, insert all your data, then change the alias `my-index` that was pointing at `my-index-2000-01-01` to point to `my-index-2000-01-02`.
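For reference, the alias flip is a single atomic call to Elasticsearch's `_aliases` API. A sketch, using the example index names from above:

```json
POST /_aliases
{
  "actions": [
    { "remove": { "index": "my-index-2000-01-01", "alias": "my-index" } },
    { "add":    { "index": "my-index-2000-01-02", "alias": "my-index" } }
  ]
}
```

Because both actions happen in one request, readers querying `my-index` never see a moment where the alias points at nothing.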
If you host on GCP and think it might be something infrastructure-wise, feel free to reach out to me at jlewisiii @ google.com and I might be able to get you in touch with someone.
Cheers!
Hi @josephlewis42
Thanks again for looking into that.
The problem is that for SEO-related reasons we need to keep all the products we have imported in the past. If I create a new index every day, my latest index will only contain the products available on that day, and I'll lose the old ones. I guess I don't have much choice but to stick with the current approach. It's not that bad: it takes about an hour to import roughly 3M products, which is still acceptable. But it's a shame, as it could be a lot faster if upserts weren't so slow :)
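For anyone hitting the same wall: the upserts in question are what the Logstash elasticsearch output issues when configured with the update action and `doc_as_upsert`. Each document becomes an `update` action in a `_bulk` request, roughly like this (the index name, `_id`, and fields here are placeholders, not our actual mapping):

```json
POST /_bulk
{ "update": { "_index": "my-index", "_id": "product-123" } }
{ "doc": { "price": 9.99, "in_stock": true }, "doc_as_upsert": true }
```

Each update has to fetch the existing document before re-indexing it, which is why upserts are noticeably slower than plain index actions.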
Thanks anyway for the great support. (We are on GCP btw and loving it!)
So let me try to provide as much context as possible.
We are using both ES and Logstash version 6.2.4. We have a cluster of 10 nodes, 16GB memory per node. Every day we have a process responsible for indexing a bunch of documents (~5M) into ES by sending them to Logstash (we use the pubsub input / es output). Our Logstash instance is configured with 16GB of RAM, with `-Xms8g -Xmx8g`.
logstash.yml has the following settings:

```yaml
pipeline.workers: 40
pipeline.batch.size: 5000
```
Finally the logstash pipeline looks like this:
We are experiencing a low ES indexing rate (about 1.5k/s on average). ES CPU usage and heap usage are low (around 30%), so that is not the issue; same for Logstash. When monitoring Logstash, I noticed that the Events Received Rate was the bottleneck (see graphs below).
Logstash metrics
Elasticsearch indexing rate
Any idea why the Events Received Rate is this low? What would be the proper way to increase it here? I originally thought `max_messages` was what I needed, but it doesn't seem to help.