logstash-plugins / logstash-input-google_pubsub

Logstash input for pulling events from Google Pub/Sub service
Apache License 2.0

Low Events Received Rate #38

Closed: SebScoFr closed this issue 6 years ago

SebScoFr commented 6 years ago

So let me try to provide as much context as possible.

We are using both ES and Logstash version 6.2.4. We have a cluster of 10 nodes with 16GB of memory per node. Every day a process indexes a batch of documents (~5M) into ES by sending them to Logstash (we use the pubsub input and the es output). Our Logstash instance is configured with 16GB of RAM, with -Xms8g -Xmx8g.

logstash.yml has the following:

pipeline.workers: 40
pipeline.batch.size: 5000

Finally the logstash pipeline looks like this:

input {
  google_pubsub {
    project_id => "my-project-id"
    topic => "logstash"
    subscription => "logstash"
    max_messages => 5000
  }
}
filter {
  json {
    source => "message"
  }
  mutate {
    remove_field => ["message", "@version", "@timestamp"]
  }
}
output {
  elasticsearch {
    hosts => ["my-es-host"]
    user => "elastic"
    password => "password"
    index => "my-index"
    document_id => "%{[@metadata][uuid]}"
    action => "update"
    scripted_upsert => true
    script => "import"
    script_type => "indexed"
    script_lang => ""
  }
}

We are experiencing a low ES indexing rate (about 1.5k/s on average). ES CPU usage and heap usage are low (around 30%), so that is not the issue. Same for Logstash. When monitoring Logstash I noticed that the Events Received Rate was the bottleneck (see graphs below).

[Logstash metrics screenshot]

[Elasticsearch indexing rate screenshot]

Any idea why the Events Received Rate is this low? What would be the proper way to increase it here? I originally thought that max_messages was what I needed but it doesn't seem to help.

josephlewis42 commented 6 years ago

Hi @SebScoFr, would you mind listing the version of the plugin you're using?

I suspect it might have to do with this line: https://github.com/logstash-plugins/logstash-input-google_pubsub/blob/master/lib/logstash/inputs/google_pubsub.rb#L269. According to the Javadoc the value would default to the total number of processors, but it's being limited to 1 here.

I'm not entirely sure why they chose the number of processors, because I suspect the work is I/O bound to begin with. If I built you a gem with that parameter made configurable (or even just fixed at ~16), would you be willing to test it out?
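For anyone following along, here is a minimal Java sketch of the knob being discussed, assuming the plugin builds its Subscriber with the standard Google Cloud Pub/Sub Java client. The project and subscription names are taken from the config above, and THREAD_COUNT is a hypothetical value, not something the plugin exposes today:

import com.google.api.gax.core.InstantiatingExecutorProvider;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;

public class SubscriberThreadsSketch {
  public static void main(String[] args) {
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project-id", "logstash");

    // Hypothetical value; the stock plugin effectively pins this to 1.
    int THREAD_COUNT = 16;

    // A real receiver would hand the message to the pipeline before acking.
    MessageReceiver receiver = (message, consumer) -> consumer.ack();

    Subscriber subscriber =
        Subscriber.newBuilder(subscription, receiver)
            .setExecutorProvider(
                InstantiatingExecutorProvider.newBuilder()
                    .setExecutorThreadCount(THREAD_COUNT)
                    .build())
            .build();

    subscriber.startAsync().awaitRunning();
  }
}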

SebScoFr commented 6 years ago

Hi @josephlewis42,

Thanks. That would be great indeed.

josephlewis42 commented 6 years ago

Hi @SebScoFr,

I've built a gem with that value hard-coded to 16 here if you want to give it a go: https://github.com/josephlewis42/personal_codebase/releases/download/logstash-release/logstash-input-google_pubsub-1.2.1.gem

SebScoFr commented 6 years ago

Ok great, I'll give it a go tomorrow morning (I'm in London).

I'm using 1.2.0 btw.

SebScoFr commented 6 years ago

Hi @josephlewis42,

I've just tested your gem, and sadly this is what I see:

[Logstash monitoring screenshot]

Not much difference, unfortunately. Interestingly, we are seeing a spike or two during the process, but they never last. Could you do some testing on your side to see what kind of results you get? Maybe it's something to do with my config.

SebScoFr commented 6 years ago

Hi @josephlewis42

So after further investigation, I actually don't think Logstash or even this plugin is the issue. I configured my pipeline to send the output to stdout:

output {
  stdout {}
}

And the monitoring now looks like this:

[Logstash monitoring screenshot]

The received rate is about 5 times higher, which makes it pretty clear that ES is actually my bottleneck here. I still need to figure out why.

josephlewis42 commented 6 years ago

Hi @SebScoFr,

Too bad it's not on the plugin side; I was hoping we could have a quick fix for you.

I seem to remember reading that upserts are slow on the ES side. Would it be possible to re-create the index every day, use an index alias, and flip the pointer to the most recent index once the job is done? e.g. create an index named my-index-2000-01-02, insert all your data, then change the alias my-index that was pointing at my-index-2000-01-01 to point to my-index-2000-01-02.
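In case it helps, here is a minimal sketch of that alias flip, assuming the example index names above and plain HTTP against the cluster's REST endpoint on the default port 9200. Any HTTP client would do; this one just avoids extra dependencies, and in practice you'd add authentication and error handling:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class AliasFlipSketch {
  public static void main(String[] args) throws Exception {
    // Remove the alias from yesterday's index and add it to today's in one request.
    String body =
        "{ \"actions\": ["
            + "  { \"remove\": { \"index\": \"my-index-2000-01-01\", \"alias\": \"my-index\" } },"
            + "  { \"add\":    { \"index\": \"my-index-2000-01-02\", \"alias\": \"my-index\" } }"
            + "] }";

    HttpURLConnection conn =
        (HttpURLConnection) new URL("http://my-es-host:9200/_aliases").openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream os = conn.getOutputStream()) {
      os.write(body.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("_aliases response: " + conn.getResponseCode());
  }
}

Because both actions go in a single _aliases request, the switch is atomic, so readers of my-index never see an empty or half-filled index.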

If you host on GCP and think it might be something infrastructure-wise, feel free to reach out to me at jlewisiii @ google.com and I might be able to get you in touch with someone.

Cheers!

SebScoFr commented 6 years ago

Hi @josephlewis42

Thanks again for looking into that.

The problem is that for SEO-related reasons we need to keep all the products that we have imported in the past. So if I create a new index every day, the latest index will only contain the products available on that day, and I'll lose the old ones. I guess I don't have much choice but to stick with the current approach. It's not that bad: it takes 1 hour to import about 3M products, which is still acceptable. But it's a shame, as it could be a lot faster if upserts weren't that slow :)

Thanks anyway for the great support. (We are on GCP btw and loving it!)