JDBC river failure then can't restart Elasticsearch

loganbhardy commented 9 years ago

Elasticsearch 1.4.4, elasticsearch-river-jdbc-1.4.4.0.jar, sqljdbc41.jar.

Rivers stop running on one of my four nodes. Removing and adding the river again produces a _meta document, but the _status document is missing. Restarting Elasticsearch fails on multiple attempts; I have to stop the process and then start it separately. Once I've done this, everything runs fine for a period of time, days or weeks, but eventually I end up back in this weird state. I can't find anything of note in the logs.

jprante commented 9 years ago

Do you have the JDBC plugin installed on all nodes?

loganbhardy commented 9 years ago

Yes, the plugin is installed on all nodes, along with the SQL jar. Same version everywhere.

loganbhardy commented 9 years ago

I should also add that sometimes a full cluster restart is required to get things back up and running. A rolling restart of all the nodes does not always help.

loganbhardy commented 9 years ago

One last observation: the _state is not being removed when I delete a river. I thought this was addressed in 1.4.4.0.

jprante commented 9 years ago

Yes, you should see a message "river state deleted" in the logs.

loganbhardy commented 9 years ago

So I did a full cluster restart. Then I deleted a one-time river that began on startup.

$ curl -XDELETE localhost:9200/_river/sales-attachment-river-one-time

I browse the _river index in head and confirm that it's gone. But I still get the following back when I look at _state.

$ curl -XGET localhost:9200/_river/jdbc/sales-attachment-river-one-time/_state?pretty
{
  "state" : [ {
    "name" : "sales-attachment-river-one-time",
    "type" : "jdbc",
    "started" : "2015-04-06T21:50:00.688Z",
    "last_active_begin" : "2015-04-06T21:50:14.673Z",
    "last_active_end" : null,
    "map" : {
      "aborted" : false,
      "suspended" : false,
      "counter" : 1
    }
  } ]
}

My understanding is that I should no longer see the _state for that river. Is that correct?
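For anyone who wants to script this check, here is a minimal sketch in Python. It only parses a _state response like the one above and reports whether a river that was supposedly deleted still has an entry. The function name `stale_state_entries` is made up for illustration, and the sketch assumes the response keeps the shape shown in this comment (a top-level "state" list whose items carry a "name").

```python
import json


def stale_state_entries(state_response: str, river_name: str) -> list:
    """Return the entries in a _state response that still name river_name.

    Assumes the response shape shown in this thread:
    {"state": [{"name": ..., "type": ..., ...}, ...]}
    """
    body = json.loads(state_response)
    return [entry for entry in body.get("state", [])
            if entry.get("name") == river_name]


# Response copied from the comment above, after the river was deleted.
SAMPLE = """
{ "state" : [ { "name" : "sales-attachment-river-one-time", "type" : "jdbc",
  "started" : "2015-04-06T21:50:00.688Z",
  "last_active_begin" : "2015-04-06T21:50:14.673Z",
  "last_active_end" : null,
  "map" : { "aborted" : false, "suspended" : false, "counter" : 1 } } ] }
"""

leftovers = stale_state_entries(SAMPLE, "sales-attachment-river-one-time")
if leftovers:
    print("state still present, started", leftovers[0]["started"])
else:
    print("state removed")
```

Fed the output of the curl command above, a non-empty result would mean the state outlived the river definition.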

loganbhardy commented 9 years ago

The weirdness doesn't stop there. I have another one-time river that does not show up when I get the _state.

$ curl -XGET localhost:9200/_river/jdbc/sales-task-river-one-time/_state?pretty
{
  "state" : [ ]
}

When I browse the _river index in head I see both a _status and a _meta document for it.

I try to delete the sales-task-river-one-time river, but it acts like it doesn't exist.

$ curl -XDELETE localhost:9200/_river/sales-task-river-one-time
{"error":"RemoteTransportException[[Smart Alec][inet[/192.168.211.233:9300]][indices:admin/mapping/delete]]; nested: TypeMissingException[[_all] type[[sales-task-river-one-time]] missing: No index has the type.]; ","status":404}

Strangely, I'm unable to GET the _status and _meta documents.

$ curl -XGET localhost:9200/_river/sales-task-river-one-time/_status
{"_index":"_river","_type":"sales-task-river-one-time","_id":"_status","found":false}
$ curl -XGET localhost:9200/_river/sales-task-river-one-time/_meta
{"_index":"_river","_type":"sales-task-river-one-time","_id":"_meta","found":false}

Here's where it gets really weird. I delete the _status document.

$ curl -XDELETE localhost:9200/_river/sales-task-river-one-time/_status
{"found":true,"_index":"_river","_type":"sales-task-river-one-time","_id":"_status","_version":2}

And now I can get the _meta document that wasn't found before.

$ curl -XGET localhost:9200/_river/sales-task-river-one-time/_meta
{"_index":"_river","_type":"sales-task-river-one-time","_id":"_meta","_version":1,"found":true,"_source":{"type":"jdbc","strategy":"simple","schedule":null,......}

I won't post the whole document as it has my SQL password in it.

I've seen this before and the only workaround is to delete all the rivers and do a full cluster restart then reload them again. I'd love any ideas you might have.
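To make the mismatch easier to spot across many rivers, the GET responses can be compared mechanically. The following is only a sketch: the labels ("registered", "stale-state", and so on) and the function names are invented for illustration, and it relies on nothing beyond the "found" flag and the "state" list that appear in the responses quoted in this thread.

```python
import json


def doc_found(get_response: str) -> bool:
    """True if a document GET response reports "found": true."""
    return bool(json.loads(get_response).get("found", False))


def has_state(state_response: str, river_name: str) -> bool:
    """True if the _state response lists an entry named river_name."""
    entries = json.loads(state_response).get("state", [])
    return any(entry.get("name") == river_name for entry in entries)


def classify(meta_response: str, state_response: str, river_name: str) -> str:
    """Label a river by comparing its _meta document with its _state.

    The labels are hypothetical, chosen only for this sketch.
    """
    meta = doc_found(meta_response)
    state = has_state(state_response, river_name)
    if meta and state:
        return "registered"
    if meta:
        return "meta-without-state"
    if state:
        return "stale-state"
    return "absent"


NAME = "sales-task-river-one-time"
EMPTY_STATE = '{ "state" : [ ] }'

# _meta response before the _status document was deleted (from the thread).
META_BEFORE = ('{"_index":"_river","_type":"sales-task-river-one-time",'
               '"_id":"_meta","found":false}')
# _meta response afterwards, with the _source body left out here.
META_AFTER = ('{"_index":"_river","_type":"sales-task-river-one-time",'
              '"_id":"_meta","_version":1,"found":true}')

print(classify(META_BEFORE, EMPTY_STATE, NAME))  # absent
print(classify(META_AFTER, EMPTY_STATE, NAME))   # meta-without-state
```

Note that "absent" here only reflects what GET returns; as described above, head can still show the documents in the _river index.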

loganbhardy commented 9 years ago

Update: my _river index consisted of two shards and around 90 documents, mostly JDBC rivers, though we also use the couch river and a custom in-house river. I deleted the entire index and restarted the cluster. The index was recovered with 6 shards (the default on my cluster) and 8 documents. I can't explain why this happened. Ideas?

Once again, I deleted the _river index and restarted the cluster and this time it stayed deleted. I'm reloading all my rivers and this time I'm letting the cluster automatically create the _river index with the default number of shards. I'll keep you posted on what happens from here.

loganbhardy commented 9 years ago

I believe what happened is that the river plugin was hung, causing two nodes in the cluster to not respond to a restart. So I had to stop and then start those nodes, and when they came back they wrote _state documents back to the cluster.

jprante / elasticsearch-jdbc

JDBC river failure then can't restart Elasticsearch #524