scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

nodetool info should be more tolerant to API calls failures (fails with 404 on http://localhost:10000/storage_service/rpc_server endpoint - Thrift related) #19923

Open enaydanov opened 3 months ago

enaydanov commented 3 months ago

Calling nodetool info during a node bootstrap can fail with something like:

Command: "/usr/bin/nodetool -u cassandra -pw 'cassandra'  info "
Exit code: 4
Stdout:
ID                     : 9b1a71fe-9fe1-4713-85bf-fac922a6c792
Gossip active          : true
Stderr:
error executing GET request to http://localhost:10000/storage_service/rpc_server with parameters {}: remote replied with status code 404 Not Found:
Not found

(from https://argus.scylladb.com/test/4c332b0b-a707-40b2-881f-cf7f37a33b6e/runs?additionalRuns[]=bdfd3c7a-e805-4a56-8eaa-7ddfe8cbc44c)

This is because the handler for the /storage_service/rpc_server endpoint is set up very close to the end of initialization:

https://github.com/scylladb/scylladb/blob/1094c71282d8841ddb8af98ba5ae761d78572b6d/main.cc#L2086-L2100

The nodetool info code has no error handling:

https://github.com/scylladb/scylladb/blob/1094c71282d8841ddb8af98ba5ae761d78572b6d/tools/scylla-nodetool.cc#L916-L922

Exposing a 404 here is really user-unfriendly, especially for an endpoint that exists for compatibility reasons.

mykaul commented 3 months ago

What is your expectation wrt 404? (as opposed to other errors, do you feel there's any point in retrying or something?)

enaydanov commented 3 months ago

For example, we can show incomplete info and warn the user about it. We can also tell the user what to do next: run nodetool again, check the logs, etc.

I don't think retries are a good idea from the UX point of view.
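
One possible reading of this suggestion, as a minimal Python sketch rather than the actual nodetool code (which is C++): collect each field separately, print "N/A" for any endpoint that replies 404, and append a warning. `ApiError`, `collect_info`, `render`, the `fetch` callable, the "Thrift active" label, and the `/hypothetical/gossip_active` path are all assumptions made for illustration; only `/storage_service/rpc_server` comes from the report above.

```python
from typing import Callable, List, Tuple


class ApiError(Exception):
    """Raised by the (hypothetical) fetch callable on an error reply."""

    def __init__(self, path: str, status: int):
        super().__init__(f"GET {path} replied with status code {status}")
        self.path = path
        self.status = status


# (label, path) pairs. Only /storage_service/rpc_server appears in the report;
# the other path and both labels are placeholders for illustration.
FIELDS: List[Tuple[str, str]] = [
    ("Gossip active", "/hypothetical/gossip_active"),
    ("Thrift active", "/storage_service/rpc_server"),
]


def collect_info(
    fetch: Callable[[str], object],
    fields: List[Tuple[str, str]] = FIELDS,
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Query every field; tolerate 404 only, re-raise any other failure."""
    rows: List[Tuple[str, str]] = []
    missing: List[str] = []
    for label, path in fields:
        try:
            rows.append((label, str(fetch(path))))
        except ApiError as err:
            if err.status != 404:
                raise  # other errors are still reported as errors
            rows.append((label, "N/A"))
            missing.append(path)
    return rows, missing


def render(rows: List[Tuple[str, str]], missing: List[str]) -> str:
    """Format the rows and, if anything was unavailable, add a warning."""
    lines = [f"{label:<23}: {value}" for label, value in rows]
    if missing:
        lines.append(
            "warning: some endpoints are not available yet ("
            + ", ".join(missing)
            + "); the node may still be initializing, "
            "run nodetool again later or check the logs"
        )
    return "\n".join(lines)
```

Only 404 is tolerated in this sketch, so a genuine server error still surfaces as a failure instead of being hidden.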

mykaul commented 3 months ago

@tchaikov - it seems to fail on Thrift - which I believe is removed already?

mykaul commented 2 months ago

@tchaikov - ping?

tchaikov commented 2 months ago

sorry, i missed this one. i am on it now.

tchaikov commented 2 months ago

> @tchaikov - it seems to fail on Thrift - which I believe is removed already?

the symptom is indeed related to thrift, but the problem is not limited to it. the root cause is that the handler exposing this RESTful API is not up yet when nodetool accesses the web server. i am not sure what the best way to address this UX issue is. what i can do, though, is to minimize the surface that is exposed to this problem.

but please note, the client is not guaranteed to be functional before the server is fully up and running.

denesb commented 2 months ago

We cannot swallow errors just to be more "friendly" to the user. Nodetool cannot be used while ScyllaDB is initializing.

We can add a new REST API endpoint, which is registered very early, and which can be used to poll whether ScyllaDB is ready. When this endpoint returns false, nodetool prints a warning that ScyllaDB is still initializing and the user should try again later.

Note that we have to be careful with this, because some commands might be useful while ScyllaDB is initializing, so we might want to add exceptions to this.
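
A rough Python sketch of the client side of this proposal, for discussion only. The readiness endpoint does not exist yet, so `READY_PATH`, `preflight`, the `fetch` callable, and the contents of the exemption set are all hypothetical names chosen for illustration.

```python
from typing import Callable, FrozenSet, Optional

# Assumption: placeholder path for the proposed early-registered endpoint.
READY_PATH = "/hypothetical/ready"


def preflight(
    command: str,
    fetch: Callable[[str], bool],
    always_allowed: FrozenSet[str] = frozenset(),
) -> Optional[str]:
    """Return None if the command may run, otherwise a warning to print.

    `fetch` is a hypothetical callable that GETs a path and returns the
    decoded boolean reply. Commands in `always_allowed` skip the check,
    covering the case of commands that are useful during initialization.
    """
    if command in always_allowed:
        return None
    if fetch(READY_PATH):
        return None
    return (
        "ScyllaDB is still initializing; "
        f"try `nodetool {command}` again later"
    )
```

With a `fetch` that reports "not ready", `preflight("info", fetch)` returns the warning text, while `preflight("info", fetch, frozenset({"info"}))` returns `None` and lets the command proceed. Which commands belong in the exemption set is left open here, as in the comment above.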

mykaul commented 2 months ago

> We can add a new REST API endpoint, which is registered very early, and which can be used to poll whether ScyllaDB is ready. When this endpoint returns false, nodetool prints a warning that ScyllaDB is still initializing and the user should try again later.
>
> Note that we have to be careful with this, because some commands might be useful while ScyllaDB is initializing, so we might want to add exceptions to this.

Ref https://github.com/scylladb/scylladb/issues/8275