polkadot-js / apps

Basic Polkadot/Substrate UI for interacting with a Polkadot and Substrate node. This is the main user-facing application, allowing access to all features available on Substrate chains.
https://dotapps.io
Apache License 2.0

Kapex and other production node endpoint links commented without justification. #9619

Open chrisdcosta opened 1 year ago

chrisdcosta commented 1 year ago

I'm not sure why this is happening, but recently our production endpoint wss://k-ui.kapex.network has been removed from being accessible in the Apps UI.

This appears to be a recent change which automatically removes the endpoint from the list if a script is not able to reach the endpoint. Our logs show no downtime.

This may or may not be a bug, but in principle I think production RPC endpoints should not be removed unless the team requests it, or at least not before attempting to contact the team directly when an endpoint appears to be unresponsive.

We have third party wallet services that have not reported issues connecting to our endpoint.

jacogr commented 1 year ago

See https://github.com/polkadot-js/apps/blob/1e10ba8883a59b9d4172210dba04259985eb85b7/packages/apps-config/src/endpoints/productionRelayPolkadot.ts#L346 (it is commented)

It has been tested as unreachable for 3 runs in a row (it runs twice daily).

So clicking through the link -

TL;DR if an endpoint is detected as unreachable in 3 runs, it gets marked as such with a link to the actual (final) detection run. This is not specific to this endpoint; there are a number that get marked as such as soon as they are detected offline - https://github.com/polkadot-js/apps/pulls?q=is%3Apr+is%3Aclosed+unreachable

chrisdcosta commented 1 year ago

There are many reasons why a node could be unreachable from CI tests. Indeed, automatic Let's Encrypt certificate renewal (e.g. via EFF's Certbot) from within GitHub and GitLab sometimes fails, and not because those services are "unavailable".

What purpose does it serve to comment out the endpoint? Does it get uncommented when it is reachable again? If not, why not?

jacogr commented 1 year ago

It is not "unavailable once, it is out" - if it is unreachable for 3 consecutive tries, it gets removed. So if it only "sometimes" fails, it is all fine. Hence all the issues listing 1st, 2nd and 3rd, which refer to consecutive reports.

Anybody is free to re-enable with a PR if they feel the endpoint is reachable again.

Sadly, my position on this is fixed. It is what it is.

chrisdcosta commented 1 year ago

I should also add that I have looked at the logs; there might have been some heavy traffic, but it does not look like the node was down. Perhaps the response time was slower, but that should not mean the endpoint gets deprecated or commented out.

https://github.com/polkadot-js/apps/issues/9616


https://github.com/polkadot-js/apps/issues/9614

chrisdcosta commented 1 year ago

> It is not "unavailable once, it is out" - if it is unreachable for 3 consecutive tries, it gets removed. So if it "sometimes" fails, it is all fine. Hence all the issues listing 1st, 2nd and 3rd - which refers to consecutive reports.
>
> Anybody is free to re-enable with a PR if they feel the endpoint is reachable again.
>
> Sadly, my position on this is fixed. It is what it is.

This could just as easily be a problem with the CI not having enough bandwidth as anything else - the server was clearly not down!

My point is that you have not answered what purpose this serves in production.

Also, having to manually check whether this endpoint has been deprecated and then issue a pull request is unsustainable for us.

jacogr commented 1 year ago

This is the reality -

  1. Teams stop endpoints and they never remove them here
  2. Some teams never monitor; they release and throw it out there (it may or may not work)
  3. The next we hear about it is that "this UI doesn't work" (cannot connect) and issues get logged

I'm doing the PRs manually (and sorting the issues manually, i.e. tracking consecutive issues), since NOT doing so just leads to a stream of user support.

Things don't "just stop working". And they don't do so over 3 consecutive issues either.

I appreciate that you don't like this - but nothing about this is going to change, as mentioned above.

chrisdcosta commented 1 year ago

Ok I understand you want to keep issues to being related specifically to issue with apps, and not connectivity, but I think there must be a better solution. This feels wrong for the community not just the teams.

For example if we hadn't just made a partnership with Dwellir our parachain on the flagship Polkadot would have simply disappeared - that's not acceptable for any team.

Can you point me to your script and I will try to look at alternatives that address your concerns and ours?

jacogr commented 1 year ago

That is fine - I hear your concerns, and I understand you think you are speaking to a brick wall here, but the auto-check won't go away.

It has always been proven to be correct, due to one of -

  1. Endpoints just being switched off
  2. DNS just being dropped
  3. Connection issues caused by load (there is a connection timeout in the tests)
  4. Connection issues caused by configuration (i.e. if you only allow apps.polkadot.js.org, it will show up since the check comes from a different IP)
  5. (Sadly) Some issues caused by upgrades where RPC has not behaved correctly
  6. There are always transient errors, hence the manual check for 3 consecutive issues: FAIL, FAIL, OK, FAIL (not 3 in a row) won't be marked, but FAIL, FAIL, FAIL (3 in a row) will be
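The "3 in a row" rule above can be sketched as a small check. This is a hypothetical helper for illustration only, not the project's actual CI code; the function name and result representation are assumptions:

```typescript
// Sketch of the "3 consecutive failures" rule: an endpoint is only marked
// unreachable when the most recent runs include 3 FAILs in a row. Any
// intervening OK resets the streak, so FAIL, FAIL, OK, FAIL stays enabled.
type RunResult = 'OK' | 'FAIL';

function shouldComment (history: RunResult[], threshold = 3): boolean {
  let streak = 0;

  for (const result of history) {
    // count the current run of consecutive failures, resetting on success
    streak = result === 'FAIL'
      ? streak + 1
      : 0;
  }

  return streak >= threshold;
}
```

So `shouldComment(['FAIL', 'FAIL', 'OK', 'FAIL'])` leaves the endpoint alone, while `shouldComment(['OK', 'FAIL', 'FAIL', 'FAIL'])` marks it.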

With all that, I am always very happy to look at actual concrete PRs to tighten things up, extra eyes are always helpful.

Here are the actual tests that are being executed -

https://github.com/polkadot-js/apps/blob/master/packages/apps-config/src/ci/chainEndpoints.spec.ts

CI executes this via yarn ci:chainEndpoints which is just an alias to yarn polkadot-dev-run-test --env node packages/apps-config/src/ci/chainEndpoints with logging to file enabled.

jamesbayly commented 1 year ago

@jacogr is there a way that we can give you a contact to notify? E.g. if you get three consecutive fails, then you send an email notification to the contact (or perhaps assign a GitHub issue directly to the person who added the endpoint), and if you don't hear back in 24 hours you take the endpoint out.

The only problem with the current process is that it fails somewhat silently from our perspective, and unless we manually review endpoints each month, we don't know what has been removed. A tight, time-based notification flow would allow us to rectify issues automatically.

jacogr commented 1 year ago

If we add a config for DNS patterns, we can match on those URLs and then add the contact on a per-issue basis.

So the config somewhere like -

// could also make this an array, which is probably preferable for larger providers
'*.api.onfinality.io': 'jamesbayly'

So the outputs can look something like

 x packages/apps-config/src/ci/chainEndpoints.spec.ts
   Bifrost @ wss://bifrost-polkadot.api.onfinality.io/public-ws

     testCodeFailure / ERR_TEST_FAILURE

     No DNS entry
         cc @jamesbayly 

(Without the quotes in that case, so the GH CC logic takes effect)

It is slightly noisier than "just on issue 3", but those are not done automatically as of yet (there is a whole bunch of logic needed for that, and it obviously needs to be consecutive), but at least the above would make a start and alert earlier?
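The proposed per-provider contact lookup could be sketched along these lines. This is an illustrative assumption, not existing apps-config code; the config shape follows the '*.api.onfinality.io' example above, with the value as an array as suggested for larger providers:

```typescript
// Hypothetical contact config: a wildcard host pattern maps to one or more
// GitHub handles to cc on a failure report.
const contacts: Record<string, string[]> = {
  '*.api.onfinality.io': ['jamesbayly']
};

// Resolve the handles to cc for a given endpoint URL by matching its
// hostname against each pattern ('*.suffix' matches any subdomain).
function contactsFor (endpoint: string): string[] {
  const host = new URL(endpoint).hostname;

  for (const [pattern, handles] of Object.entries(contacts)) {
    const isMatch = pattern.startsWith('*.')
      ? host.endsWith(pattern.slice(1)) // suffix match, e.g. '.api.onfinality.io'
      : host === pattern;

    if (isMatch) {
      return handles;
    }
  }

  return [];
}
```

With this, the failing Bifrost endpoint from the sample output would resolve to `['jamesbayly']`, while an endpoint with no configured pattern resolves to an empty list and produces no cc line.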