chrisdcosta opened 1 year ago

I'm not sure why this is happening, but recently our production endpoint wss://k-ui.kapex.network has been removed from being accessible in the Apps UI. This appears to be a recent change which removes the endpoint from the list automatically if a script is not able to reach the endpoint. Our logs show no downtime.

This may or may not be a bug, but I think that, in principle, production RPC endpoints should not be removed unless the teams request it, or at least until the teams have been contacted directly in the event that an endpoint appears to be unresponsive. We have third-party wallet services that have not reported any issues connecting to our endpoint.
See https://github.com/polkadot-js/apps/blob/1e10ba8883a59b9d4172210dba04259985eb85b7/packages/apps-config/src/endpoints/productionRelayPolkadot.ts#L346 (it is commented out). It has been tested as unreachable for 3 runs in a row (the check runs twice daily).

Clicking through the link: TL;DR if an endpoint is detected as unreachable in 3 consecutive runs, it gets marked as such, with a link to the actual (final) detection run. This is not specific to this endpoint; there are a number that get marked as such as soon as they are detected offline - https://github.com/polkadot-js/apps/pulls?q=is%3Apr+is%3Aclosed+unreachable
There are many reasons why a node could be unreachable from CI tests. Indeed, EFF's Let's Encrypt auto certificate renewal from within GitHub and GitLab sometimes fails, and not because those services are "unavailable".

What purpose does it serve to comment out the endpoint? Does it get uncommented when it is reachable again? If not, why not?
It is not a "unavailable once, it is out" - if it is unreachable for 3 consecutive tries, it gets removed. So if it "sometimes" fails, it is all fine. Hence all the issues listing 1st
, 2nd
and 3rd
- which refers to consecutive reports.
Anybody is free to re-enable with a PR if they feel the endpoint is reachable again.
Sadly, my position on this is fixed. It is what it is.
I should also add that I have looked at the logs: there might have been some heavy traffic, but again, it does not look like the node was down. Perhaps the response time was slower, but that should not mean the endpoint gets deprecated or commented out.
It is not a "unavailable once, it is out" - if it is unreachable for 3 consecutive tries, it gets removed. So if it "sometimes" fails, it is all fine. Hence all the issues listing
1st
,2nd
and3rd
- which refers to consecutive reports.Anybody is free to re-enable with a PR if they feel the endpoint is reachable again.
Sadly, my position on this is fixed. It is what it is.
This could just as easily be a problem with the CI not having enough bandwidth as anything else - the server was clearly not down!

My point is that you have not answered what purpose this serves you in production. Also, having to manually check whether this endpoint has been deprecated and then issue a pull request is unmaintainable for us.
This is the reality - I'm doing the PRs manually (and sorting the issues manually, i.e. tracking consecutive issues), since NOT doing so just leads to a stream of user support. Things don't "just stop working", and they don't do so over 3 consecutive issues either.

I appreciate that you don't like this - but nothing about this is going to change, as mentioned above.
OK, I understand you want to keep issues related specifically to problems with apps, and not connectivity, but I think there must be a better solution. This feels wrong for the community, not just the teams. For example, if we hadn't just made a partnership with Dwellir, our parachain on the flagship Polkadot would have simply disappeared - that's not acceptable for any team.

Can you point me to your script and I will try to look at alternatives that address your concerns and ours?
That is fine; I hear your concerns, and I understand you think you are speaking to a brick wall here, but the auto-check won't go away. It has always been proven to be correct: FAIL, FAIL, OK, FAIL (not 3 in a row) won't be marked, but FAIL, FAIL, FAIL (3 in a row) would be.

With all that, I am always very happy to look at actual concrete PRs to tighten things up; extra eyes are always helpful.
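To make that rule concrete, here is a minimal sketch (illustrative only, not the project's actual code; all names are hypothetical) of the three-consecutive-failures check described above:

```ts
// An endpoint is only flagged once it has failed three runs in a row;
// any successful run resets the streak.
type RunResult = 'OK' | 'FAIL';

function shouldFlag (history: RunResult[], threshold = 3): boolean {
  let streak = 0;

  for (const result of history) {
    streak = result === 'FAIL' ? streak + 1 : 0;

    if (streak >= threshold) {
      return true;
    }
  }

  return false;
}

console.log(shouldFlag(['FAIL', 'FAIL', 'OK', 'FAIL'])); // false - the streak was broken
console.log(shouldFlag(['FAIL', 'FAIL', 'FAIL'])); // true - 3 in a row
```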
Here are the actual tests that are being executed - https://github.com/polkadot-js/apps/blob/master/packages/apps-config/src/ci/chainEndpoints.spec.ts

CI executes this via yarn ci:chainEndpoints, which is just an alias for yarn polkadot-dev-run-test --env node packages/apps-config/src/ci/chainEndpoints, with logging to file enabled.
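For reference, a rough sketch of what such a reachability probe could look like (an assumption about the mechanism, not the actual contents of chainEndpoints.spec.ts): open a WebSocket to each endpoint and treat an error or timeout as a failed run.

```ts
// Hypothetical probe using the 'ws' package; not the actual spec code.
import WebSocket from 'ws';

function checkEndpoint (url: string, timeoutMs = 30_000): Promise<void> {
  return new Promise((resolve, reject) => {
    const ws = new WebSocket(url);
    const timer = setTimeout(() => {
      ws.terminate();
      reject(new Error(`${url}: timed out after ${timeoutMs}ms`));
    }, timeoutMs);

    ws.once('open', () => {
      clearTimeout(timer);
      ws.close();
      resolve();
    });
    ws.once('error', (error) => {
      clearTimeout(timer);
      reject(error);
    });
  });
}

// e.g. await checkEndpoint('wss://k-ui.kapex.network');
```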
@jacogr is there a way that we can give you a contact to notify? E.g. if you get three consecutive fails, then you send an email notification to the contact (or perhaps assign a GitHub issue directly to the person that added the endpoint), and if you don't hear back in 24 hours you take the endpoint out.

The only problem with the current process is that it fails somewhat silently from our perspective; unless we manually review endpoints each month, we don't know what has or has not been removed. A tight, time-based notification flow would allow us to rectify things automatically.
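Purely as an illustration of that notification step (the function, issue text and workflow below are assumed; nothing like this exists in the repo today), assigning a GitHub issue to a registered contact could look like:

```ts
// Hypothetical: open an issue and assign the endpoint's registered
// contact after the third consecutive failed run.
import { Octokit } from '@octokit/rest';

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

async function notifyContact (endpoint: string, contact: string): Promise<void> {
  await octokit.rest.issues.create({
    assignees: [contact],
    body: `${endpoint} has been unreachable for 3 consecutive CI runs. cc @${contact}`,
    owner: 'polkadot-js',
    repo: 'apps',
    title: `Unreachable endpoint: ${endpoint}`
  });
}
```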
If we add a config for DNS, we can match on those URLs and then add the contact on a per-issue basis. So the config would be somewhere like -

// could also make this an array, which is probably preferable for larger providers
'*.api.onfinality.io': 'jamesbayly'

So the output can look something like -
x packages/apps-config/src/ci/chainEndpoints.spec.ts
Bifrost @ wss://bifrost-polkadot.api.onfinality.io/public-ws
testCodeFailure / ERR_TEST_FAILURE
No DNS entry
cc @jamesbayly
(Without the quotes in that case, so the GH CC logic takes effect)
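As a sketch of how that matching could work (a hypothetical helper, not existing code; the config shape is the one sketched above), the wildcard entry could be resolved against an endpoint's hostname like so:

```ts
// Map a wildcard host pattern to the GitHub handles to cc on failures.
const contacts: Record<string, string[]> = {
  '*.api.onfinality.io': ['jamesbayly']
};

function contactsFor (endpoint: string): string[] {
  const host = new URL(endpoint).hostname;

  return Object.entries(contacts)
    .filter(([pattern]) => pattern.startsWith('*.')
      ? host.endsWith(pattern.slice(1)) // '*.a.b' matches 'x.a.b'
      : host === pattern)
    .flatMap(([, handles]) => handles);
}

console.log(contactsFor('wss://bifrost-polkadot.api.onfinality.io/public-ws'));
// ['jamesbayly'] -> rendered as "cc @jamesbayly" in the failure output
```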
It is slightly noisier than "just on issue 3", but those issues are not created automatically as of yet (there is a whole bunch of logic needed for that, and it obviously needs to be consecutive), but at least the above would make a start and alert earlier?