verida / vault-auth-server

ISC License

Better handling of Ceramic Gateway errors #9

Closed: tahpot closed this issue 3 years ago

tahpot commented 3 years ago

The Ceramic gateway intermittently returns 502 Bad Gateway errors, and one of them crashed the server.

See here:

/data/apps/verida-js/node_modules/@ceramicnetwork/http-client/node_modules/rxjs/dist/cjs/internal/util/reportUnhandledError.js:13
            throw err;
            ^

Error: HTTP request to 'https://ceramic-clay.3boxlabs.com/api/v0/streams/k2t6wyfsu4pg08bfcq1kq7ceunq5392pw4w4mzwkasflbc6u9w4yd5z5hcn1ay?sync=0' failed with status 'Bad Gateway': <html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
</body>
</html>

    at Object.fetchJson (/data/apps/verida-js/node_modules/@ceramicnetwork/common/src/utils/http-utils.ts:19:11)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:94:5)
    at Function._load (/data/apps/verida-js/node_modules/@ceramicnetwork/http-client/src/document.ts:109:23)
    at Document._syncState (/data/apps/verida-js/node_modules/@ceramicnetwork/http-client/src/document.ts:59:19)
[nodemon] app crashed - waiting for file changes before starting...

We need to work out how to handle these errors, and eventually log and track them for monitoring purposes.
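Where a Ceramic call is awaited directly (so its rejection can actually be caught), one option is to retry transient gateway failures with exponential backoff and log each failure. The sketch below assumes the error text format shown in the stack traces (`failed with status 'Bad Gateway'`). `withGatewayRetry` and `isTransientGatewayError` are hypothetical helpers, not part of the Ceramic client, and the attempt count and delays are placeholder values.

```typescript
// Hypothetical helpers, not part of @ceramicnetwork/http-client.

// Status texts observed in this issue's stack traces (assumed transient).
const TRANSIENT_STATUS_TEXT = ["Bad Gateway", "Service Temporarily Unavailable"];

// Returns true when the error message matches the http-client format
// "failed with status '<text>'" for one of the transient statuses above.
function isTransientGatewayError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return TRANSIENT_STATUS_TEXT.some((text) => message.includes(`failed with status '${text}'`));
}

type Sleep = (ms: number) => Promise<void>;
const defaultSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `operation`, retrying only transient gateway errors with exponential
// backoff (baseDelayMs, 2x, 4x, ...). Non-transient errors are rethrown at once.
async function withGatewayRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
  sleep: Sleep = defaultSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (!isTransientGatewayError(err) || attempt === maxAttempts) break;
      // Log each transient failure so it can be tracked for monitoring.
      console.warn(`Ceramic gateway error (attempt ${attempt}/${maxAttempts}), retrying`);
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
  throw lastError;
}
```

Matching on message text is brittle; if the client exposes a numeric HTTP status on its errors, checking that would be more robust.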

We should also consider spinning up our own Ceramic testnet infrastructure.

tahpot commented 3 years ago

I think this is resolved, but will wait and see how it goes in the test environment before closing this issue.

tahpot commented 3 years ago

Not resolved; still occurring.

tahpot commented 3 years ago

The testnet has been updated with this branch to try to catch this error:

https://github.com/verida/vault-auth-server/tree/bug/9-ceramic-gateway-error

tahpot commented 3 years ago

Still not working. A slightly different error crashed it this time (503 Service Temporarily Unavailable):

/data/apps/verida-js/node_modules/@ceramicnetwork/http-client/node_modules/rxjs/dist/cjs/internal/util/reportUnhandledError.js:13
            throw err;
            ^

Error: HTTP request to 'https://ceramic-clay.3boxlabs.com/api/v0/streams/kjzl6cwe1jw1482051achqlcwqlsuhj77y80npr9u3ay3ri5d8k4r5ngmz3pc9x?sync=0' failed with status 'Service Temporarily Unavailable': <html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body>
<center><h1>503 Service Temporarily Unavailable</h1></center>
</body>
</html>

    at Object.fetchJson (/data/apps/verida-js/node_modules/@ceramicnetwork/common/src/utils/http-utils.ts:19:11)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:94:5)
    at Function._load (/data/apps/verida-js/node_modules/@ceramicnetwork/http-client/src/document.ts:109:23)
    at Document._syncState (/data/apps/verida-js/node_modules/@ceramicnetwork/http-client/src/document.ts:59:19)
[nodemon] app crashed - waiting for file changes before starting...

tahpot commented 3 years ago

For now, I will run the service under PM2 so that it restarts automatically.

https://pm2.keymetrics.io/

tahpot commented 3 years ago

PM2 is installed on the testnet server.

An example script has been added to the feature branch: https://github.com/verida/vault-auth-server/blob/bug/9-ceramic-gateway-error/prod.sh

Logs can be checked with:

cat ~/.pm2/logs/vault-auth-server-error.log
cat ~/.pm2/logs/vault-auth-server-out.log

tahpot commented 3 years ago

That still didn't work.

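The server crashed, and PM2 neither detected the crash nor restarted the server.

I suspect the issue is that the server is being run via nodemon, which doesn't actually exit when the app crashes: it logs "app crashed - waiting for file changes before starting..." and keeps running, so PM2 still sees a live process.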

I have created a new Babel build process and a package.json script that runs the server directly with node instead of nodemon.

This is now running on testnet.
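A supervisor such as PM2 can only restart a process that actually exits. The crashes above come from rxjs's `reportUnhandledError`, which rethrows the error from a timer, so Node treats it as an uncaught exception. A minimal sketch of a last-resort handler that logs such errors for monitoring and then exits with a non-zero code; `handleFatalError` is a hypothetical name, and the exit function is passed in so the handler can be exercised without terminating the process.

```typescript
// Function used to terminate the process (normally process.exit). Injected so
// the handler can be tested without actually exiting.
type ExitFn = (code: number) => void;

// Hypothetical last-resort handler for errors nothing else caught.
function handleFatalError(err: unknown, origin: string, exit: ExitFn): void {
  // Prefer the stack trace so the log shows where the error came from.
  const detail = err instanceof Error ? err.stack ?? err.message : String(err);
  console.error(`[fatal] ${origin}: ${detail}`);
  // Exit non-zero so a supervisor such as PM2 registers a crash and restarts.
  exit(1);
}
```

At startup it could be registered with `process.on("uncaughtException", (err, origin) => handleFatalError(err, origin, (code) => process.exit(code)))`. Registering an `uncaughtException` listener overrides Node's default behaviour of printing the error and exiting, which is why the handler must exit explicitly.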

tahpot commented 3 years ago

This didn't work either.

The server crashed, and PM2 still neither detected the crash nor restarted the server.

tahpot commented 3 years ago

The logs showed that contextConfig was undefined after a Ceramic error was caught, and that this caused the crash.

I have applied a fix and deployed it to testnet.
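The issue does not show the fix itself. As a general pattern, a caught Ceramic error should not leave the code continuing with an undefined config; raising a descriptive error at the point of failure avoids a later crash on property access. A minimal sketch, where `ContextConfig`, `ContextConfigError`, `loadContextConfigSafely` and the injected `fetchContextConfig` loader are all hypothetical.

```typescript
// Hypothetical config shape; the real fields are not shown in the issue.
interface ContextConfig {
  name: string;
}

// Descriptive error so callers can report the failure instead of crashing.
class ContextConfigError extends Error {
  constructor(contextName: string, cause: unknown) {
    const detail = cause instanceof Error ? cause.message : String(cause);
    super(`Unable to load context config for "${contextName}": ${detail}`);
    this.name = "ContextConfigError";
    // Keep the subclass prototype when compiled to older JS targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

async function loadContextConfigSafely(
  contextName: string,
  fetchContextConfig: (name: string) => Promise<ContextConfig | undefined>,
): Promise<ContextConfig> {
  let contextConfig: ContextConfig | undefined;
  try {
    contextConfig = await fetchContextConfig(contextName);
  } catch (err) {
    // Wrap the Ceramic error with context instead of swallowing it and
    // carrying on with an undefined config.
    throw new ContextConfigError(contextName, err);
  }
  if (contextConfig === undefined) {
    throw new ContextConfigError(contextName, "no config returned");
  }
  return contextConfig;
}
```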

tahpot commented 3 years ago

There were multiple issues; however, I believe all the errors are now caught and handled with clearer error messages, which prevents the application from crashing.

It appears PM2 doesn't detect when the app actually crashes; however, it's a useful tool, so I'll leave it as the recommended way to run the server.