uNetworking / uWebSockets.js

μWebSockets for Node.js back-ends :metal:
Apache License 2.0

Graceful shutdown on uncaught error #465

Closed jprochazk closed 3 years ago

jprochazk commented 3 years ago

I'm trying to use the library in a worker thread, but I can't seem to do a graceful shutdown when the worker dies due to an uncaught exception. I've tried as many ways to handle this case as I could find, so I'm not sure whether this is a mistake on my end or whether I'm missing something.

I'm using Node 15.10.0, and uWS.js 18.14.0 on Win10, here's the code:

const uWS = require('uWebSockets.js');
const { Worker, isMainThread, threadId, parentPort } = require('worker_threads');
const os = require('os');
const port = 9001;
if (isMainThread) {
    const workers = /* os.cpus() */[0].map(() => {
        return new Worker(__filename);
    });
    function cleanup() {
        workers.forEach((worker) => {
            worker.postMessage('cleanup'); // too late
        });
    };
    process.on('uncaughtException', cleanup);
} else {
    let listenSocket;
    const app = uWS.App({}).listen(port, (token) => {
        listenSocket = token;
        if (token) {
            console.info(`${threadId} listening on ${port}`);
            performance.now(); // this will throw
        } else {
            console.error(`${threadId} failed listening on ${port}`);
        }
    });
    parentPort.on('message', (value) => {
        if (value === 'cleanup') {
            shutdown();
            process.exit(0);
        }
    });
    function shutdown() {
        if (listenSocket) {
            uWS.us_listen_socket_close(listenSocket);
            listenSocket = null;
        }
    }
}

The call to performance.now() is not valid because performance must be imported from perf_hooks, so it will throw a ReferenceError. The uncaught exception is handled by cleanup, but at that point the worker has already terminated, so it can't respond to the cleanup message.

Here's the output:

1 listening on 9001
uv loop at [0000007BD97FF798] has open handles:
[000001D2F1E59F90] poll (active)
        Close callback: 0000000000000000
        Data: 0000000000000000
uv loop at [0000007BD97FF798] has 1 open handles in total
C:\Program Files\PowerShell\7\pwsh.exe[5036]: c:\ws\src\debug_utils.cc:322: Assertion `0 && "uv_loop_close() while having open handles"' failed.
 1: 00007FF779D5066F public: static int __cdecl v8::internal::Isolate::ArchiveSpacePerThread(void)+4607
 2: 00007FF779CE4A46 SSL_get_quiet_shutdown+65190
 3: 00007FF779CE4DC1 SSL_get_quiet_shutdown+66081
 4: 00007FF779D50334 public: static int __cdecl v8::internal::Isolate::ArchiveSpacePerThread(void)+3780
 5: 00007FF779C2287D public: void __cdecl v8::internal::wasm::SignatureMap::Freeze(void) __ptr64+35901
 6: 00007FF779C1DCB8 public: void __cdecl v8::internal::wasm::SignatureMap::Freeze(void) __ptr64+16504
 7: 00007FF779D985FD uv_poll_stop+557
 8: 00007FF77ABBE0B0 public: class v8::internal::compiler::Operator const * __ptr64 __cdecl v8::internal::compiler::RepresentationChanger::Uint32OverflowOperatorFor(enum v8::internal::compiler::IrOpcode::Value) __ptr64+398416
 9: 00007FFCA3027034 BaseThreadInitThunk+20
10: 00007FFCA3C1D0D1 RtlUserThreadStart+33

I looked through the issues, and it is similar to #324. Could this be the same problem?

hst-m commented 3 years ago

just to simplify the problem, here is a minimal example file:

const {Worker,isMainThread}=require('worker_threads')
if(isMainThread){
    const worker=new Worker(__filename)
    worker.on('error',e=>console.log('ERROR',e))
    setInterval(()=>console.log('main is running'),1000)
}else{
    require('uWebSockets.js').App()
    //.listen(3001,s=>{})
    asd()
}

That works correctly: the last line throws an error, the main thread catches it and continues running, and the worker exits. But if you uncomment the .listen line, throwing the error after listening causes the "uv loop open handles" process crash. Replacing asd() with process.exit() does the same thing.

hst-m commented 3 years ago

Ok, in the previous issue https://github.com/uNetworking/uWebSockets.js/issues/324 they called uws.us_listen_socket_close(socket) before closing the thread, which prevents the error. So to fix the problem here, you need to add this in the worker thread (not the main thread):

process.on('uncaughtException',e=>{
   uws.us_listen_socket_close(listenSocket)
   wsSet.forEach(ws=>ws.close())
   process.exit(1)
})

See https://github.com/uNetworking/uWebSockets.js/discussions/575 for how to close the websockets.

Can this be fixed so the user does not have to add this in? AddEnvironmentCleanupHook

ghost commented 3 years ago

I don't really see this as an issue - if you leave sockets open and the process exits, you are going to have polls remaining in the event loop. It can be solved by adding us_socket_context_close, which simply nukes all open sockets (listen sockets, sockets, client sockets) instantly. But then you get a lot of close events emitted during shutdown. Maybe that is cleaner, I don't know.

App.close() could call us_socket_context_close on all socket contexts, and that App.close() could be called from JS inside that uncaughtException handler. But really, why even catch uncaughtException? It's clearly an issue in the code and should "crash".

ghost commented 3 years ago

Unhandled exceptions are errors in code. In statically typed languages you can't even compile a program that doesn't catch exceptions (Java, for instance). Sure, C++ defaults to aborting, but even there you have noexcept functions that cannot throw, so the idea of properly handling exceptions is widespread. JavaScript being a script can do anything it pleases at any point in time, so of course there are no strict rules, but it still is a programming flaw.

ghost commented 3 years ago

To make it clear: I like that you have examples that can show the issue. But this is a cosmetic thing in my opinion.

jprochazk commented 3 years ago

I agree that unhandled exceptions are programmer mistakes. In case the worker runs into such an exception, the process is going to exit anyway, because there's probably no good way to recover. It'd be better not to crash on exit, but it's not a deal-breaker by any means.

hst-m commented 3 years ago

App.close() can call us_socket_context_close on all socket contexts and that App.close() can be called from JS inside that uncaughtException

I thought you could add whatever is needed inside AddEnvironmentCleanupHook instead of uncaughtException, closing whatever sockets are open etc

ghost commented 3 years ago

Probably so, yes

RJ4100 commented 3 years ago

App.close() can call us_socket_context_close on all socket contexts and that App.close() can be called from JS inside that uncaughtException

I thought you could add whatever is needed inside AddEnvironmentCleanupHook instead of uncaughtException, closing whatever sockets are open etc

@hst-m I find using cluster instead of worker threads a viable option. In your sample code from #324 only one worker is used by the cluster. When I tried to add two or more, the rest exit with Error: write EPIPE. Do you have an example with a lot of workers?

RJ4100 commented 3 years ago

Catching exceptions inside workers is good practice, but I'd like to learn more about using cluster vs worker threads. @hst-m what do you like about cluster?

hst-m commented 3 years ago

here is the same thing in each:

Cluster:

const cluster=require('cluster')
if(cluster.isMaster){
   const shutdown=sig=>{}
   process.on('SIGINT',shutdown);process.on('SIGTERM',shutdown)
   require('os').cpus().forEach(c=>cluster.fork())
}else{
   let listenSocket,uws=require('uWebSockets.js'),port=3000
   uws.App().get('/',res=>res.end(`worker ${cluster.worker.id}`))
   .listen(port,s=>{listenSocket=s;s&&console.log(`worker ${cluster.worker.id} listening on ${port}`)})

   const shutdown=sig=>{
      uws.us_listen_socket_close(listenSocket)
      cluster.worker.kill()
   }
   process.on('SIGINT',shutdown);process.on('SIGTERM',shutdown)
}

Worker Thread:

const {Worker,isMainThread,threadId,parentPort}=require('worker_threads')
if(isMainThread){
   const shutdown=sig=>workers.forEach(w=>w.postMessage('shutdown'))
   process.on('SIGINT',shutdown);process.on('SIGTERM',shutdown)
   const workers=require('os').cpus().map(c=>new Worker(__filename))
}else{
   let listenSocket,uws=require('uWebSockets.js'),port=3000
   uws.App().get('/',res=>res.end(`thread ${threadId}`))
   .listen(port,s=>{listenSocket=s;s&&console.log(`thread ${threadId} listening on ${port}`)})

   parentPort.on('message',message=>{
      if(message=='shutdown'){
         uws.us_listen_socket_close(listenSocket)
         process.exit(0)
      }
   })
}

RJ4100 commented 3 years ago

@hst-m Thanks. It looks good. I need to understand the differences between cluster and worker threads. I have a feeling cluster is more mature, but I'm not sure.

ghost commented 3 years ago

Cluster is processes; workers are threads. With workers you have only one process, so a crash ends all workers.

hst-m commented 3 years ago

I think the main benefit of threads is that they are lighter to create and can share data between them faster. So maybe if you are sending lots of data between them you should use threads. Just know a crash will crash the whole thing, though you can catch errors in JS to prevent it crashing in either the main or the worker thread.
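As a small illustration of the shared-data point, a SharedArrayBuffer posted to a worker is shared without copying or serialization (shown here in a single thread; the Atomics calls behave the same across workers):

```javascript
// 4 bytes of memory that can be handed to a Worker via postMessage
// without being copied; all threads see the same backing store.
const sab = new SharedArrayBuffer(4);
const counter = new Int32Array(sab);

Atomics.add(counter, 0, 1);            // atomic, safe across threads
console.log(Atomics.load(counter, 0)); // 1
```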

RJ4100 commented 3 years ago

I'm using redis to distribute data between threads and servers. One error keeps happening at night when the server is running under small load. Basically connected clients are sending pings only.

Error in master Error [ERR_UNHANDLED_ERROR]: Unhandled error. ('Invalid access of closed uWS.WebSocket/SSLWebSocket.')
    at Worker.emit (events.js:304:17)
    at Worker.EventEmitter.emit (domain.js:467:12)
    at Worker.[kOnErrorMessage] (internal/worker.js:251:10)
    at Worker.[kOnMessage] (internal/worker.js:261:37)
    at MessagePort.<anonymous> (internal/worker.js:177:57)
    at MessagePort.[nodejs.internal.kHybridDispatch] (internal/event_target.js:354:41)
    at MessagePort.exports.emitMessage (internal/per_context/messageport.js:18:26) {
  code: 'ERR_UNHANDLED_ERROR',
  context: 'Invalid access of closed uWS.WebSocket/SSLWebSocket.'
}
Tue Mar 02 2021 03:19:29 GMT-0500 (GMT-05:00)

RJ4100 commented 3 years ago

In the test environment, this unhandled exception is caught in master. Not sure why the socket closes, as the ping is used to keepAlive the sockets.

uv loop at [0x7000124c0af8] has open handles:
[0x105765960] poll (active)
        Close callback: 0x0
        Data: 0x0
[0x105653350] poll (active)
        Close callback: 0x1055561a0 close_cb_free [.../node_modules/uWebSockets.js/uws_darwin_x64_83.node]
        Data: 0x0
[0x10474d260] poll (active)
        Close callback: 0x0
        Data: 0x0
uv loop at [0x7000124c0af8] has 3 open handles in total
node[42120]: ../src/debug_utils.cc:322:void node::CheckedUvLoopClose(uv_loop_t*): Assertion `0 && "uv_loop_close() while having open handles"' failed.
 1: 0x1012d96a5 node::Abort() (.cold.1) [/usr/local/bin/node]
 2: 0x1000a6309 node::Abort() [/usr/local/bin/node]
 3: 0x1000a6171 node::Assert(node::AssertionInfo const&) [/usr/local/bin/node]
 4: 0x10003117e node::CheckedUvLoopClose(uv_loop_s*) [/usr/local/bin/node]
 5: 0x1001457c6 node::worker::WorkerThreadData::~WorkerThreadData() [/usr/local/bin/node]
 6: 0x1001421a1 node::worker::Worker::Run() [/usr/local/bin/node]
 7: 0x100145952 node::worker::Worker::StartThread(v8::FunctionCallbackInfo<v8::Value> const&)::$_3::__invoke(void*) [/usr/local/bin/node]
 8: 0x7fff6fdf0109 _pthread_start [/usr/lib/system/libsystem_pthread.dylib]
 9: 0x7fff6fdebb8b thread_start [/usr/lib/system/libsystem_pthread.dylib]

RJ4100 commented 3 years ago

@hst-m What is the best way to send data between threads? If all the clients disconnect, I assume the workers would crash, correct? If my assumption is correct, it would be nice to have a dummy client in master that connects to each worker and keeps it alive even when there are no real clients. If a browser is unresponsive it will stop sending pings, and this might close all sockets and create these exceptions.

hst-m commented 3 years ago

What is the best way to send data between threads?

Read all of the worker threads docs page; there are 2-way message channels and new broadcast channels, also SharedArrayBuffer

One error keeps happening 'Invalid access of closed uWS.WebSocket' If all the clients disconnect I assume the workers would crash, correct?

You need to attach an event handler to the ws close event that stops code from trying to access the websocket after it closes. There is no reason for a crash if clients disconnect

RJ4100 commented 3 years ago

@hst-m Thank you. I will read everything first.

As for the crash, right now I have an empty close handler:

).ws('/*', {
    ....
    ,close: (ws, code, message) => {
    }
})

Should I add inside the close handler something like ws.isClosed = true and then check before using ws if ws.isClosed?

hst-m commented 3 years ago

Should I add inside the close handler something like ws.isClosed = true and then check before using ws if ws.isClosed?

You could, but keeping references to the ws object will keep it in memory, wasting resources, when really you should be deleting any references to the ws object and letting it garbage collect. Remove it from whatever list you are using to track ws objects and just not call it anymore. You must be tracking the ws object in some list, so remove it from that list instead of setting ws.isClosed=true. If you need to keep it around temporarily for some reason then that's ok
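A sketch of that tracking pattern, with a plain stub object standing in for a uWS WebSocket (liveSockets, onOpen, onClose, and safeSend are illustrative names, not uWS API):

```javascript
const liveSockets = new Set();

// body of a uWS open handler: start tracking the socket
function onOpen(ws) { liveSockets.add(ws); }

// body of a uWS close handler: drop the reference so it can be GC'd
function onClose(ws) { liveSockets.delete(ws); }

// guard any later use on membership instead of an isClosed flag
function safeSend(ws, msg) {
  if (!liveSockets.has(ws)) return false; // already closed and forgotten
  ws.send(msg);
  return true;
}

// simulation with a stub socket
const ws = { sent: [], send(m) { this.sent.push(m); } };
onOpen(ws);
safeSend(ws, 'a');    // delivered
onClose(ws);
safeSend(ws, 'b');    // dropped, reference already released
console.log(ws.sent); // [ 'a' ]
```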

RJ4100 commented 3 years ago

@hst-m I'm not keeping any references to ws. I use app.publish to send data periodically inside setInterval. I only use ws inside the open handler, when a client connects, to send the initial data. Clients are sending data and app.publish is used again to broadcast right away. I'm assuming the socket closes and app.publish attempts to send data on the closed socket.

hst-m commented 3 years ago

No, app.publish would not throw that Invalid access of closed uWS.WebSocket error, so you must be doing something else. Just above that error log you posted should be the exact code that caused the error; look up that code and see what it is doing. You cut it out of your post

hst-m commented 3 years ago

Maybe catching in the master thread is preventing it from showing the code that caused it; remove the master thread error catch and let it throw in the worker? Your error looks different, like it's caught in master and not showing the error like my test, unless you cut out the code part

RJ4100 commented 3 years ago

Okay, let me try it. I will remove the error handler in the master and try again. The issue happens at night; not sure how to reproduce it.

This is the handler in master: process.on('uncaughtException', (e)=>{ console.log("Error in master", e, new Date().toString())});

Before I added the error handler in master I just got this error in the worker:

uv loop at [0x7000124c0af8] has open handles:
[0x105765960] poll (active)
        Close callback: 0x0
        Data: 0x0
[0x105653350] poll (active)
        Close callback: 0x1055561a0 close_cb_free [.../node_modules/uWebSockets.js/uws_darwin_x64_83.node]
        Data: 0x0
[0x10474d260] poll (active)
        Close callback: 0x0
        Data: 0x0
uv loop at [0x7000124c0af8] has 3 open handles in total
node[42120]: ../src/debug_utils.cc:322:void node::CheckedUvLoopClose(uv_loop_t*): Assertion `0 && "uv_loop_close() while having open handles"' failed.
 1: 0x1012d96a5 node::Abort() (.cold.1) [/usr/local/bin/node]
 2: 0x1000a6309 node::Abort() [/usr/local/bin/node]
 3: 0x1000a6171 node::Assert(node::AssertionInfo const&) [/usr/local/bin/node]
 4: 0x10003117e node::CheckedUvLoopClose(uv_loop_s*) [/usr/local/bin/node]
 5: 0x1001457c6 node::worker::WorkerThreadData::~WorkerThreadData() [/usr/local/bin/node]
 6: 0x1001421a1 node::worker::Worker::Run() [/usr/local/bin/node]
 7: 0x100145952 node::worker::Worker::StartThread(v8::FunctionCallbackInfo<v8::Value> const&)::$_3::__invoke(void*) [/usr/local/bin/node]
 8: 0x7fff6fdf0109 _pthread_start [/usr/lib/system/libsystem_pthread.dylib]
 9: 0x7fff6fdebb8b thread_start [/usr/lib/system/libsystem_pthread.dylib]
zsh: abort node app.js

Before these errors I only see a database error, which is handled in connection.on('error', () => {}). The call to the db is made from inside the ws message handler. It could be that a request to the db never completed and at some point that error is generated; however, I have a try-catch around the db call.

hst-m commented 3 years ago

ok I just set up a similar test to try this. I can't get it to print the exact code location causing the Invalid access error from inside the worker thread, but just know you are doing something invalid with the ws object after it closed. Should be easy to find; post a full mini example here if you need help. It would help if this issue of open sockets/handles were fixed so it doesn't crash all the other threads AddEnvironmentCleanupHook

hst-m commented 3 years ago

You could remove the master thread and run your worker as a single thread and it should show you the exact code throwing the invalid access error

RJ4100 commented 3 years ago

@hst-m Thank you. I will try tonight to find out if the same error occurs. I removed the handler in master; hopefully it will show where it's coming from. If not, I will start digging. It looks like the mysql pool connection is throwing errors and somehow creates the crash. Maybe a callback arrives so late that the ws is closed. Not sure how to create a small test for this as it has a lot of pieces: ws open uses ws, ws message uses the db & ws to fetch and return data and redis publish, and the redis subscriber uses app.publish.

RJ4100 commented 3 years ago

@hst-m Good point. I'll do that, just use one thread and try to find where it is.

RJ4100 commented 3 years ago

@hst-m How do you handle the case where you have a really long callback and data comes back after ws closed?

hst-m commented 3 years ago

I guess that would be a case where ws.isClosed=true in the ws close handler would work, so that you don't call ws.send() or any other ws function when the callback returns
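A runnable sketch of that flag pattern. The handler names mirror uWS's .ws() behavior object, but slowDbCall and the stub ws are stand-ins, not real uWS objects:

```javascript
// Stand-in for a slow database call that may resolve after the socket closed.
const slowDbCall = () => new Promise((resolve) => setTimeout(() => resolve('row'), 10));

const behavior = {
  open(ws) { ws.isClosed = false; },
  async message(ws) {
    const data = await slowDbCall();  // the ws may close while we wait
    if (!ws.isClosed) ws.send(data);  // guard the late send
  },
  close(ws) { ws.isClosed = true; },  // mark it instead of touching ws later
};

// Simulation: close arrives before the db call finishes, so nothing is sent.
(async () => {
  const ws = { sent: [], send(m) { this.sent.push(m); } };
  behavior.open(ws);
  const pending = behavior.message(ws);
  behavior.close(ws);                 // socket closes mid-flight
  await pending;
  console.log(ws.sent);               // []
})();
```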

RJ4100 commented 3 years ago

@hst-m Yeah, I thought this could be the fix. I'm running a bunch of stuff and I guess sometimes a request is dropped. I can definitely hear the cpu running :)

RJ4100 commented 3 years ago

@hst-m I have my hopes up will know tomorrow. Thanks again.

ghost commented 3 years ago

Should I add inside the close handler something like ws.isClosed = true and then check before using ws if ws.isClosed?

Philosophy of JavaScript - don't care about garbage, throw it on the ground until the world collapses.

image

Philosophy of C++ - take care of garbage the moment it becomes garbage to avoid wading knee deep in shit:

image

hst-m commented 3 years ago

Nice pictorial illustrations, but we are talking about a split second between when the async call gets made and when it returns, while the ws object is still in memory; not too much garbage to complain about. Also the ws object might still be needed when the async returns, so there's no point collecting it. Maybe if your C++ actually cleaned up its sockets/handles when the thread exited we wouldn't have this issue open

ghost commented 3 years ago

Just a little more won't do any harm 😂

https://youtu.be/wVnMBGXVVUI

hst-m commented 3 years ago

Here is a picture of your C++ after not cleaning up sockets and handles and crashing every other thread in the program

wtc3b

ghost commented 3 years ago

Hahaha damn boy

hst-m commented 3 years ago

I think my preferred setup right now is to use both Cluster and Worker Threads: I use Cluster to manage and launch different versions of the app, with one process per version running at a time, leaving the old process running until the new process is ready. Then each process can launch worker threads per cpu if needed, and if something crashes I can still have Cluster launch a new process

fringe-chicago-skyline

RJ4100 commented 3 years ago

Impressive pics I love the last one as it represents sophisticated infrastructure. Great setup!

RJ4100 commented 3 years ago

ws.isClosed saved the day!

hst-m commented 3 years ago

ws.message uses db&ws to fetch and return data

ya this part you said the message event does some async then sends back data with ws.send(), so anywhere async like that where the ws could have closed, you would need to set ws.ab=true in the close handler and then check it: !ws.ab&&ws.send(). OR do a try/catch block to catch the ws.send error: try{ws.send()}catch(e){}
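Both guards in one runnable sketch, with a stub that throws the same way uWS does on a closed socket (sendSafely and the stubs are illustrative, not uWS API):

```javascript
function sendSafely(ws, data) {
  if (ws.ab) return false;            // flag set in the close handler
  try {
    ws.send(data);                    // OR just catch the throw itself
    return true;
  } catch (e) {
    return false;                     // uWS throws on a closed socket
  }
}

// stub that behaves like a closed uWS socket
const closedWs = {
  ab: false,
  send() { throw new Error('Invalid access of closed uWS.WebSocket/SSLWebSocket.'); },
};
console.log(sendSafely(closedWs, 'x')); // false (throw was caught)

closedWs.ab = true;                     // as set in the close handler
console.log(sendSafely(closedWs, 'x')); // false (skipped before sending)
```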

RJ4100 commented 3 years ago

Yep. I like to avoid try and catch as much as possible in early development and then add try/catch at the end around bigger pieces to account for everything else the code could've missed.

One other thing is that I find clients disconnect during the night, although I send a ping every 50 seconds and keepAlive is 60. Should the server be sending a pong back to keep the client alive? If, let's say, the client script becomes unresponsive, is there a way to refresh it, like a window reload?

ghost commented 3 years ago

I'm trying to use the library in a worker thread, but I can't seem to do a graceful shutdown when the worker dies due to an uncaught exception

And

Yep. I like to avoid try and catch as much as possible in early development

Might be related? No? I would say go fix your app and don't expect software to magically solve your mistakes "gracefully".

RJ4100 commented 3 years ago

@alexhultman Yeah, absolutely, looking at it from a js perspective.

hst-m commented 3 years ago

Might be related? No? I would say go fix your app and don't expect software to magically solve your mistakes "gracefully".

you quoted 2 different people :P

RJ4100 commented 3 years ago

Haha that perspective is dominating :)

RJ4100 commented 3 years ago

Kidding aside, js is a cookie eater. I'm thinking of switching to C++ for version 2.

hst-m commented 3 years ago

Ideally you don't have any errors, but if the thread exits it would be nice to not drag the process down with it. I did more testing, and doing as mentioned above, uws.us_listen_socket_close(listenSocket) won't prevent the crash; after a websocket opens there is still some handle/poll open. uWS could clean up whatever is open in AddEnvironmentCleanupHook, but maybe that is difficult to do, I don't know how it works

RJ4100 commented 3 years ago

I think the call to ws.send should just be dropped for broken handles, but yeah, maybe it would look like spaghetti inside the C++

hst-m commented 3 years ago

This issue does not require any errors: if you just want to exit the thread with process.exit(0), you can't do it without a crash after websockets have been opened etc. So the talk about handling errors is not relevant; you still need a way to clean up the thread if you want to exit