murat-dogan / node-datachannel

WebRTC For Node.js and Electron. libdatachannel node bindings.
Mozilla Public License 2.0
297 stars 57 forks source link

Callbacks preventing garbage collection of objects #223

Closed achingbrain closed 4 months ago

achingbrain commented 7 months ago

This is a follow-on to #215, I think the problem still exists, even after node-datachannel@0.5.4.

These references can keep threads alive:

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/node_modules/node-datachannel/polyfill/RTCPeerConnection.js:63
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:40           
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:62           

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/node_modules/node-datachannel/polyfill/RTCPeerConnection.js:67
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:40           
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:62           

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/node_modules/node-datachannel/polyfill/RTCPeerConnection.js:71
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:40           
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:62           

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/node_modules/node-datachannel/polyfill/RTCPeerConnection.js:75
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:40           
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:62           

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/node_modules/node-datachannel/polyfill/RTCPeerConnection.js:79
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:40           
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:62           

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/node_modules/node-datachannel/polyfill/RTCPeerConnection.js:85
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:40           
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:62           

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/node_modules/node-datachannel/polyfill/RTCPeerConnection.js:95
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:40           
file:///Users/alex/Documents/Workspaces/libp2p/js-libp2p/packages/transport-webrtc/dist/test/peer.spec.js:62   

The referenced lines in RTCPeerConnection are all callbacks registered with the C++ object, for example line 63:

# ThreadSafeCallback callback
node:internal/async_hooks:202                                                                                          
file:///Users/alex/.../node_modules/node-datachannel/polyfill/RTCPeerConnection.js:63

https://github.com/murat-dogan/node-datachannel/blob/d866015dea085164fa110e34d4e5d86a8cbaa050/polyfill/RTCPeerConnection.js#L63

If I change the .close method of RTCPeerConnection the problem goes away:

close() {
  // close all channels before shutting down
  this.#dataChannels.forEach((channel) => {
    channel.close();
    });

    this.#peerConnection.close();
    this.#peerConnection.destroy();  // <-- add this line
}

I think this is because doDestroy calls doResetCallbacks which releases references to the JS callbacks.

To replicate, clone js-libp2p, apply this diff:

diff --git a/packages/transport-webrtc/package.json b/packages/transport-webrtc/package.json
index b38cbc5aa..01f03204e 100644
--- a/packages/transport-webrtc/package.json
+++ b/packages/transport-webrtc/package.json
@@ -39,8 +39,8 @@
   "scripts": {
     "generate": "protons src/private-to-private/pb/message.proto src/pb/message.proto",
     "build": "aegir build",
-    "test": "aegir test -t node -t browser -- --exit",
-    "test:node": "aegir test -t node --cov -- --exit",
+    "test": "aegir test -t node -t browser",
+    "test:node": "aegir test -t node --cov",
     "test:chrome": "aegir test -t browser --cov",
     "test:firefox": "aegir test -t browser -- --browser firefox",
     "lint": "aegir lint",
@@ -65,7 +65,7 @@
     "it-stream-types": "^2.0.1",
     "multiformats": "^13.1.0",
     "multihashes": "^4.0.3",
-    "node-datachannel": "^0.5.3",
+    "node-datachannel": "^0.5.4",
     "p-defer": "^4.0.0",
     "p-event": "^6.0.0",
     "p-timeout": "^6.1.2",
@@ -73,7 +73,8 @@
     "race-signal": "^1.0.2",
     "react-native-webrtc": "^118.0.1",
     "uint8arraylist": "^2.4.8",
-    "uint8arrays": "^5.0.2"
+    "uint8arrays": "^5.0.2",
+    "why-is-node-running": "^2.2.2"
   },
   "devDependencies": {
     "@chainsafe/libp2p-yamux": "^6.0.2",
diff --git a/packages/transport-webrtc/test/basics.spec.ts b/packages/transport-webrtc/test/basics.spec.ts
index c54488a7a..c649a9312 100644
--- a/packages/transport-webrtc/test/basics.spec.ts
+++ b/packages/transport-webrtc/test/basics.spec.ts
@@ -1,5 +1,8 @@
 /* eslint-disable @typescript-eslint/no-unused-expressions */

+// @ts-expect-error no types
+import log from 'why-is-node-running'
+
 import { noise } from '@chainsafe/libp2p-noise'
 import { yamux } from '@chainsafe/libp2p-yamux'
 import { circuitRelayTransport } from '@libp2p/circuit-relay-v2'
@@ -19,6 +22,10 @@ import pRetry from 'p-retry'
 import { webRTC } from '../src/index.js'
 import type { Libp2p, Connection, Stream, StreamHandler } from '@libp2p/interface'

+setTimeout(() => {
+  log()
+}, 5000).unref()
+
 async function createNode (): Promise<Libp2p> {
   return createLibp2p({
     addresses: {

Then install & build and run the node tests:

% cd js-libp2p
% npm i && npm run build
% cd packages/transport-webrtc
% npm run test:node -- --grep 'should connect'

You should see one test run, a brief pause then a list of all the handles keeping the process running:

% npm run test:node -- --grep 'should connect'

> @libp2p/webrtc@4.0.21 test:node
> aegir test -t node --cov --grep should connect

build

> @libp2p/webrtc@4.0.21 build
> aegir build

[10:37:34] tsc [started]
[10:37:35] tsc [completed]
[10:37:35] esbuild [started]
[10:37:35] esbuild [completed]
test node.js
Warning: Cannot find any files matching pattern "test/node.{js,cjs,mjs}"
Warning: Cannot find any files matching pattern "test/**/*.spec.{js,cjs,mjs}"
Warning: Cannot find any files matching pattern "dist/test/node.{js,cjs,mjs}"

  webrtc basic
    ✔ should connect (84ms)

  1 passing (92ms)

There are 43 handle(s) keeping the process running

# FILEHANDLE
node:internal/async_hooks:202

... lots of output here

Ignore the FILEHANDLE and KEYPAIRGENREQUEST entries - it's the ThreadSafeCallback callback entries causing the problem.

achingbrain commented 7 months ago

It's worth noting that calling the exported cleanup function after the test run makes the process exit successfully.

From what I can see, cleanup comes from RTCWrapper which closes all currently open RTCPeerConnections, then resets all of their callbacks which lets the process exit.

Give that you can't re-open a RTCPeerConnection that's been closed, it seems reasonable to not expect any events to fire after it's been closed, so resetting all the callbacks on close (instead of just destroy) would be ok?

murat-dogan commented 7 months ago

Destroying or resetting callbacks will do the same.

But the point is that; if you call destroy or reset callbacks then you will not receive closed state change, which is not good. That is why I suggested calling cleanup when you don't need the lib.

achingbrain commented 7 months ago

Calling cleanup isn't a solution for a server-style application that may service hundreds of thousands or millions of connections during it's lifetime.

If I run this script with the --trace-gc node flag I can see that garbage collection occurs, but on my laptop it runs out of memory after a minute or so:

import { RTCPeerConnection } from 'node-datachannel/polyfill'

while (true) {
  let conn = new RTCPeerConnection()
  conn.close()
  conn = null
}

The WebRTC spec says that calling close on a RTCPeerConnection tears down any associated resources, so it should become garbage collectable after that.

If it doesn't then this is a memory leak.

murat-dogan commented 7 months ago

Hello,

If GC is running then it should call the destructor and reset all callbacks etc. But the absent point is we are not assigning the underlying object to null.

this.#peerConnection = null;

Could you please check and try this? https://github.com/murat-dogan/node-datachannel/pull/224

achingbrain commented 7 months ago

With #224 applied the above script still runs out of memory after a minute or so.

Interestingly I was trying to make it be kinder to the garbage collector by waiting for the connection to close before creating a new one with:

while (true) {
  const conn = new RTCPeerConnection()
  conn.close()

  await new Promise(resolve => {
    conn.addEventListener('connectionstatechange', () => {
      if (conn.connectionState === 'closed') {
        resolve()
      }
    })
  })
}

with this it runs for a few iterations then the process exits with code 13 and no output:

% node index.js
% echo $?
13

If I only create a single RTCPeerConnection #224 does let the process exit, though it takes a long time:

import { RTCPeerConnection } from 'node-datachannel/polyfill'

const conn = new RTCPeerConnection()
conn.addEventListener('connectionstatechange', () => {
  if (conn.connectionState === 'closed') {
    console.timeEnd('close')
  }
})

console.time('exit')
console.time('close')
conn.close()

process.addListener('exit', () => {
  console.timeEnd('exit')
  process.exit(0)
})

8 seconds to close!

% node index.js
close: 0.449ms
exit: 8.100s

Applying this diff to #224 makes the closing almost instant:

diff --git a/polyfill/RTCPeerConnection.js b/polyfill/RTCPeerConnection.js
index db3ae16..171c10e 100644
--- a/polyfill/RTCPeerConnection.js
+++ b/polyfill/RTCPeerConnection.js
@@ -61,6 +61,11 @@ export default class _RTCPeerConnection extends EventTarget {

         // forward peerConnection events
         this.#peerConnection.onStateChange(() => {
+            if (this.connectionState === 'closed') {
+                this.#peerConnection.destroy();
+                this.#peerConnection = null;
+            }
+
             this.dispatchEvent(new Event('connectionstatechange'));
         });

@@ -227,7 +232,6 @@ export default class _RTCPeerConnection extends EventTarget {
         });

         this.#peerConnection?.close();
-        this.#peerConnection = null;
     }

     createAnswer() {

E.g. instead of nulling this.#peerConnection in the .close() method, we wait for the connection state to change to 'closed', then reset the callbacks and remove the reference to the C++ object.

% node index.js
close: 0.436ms
exit: 5.802ms

5ms to close, much faster.


As an alternative to #224, this diff applied against master also solves the problem:

diff --git a/polyfill/RTCPeerConnection.js b/polyfill/RTCPeerConnection.js
index be9913b..08accf1 100644
--- a/polyfill/RTCPeerConnection.js
+++ b/polyfill/RTCPeerConnection.js
@@ -61,6 +61,10 @@ export default class _RTCPeerConnection extends EventTarget {

         // forward peerConnection events
         this.#peerConnection.onStateChange(() => {
+            if (this.connectionState === 'closed') {
+                this.#peerConnection.destroy();
+            }
+
             this.dispatchEvent(new Event('connectionstatechange'));
         });

Setting this.#peerConnection to null doesn't appear to be necessary, since the only reference to the C++ object is the internal field of the RTCPeerConnection instance, once that instance has no references it can be collected along with the underlying C++ object.

Really I think the root cause is the use of Napi::Persistent in ThreadSafeCallback - it's purpose is to prevent the wrapped object from being garbage collected so we have to tell it when we are done.

murat-dogan commented 7 months ago

@achingbrain Thanks for the detailed investigation.

To summarize things;

According to this info, we need to select an option.

murat-dogan commented 7 months ago

I tried with Chrome. If you call the close function of a peer-connection, the peer does not receive any closed event. The other peer receives a disconnected event.

So it seems we can call the destroy function directly within the close function. What do you say?

<!DOCTYPE html>
<html lang="en-US">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>RTCPeerConnection Example</title>
    <script>
        let peer1 = new RTCPeerConnection();
        let peer2 = new RTCPeerConnection();

        peer1.onicecandidate = ({candidate}) => candidate && peer2.addIceCandidate(candidate);
        peer2.onicecandidate = ({candidate}) => candidate && peer1.addIceCandidate(candidate);

        peer1.onconnectionstatechange = () => console.log('Peer1 State:', peer1.connectionState);
        peer2.onconnectionstatechange = () => console.log('Peer2 State:', peer2.connectionState);

        let channel = peer1.createDataChannel("mychannel");
        channel.onopen = () => console.log('Data channel is open');
        channel.onmessage = ({data}) => console.log('Received Message:', data);

        peer2.ondatachannel = ({channel}) => {
            channel.onmessage = ({data}) => console.log('Received Message:', data);
            channel.onopen = () => {
                console.log('Data channel is open');
                channel.send('Hello from Peer2!');
            };
        };

        peer1.createOffer().then(offer => {
            return peer1.setLocalDescription(offer);
        }).then(() => {
            return peer2.setRemoteDescription(peer1.localDescription);
        }).then(() => {
            return peer2.createAnswer();
        }).then(answer => {
            return peer2.setLocalDescription(answer);
        }).then(() => {
            return peer1.setRemoteDescription(peer2.localDescription);
        }).catch(console.error);

        setTimeout(() => {
            peer1.close();
        }, 2000);
    </script>
</head>

<body>
</body>

</html>
paullouisageneau commented 7 months ago

Indeed, with the Web API, RTCPeerConnection.close() is synchronous and does not fire any event (see specification). I think it makes sense to align the behavior and call the destroy function directly in the close function.

murat-dogan commented 7 months ago

I have just published v0.5.5 Could you please try it?

achingbrain commented 6 months ago

Sorry for the delay in replying, I've been AFK for a little while.

If I run:

import { RTCPeerConnection } from 'node-datachannel/polyfill'

const conn = new RTCPeerConnection()
conn.addEventListener('connectionstatechange', () => {
  if (conn.connectionState === 'closed') {
    console.timeEnd('close')
  }
})

console.time('exit')
console.time('close')
conn.close()

process.addListener('exit', () => {
  console.timeEnd('exit')
  process.exit(0)
})

The process now exits almost immediately, so the initially reported issue is fixed, thanks for pushing that release. 🎉 🚀

However this script still causes the process to run out of memory after a minute or so:

import { RTCPeerConnection } from 'node-datachannel/polyfill'

while (true) {
  const conn = new RTCPeerConnection()
  conn.close()
}

...and this script causes the process to terminate unexpectedly with a non-zero error code (it should run forever due to no connectionstatechange event being emitted):

import { RTCPeerConnection } from 'node-datachannel/polyfill'

while (true) {
  const conn = new RTCPeerConnection()
  conn.close()

  await new Promise(resolve => {
    conn.addEventListener('connectionstatechange', () => {
      if (conn.connectionState === 'closed') {
        resolve()
      }
    })
  })
}
murat-dogan commented 6 months ago
// test.js
import { RTCPeerConnection } from 'node-datachannel/polyfill'

let i=0;
while (i++<100*1000) {
  const conn = new RTCPeerConnection()
  conn.close()
}

Run Result

murat@murat-ThinkBook:~/js/node-datachannel$ free -h
               total        used        free      shared  buff/cache   available
Mem:            13Gi       3.7Gi       5.4Gi        90Mi       4.5Gi       9.5Gi
Swap:           14Gi       322Mi        14Gi
murat@murat-ThinkBook:~/js/node-datachannel$ node test.js 
murat@murat-ThinkBook:~/js/node-datachannel$ free -h
               total        used        free      shared  buff/cache   available
Mem:            13Gi       3.6Gi       5.4Gi        92Mi       4.5Gi       9.5Gi
Swap:           14Gi       322Mi        14Gi

Same result as @achingbrain for while(true). But not sure if this is really a problem. For me, it is related to GC

murat-dogan commented 5 months ago

I have just released v0.9.0 https://github.com/murat-dogan/node-datachannel/releases/tag/v0.9.0

Should we close also this issue?

murat-dogan commented 4 months ago

I am closing the issue. Please feel free to open if you need to.