open-telemetry / opentelemetry-js-contrib

OpenTelemetry instrumentation for JavaScript modules
https://opentelemetry.io
Apache License 2.0

Memory leak in instrumentation-pg #2479

Closed: yandrushchak closed this 18 minutes ago

yandrushchak commented 5 days ago

What version of OpenTelemetry are you using?

@opentelemetry/instrumentation-pg: 0.45.1

What version of Node are you using?

v20.15.1

What did you do?

Used @opentelemetry/instrumentation-pg with TypeORM and pg-pool. Received a warning about a possible memory leak, which happens because listeners are attached to the pool on every query. Partial stack trace:

MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 connect listeners added to [BoundPool]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
    at genericNodeError (node:internal/errors:984:15)
    at wrappedFn (node:internal/errors:538:14)
    at _addListener (node:events:593:17)
    at BoundPool.addListener (node:events:611:10)
    at BoundPool.connect (c:/test/node_modules\@opentelemetry\instrumentation-pg\src\instrumentation.ts:461:12)
    at <anonymous> (c:/test/node_modules\typeorm\driver\src\driver\postgres\PostgresDriver.ts:1175:25)
    at new Promise (<anonymous>)
    at PostgresDriver.obtainMasterConnection (c:/test/node_modules\typeorm\driver\src\driver\postgres\PostgresDriver.ts:1174:16)
    at PostgresQueryRunner.connect (c:/test/node_modules\typeorm\driver\src\driver\postgres\PostgresQueryRunner.ts:111:18)
    at PostgresQueryRunner.query (c:/test/node_modules\typeorm\driver\src\driver\postgres\PostgresQueryRunner.ts:248:47)
....
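
For reference, the listener build-up can be reproduced without TypeORM by repeatedly connecting through an instrumented pg.Pool. The sketch below is illustrative only; it assumes a reachable Postgres in DATABASE_URL and an otherwise standard setup.

    import { registerInstrumentations } from '@opentelemetry/instrumentation';
    import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';

    // Register the instrumentation before 'pg' is loaded so BoundPool.connect gets patched.
    registerInstrumentations({ instrumentations: [new PgInstrumentation()] });

    // eslint-disable-next-line @typescript-eslint/no-var-requires
    const { Pool } = require('pg');

    async function main() {
      const pool = new Pool({ connectionString: process.env.DATABASE_URL });
      // Every pool.connect() goes through the patched connect, which attaches
      // another set of listeners; per the warning above, Node complains once
      // the 11th listener for the same event is added.
      for (let i = 0; i < 15; i++) {
        const client = await pool.connect();
        await client.query('SELECT 1');
        client.release();
      }
      await pool.end();
    }

    main().catch(console.error);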

What did you expect to see?

No memory leak.

Additional context

This issue was introduced in this commit: https://github.com/open-telemetry/opentelemetry-js-contrib/commit/12adb4354f09ade438cd96340bdfd1f715b5fed3#diff-1360de1aed58c38c5cbd3a1759a8af761049ea2c562a477964623fb2358b2730R379 Since pg.Pool.connect is called for each query to obtain a connection, duplicate event listeners are attached each time, which would probably also lead to wrong metrics data.

It seems that, to fix it, an additional check is needed to make sure the event listeners are added only on the first connect call.
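
For illustration only (not the actual patch), a guard along those lines could look like the sketch below; the ensurePoolListeners helper and the WeakSet bookkeeping are hypothetical.

    import type { EventEmitter } from 'events';

    // Hypothetical bookkeeping: pools whose metrics listeners are already attached.
    const instrumentedPools = new WeakSet<EventEmitter>();

    function ensurePoolListeners(pool: EventEmitter, updateMetrics: () => void): void {
      // Subsequent connect() calls on the same pool become a no-op here.
      if (instrumentedPools.has(pool)) return;
      instrumentedPools.add(pool);

      // Events emitted by pg-pool as connections are created / checked out / removed.
      for (const event of ['connect', 'acquire', 'remove'] as const) {
        pool.on(event, updateMetrics);
      }
    }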

astrot1988 commented 1 day ago

Same issue.

clintonb commented 1 day ago

We observed this issue after upgrading from @opentelemetry/auto-instrumentations-node 0.46.1 to 0.50.1.

clintonb commented 1 day ago

Here is a stack trace from our container startup. The line in instrumentation.js corresponds with:

https://github.com/open-telemetry/opentelemetry-js-contrib/blob/ad560dfc5f662d418a74b0b197b3f48e4ae002d0/plugins/node/opentelemetry-instrumentation-pg/src/instrumentation.ts#L454-L461

(node:1) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 remove listeners added to [BoundPool]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
    at new Promise (<anonymous>)
    at /srv/node_modules/.pnpm/@nestjs+terminus@10.2.3_drcoupryhn6e72cgxfoljijmlu/node_modules/@nestjs/terminus/dist/health-indicator/database/typeorm.health.js:17:71
    at Generator.next (<anonymous>)
    at TypeOrmHealthIndicator.<anonymous> (/srv/node_modules/.pnpm/@nestjs+terminus@10.2.3_drcoupryhn6e72cgxfoljijmlu/node_modules/@nestjs/terminus/dist/health-indicator/database/typeorm.health.js:134:28)
    at TypeOrmHealthIndicator.pingDb (/srv/node_modules/.pnpm/@nestjs+terminus@10.2.3_drcoupryhn6e72cgxfoljijmlu/node_modules/@nestjs/terminus/dist/health-indicator/database/typeorm.health.js:94:16)
    at __awaiter (/srv/node_modules/.pnpm/@nestjs+terminus@10.2.3_drcoupryhn6e72cgxfoljijmlu/node_modules/@nestjs/terminus/dist/health-indicator/database/typeorm.health.js:13:12)
    at new Promise (<anonymous>)
    at /srv/node_modules/.pnpm/@nestjs+terminus@10.2.3_drcoupryhn6e72cgxfoljijmlu/node_modules/@nestjs/terminus/dist/health-indicator/database/typeorm.health.js:17:71
    at Generator.next (<anonymous>)
    at TypeOrmHealthIndicator.<anonymous> (/srv/node_modules/.pnpm/@nestjs+terminus@10.2.3_drcoupryhn6e72cgxfoljijmlu/node_modules/@nestjs/terminus/dist/health-indicator/database/typeorm.health.js:107:40)
    at DataSource.<anonymous> (/srv/node_modules/.pnpm/opentelemetry-instrumentation-typeorm@0.41.0_@opentelemetry+api@1.9.0/node_modules/opentelemetry-instrumentation-typeorm/dist/src/typeorm.js:267:42)
    at ContextAPI.with (/srv/node_modules/.pnpm/@opentelemetry+api@1.9.0/node_modules/@opentelemetry/api/build/src/api/context.js:60:46)
    at AsyncLocalStorageContextManager.with (/srv/node_modules/.pnpm/@opentelemetry+context-async-hooks@1.26.0_@opentelemetry+api@1.9.0/node_modules/@opentelemetry/context-async-hooks/build/src/AsyncLocalStorageContextManager.js:33:40)
    at AsyncLocalStorage.run (node:async_hooks:346:14)
    at /srv/node_modules/.pnpm/opentelemetry-instrumentation-typeorm@0.41.0_@opentelemetry+api@1.9.0/node_modules/opentelemetry-instrumentation-typeorm/dist/src/typeorm.js:267:82
    at TypeormInstrumentation._endSpan (/srv/node_modules/.pnpm/opentelemetry-instrumentation-typeorm@0.41.0_@opentelemetry+api@1.9.0/node_modules/opentelemetry-instrumentation-typeorm/dist/src/typeorm.js:283:30)
    at /srv/node_modules/.pnpm/opentelemetry-instrumentation-typeorm@0.41.0_@opentelemetry+api@1.9.0/node_modules/opentelemetry-instrumentation-typeorm/dist/src/typeorm.js:267:106
    at DataSource.query (/srv/node_modules/.pnpm/typeorm@0.3.17_better-sqlite3@9.4.0_ioredis@5.3.2_pg-query-stream@4.5.3_pg@8.11.3__pg@8.11.3__nbwcw7yzxgjrpbcbytvl7a4bfy/node_modules/typeorm/data-source/DataSource.js:345:42)
    at PostgresQueryRunner.query (/srv/node_modules/.pnpm/typeorm@0.3.17_better-sqlite3@9.4.0_ioredis@5.3.2_pg-query-stream@4.5.3_pg@8.11.3__pg@8.11.3__nbwcw7yzxgjrpbcbytvl7a4bfy/node_modules/typeorm/driver/postgres/PostgresQueryRunner.js:174:47)
    at PostgresQueryRunner.connect (/srv/node_modules/.pnpm/typeorm@0.3.17_better-sqlite3@9.4.0_ioredis@5.3.2_pg-query-stream@4.5.3_pg@8.11.3__pg@8.11.3__nbwcw7yzxgjrpbcbytvl7a4bfy/node_modules/typeorm/driver/postgres/PostgresQueryRunner.js:68:18)
    at PostgresDriver.obtainMasterConnection (/srv/node_modules/.pnpm/typeorm@0.3.17_better-sqlite3@9.4.0_ioredis@5.3.2_pg-query-stream@4.5.3_pg@8.11.3__pg@8.11.3__nbwcw7yzxgjrpbcbytvl7a4bfy/node_modules/typeorm/driver/postgres/PostgresDriver.js:870:16)
    at new Promise (<anonymous>)
    at /srv/node_modules/.pnpm/typeorm@0.3.17_better-sqlite3@9.4.0_ioredis@5.3.2_pg-query-stream@4.5.3_pg@8.11.3__pg@8.11.3__nbwcw7yzxgjrpbcbytvl7a4bfy/node_modules/typeorm/driver/postgres/PostgresDriver.js:871:25
    at BoundPool.connect (/srv/node_modules/.pnpm/@opentelemetry+instrumentation-pg@0.45.1_@opentelemetry+api@1.9.0/node_modules/@opentelemetry/instrumentation-pg/build/src/instrumentation.js:306:22)
    at BoundPool.addListener (node:events:611:10)
    at _addListener (node:events:593:17)
    at wrappedFn (node:internal/errors:538:14)
    at genericNodeError (node:internal/errors:984:15)

thecopy commented 1 day ago

Same issue, and our database calls are now very, very slow as well. Downgrading to auto-instrumentations-node v0.48.0 solved this for us.

yandrushchak commented 1 day ago

As a workaround, downgrading auto-instrumentations-node to 0.50.0 solves it, since that version references "@opentelemetry/instrumentation-pg": "^0.44.0", which doesn't have this issue.

pichlermarc commented 1 day ago

FYI @maryliag (component owner)

maryliag commented 1 day ago

Hi everyone, thank you for raising this issue! Indeed, I made a mistake and was adding several event listeners to the same pool. Although this is obviously not ideal, I want to clarify this comment:

duplicated event listeners are attached each time, which would probably also lead to wrong metrics data.

This (wrong metric data) is not happening, because I did notice the updater was getting called a few times, so I made sure it always sends the correct value, even if called multiple times. I just didn't realize why it was getting called multiple times 😓
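
For illustration, an updater shaped like an observable-gauge callback stays correct even if it is invoked (or registered) more than once, because it reads the pool's current counters instead of accumulating. A rough sketch only, not the actual instrumentation code; the metric name and attributes here are just examples.

    import { metrics } from '@opentelemetry/api';
    import { Pool } from 'pg';

    const pool = new Pool();
    const meter = metrics.getMeter('pg-pool-metrics-example');

    // Example instrument name; the instrumentation's real names/attributes may differ.
    const connectionCount = meter.createObservableGauge('db.client.connection.count');

    connectionCount.addCallback(result => {
      // Reads current pool state on every collection, so duplicate callbacks
      // report the same numbers rather than double-counting.
      result.observe(pool.idleCount, { state: 'idle' });
      result.observe(pool.totalCount - pool.idleCount, { state: 'used' });
    });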

Thanks to the input from this issue, I was able to create a fix for it, along with some minor performance improvements, in https://github.com/open-telemetry/opentelemetry-js-contrib/pull/2484

If you continue seeing this issue on a version with this fix, please let me know!

zuk commented 23 hours ago

Thanks @maryliag !

We've been seeing some serious performance issues that coincidentally began right after we updated our opentelemetry deps.

Specifically, over the course of several hours, db-pool.connect durations would get slower and slower, growing from a few ms to 2+ seconds, until eventually the whole process came to a halt waiting for db connections to open. Rolling back @opentelemetry/auto-instrumentations-node ^0.50.1 -> 0.48.0 fixed the problem for now.

julioantonini commented 22 hours ago

Hi guys! I was facing the same issue, so for now I downgraded @opentelemetry/auto-instrumentations-node to version 0.49.0. This temporarily fixed the problem until a hotfix is released:

"@opentelemetry/auto-instrumentations-node": "0.49.0"

maryliag commented 18 hours ago

durations would get slower and slower, growing from a few ms to 2+ seconds, until eventually the whole process came to a halt

This seems to align with new listeners being added on every new connection, which is what the PR I created fixes, so hopefully it will solve your problem. I appreciate everybody's patience until this gets merged and added to a new release 🙏 And I apologize for any problems this might have caused.

Also, I hope the new pool-related metrics will be helpful 😄

peterabbott commented 14 hours ago

Experienced this too; we had to roll back to 0.44.0 of instrumentation-pg. It caused a significant performance impact on one of our systems.

peterabbott commented 14 hours ago

To add: when attaching event listeners like this, there should be a matching remove to avoid this build-up (see the sketch below).

https://github.com/open-telemetry/opentelemetry-js-contrib/commit/12adb4354f09ade438cd96340bdfd1f715b5fed3#diff-1360de1aed58c3[…]4623fb2358b2730R379

It would be nice if the metrics could be opt-in.
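
One shape such a matching removal could take, assuming the instrumentation keeps references to the handlers it attaches so they can be detached later (for example when the instrumentation is disabled); a sketch with hypothetical names, not the actual code:

    import type { EventEmitter } from 'events';

    // Remember which handler was attached to which pool so it can be removed later.
    const attachedHandlers = new WeakMap<EventEmitter, () => void>();

    function attachOnce(pool: EventEmitter, handler: () => void): void {
      if (attachedHandlers.has(pool)) return;
      attachedHandlers.set(pool, handler);
      pool.on('connect', handler);
    }

    function detach(pool: EventEmitter): void {
      const handler = attachedHandlers.get(pool);
      if (!handler) return;
      pool.removeListener('connect', handler);
      attachedHandlers.delete(pool);
    }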

maryliag commented 12 hours ago

they should be a matching remove to avoid this

When do you suggest the removal would occur?

It would be nice of the metrics could be an opt-in

I updated the metrics generation to be behind a feature flag, OTEL_SEMCONV_STABILITY_OPT_IN, in the PR mentioned above; it will need to be set to database or database/dup.
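
If it ships as described, the flag would be set like any other OpenTelemetry environment variable before the SDK and instrumentations initialize; a minimal sketch, with the accepted values taken from the comment above:

    // Must run (or be exported in the shell) before the SDK / instrumentation-pg
    // is initialized. Accepted values per the comment above: 'database' or 'database/dup'.
    process.env.OTEL_SEMCONV_STABILITY_OPT_IN = 'database/dup';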