Closed: yandrushchak closed this issue 18 minutes ago
same issue
We observed this issue after upgrading `@opentelemetry/auto-instrumentations-node` from 0.46.1 to 0.50.1.
Here is a stack trace from our container startup. The line in `instrumentation.js` corresponds with:
```
(node:1) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 remove listeners added to [BoundPool]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
    at new Promise (<anonymous>)
    at /srv/node_modules/.pnpm/@nestjs+terminus@10.2.3_drcoupryhn6e72cgxfoljijmlu/node_modules/@nestjs/terminus/dist/health-indicator/database/typeorm.health.js:17:71
    at Generator.next (<anonymous>)
    at TypeOrmHealthIndicator.<anonymous> (/srv/node_modules/.pnpm/@nestjs+terminus@10.2.3_drcoupryhn6e72cgxfoljijmlu/node_modules/@nestjs/terminus/dist/health-indicator/database/typeorm.health.js:134:28)
    at TypeOrmHealthIndicator.pingDb (/srv/node_modules/.pnpm/@nestjs+terminus@10.2.3_drcoupryhn6e72cgxfoljijmlu/node_modules/@nestjs/terminus/dist/health-indicator/database/typeorm.health.js:94:16)
    at __awaiter (/srv/node_modules/.pnpm/@nestjs+terminus@10.2.3_drcoupryhn6e72cgxfoljijmlu/node_modules/@nestjs/terminus/dist/health-indicator/database/typeorm.health.js:13:12)
    at new Promise (<anonymous>)
    at /srv/node_modules/.pnpm/@nestjs+terminus@10.2.3_drcoupryhn6e72cgxfoljijmlu/node_modules/@nestjs/terminus/dist/health-indicator/database/typeorm.health.js:17:71
    at Generator.next (<anonymous>)
    at TypeOrmHealthIndicator.<anonymous> (/srv/node_modules/.pnpm/@nestjs+terminus@10.2.3_drcoupryhn6e72cgxfoljijmlu/node_modules/@nestjs/terminus/dist/health-indicator/database/typeorm.health.js:107:40)
    at DataSource.<anonymous> (/srv/node_modules/.pnpm/opentelemetry-instrumentation-typeorm@0.41.0_@opentelemetry+api@1.9.0/node_modules/opentelemetry-instrumentation-typeorm/dist/src/typeorm.js:267:42)
    at ContextAPI.with (/srv/node_modules/.pnpm/@opentelemetry+api@1.9.0/node_modules/@opentelemetry/api/build/src/api/context.js:60:46)
    at AsyncLocalStorageContextManager.with (/srv/node_modules/.pnpm/@opentelemetry+context-async-hooks@1.26.0_@opentelemetry+api@1.9.0/node_modules/@opentelemetry/context-async-hooks/build/src/AsyncLocalStorageContextManager.js:33:40)
    at AsyncLocalStorage.run (node:async_hooks:346:14)
    at /srv/node_modules/.pnpm/opentelemetry-instrumentation-typeorm@0.41.0_@opentelemetry+api@1.9.0/node_modules/opentelemetry-instrumentation-typeorm/dist/src/typeorm.js:267:82
    at TypeormInstrumentation._endSpan (/srv/node_modules/.pnpm/opentelemetry-instrumentation-typeorm@0.41.0_@opentelemetry+api@1.9.0/node_modules/opentelemetry-instrumentation-typeorm/dist/src/typeorm.js:283:30)
    at /srv/node_modules/.pnpm/opentelemetry-instrumentation-typeorm@0.41.0_@opentelemetry+api@1.9.0/node_modules/opentelemetry-instrumentation-typeorm/dist/src/typeorm.js:267:106
    at DataSource.query (/srv/node_modules/.pnpm/typeorm@0.3.17_better-sqlite3@9.4.0_ioredis@5.3.2_pg-query-stream@4.5.3_pg@8.11.3__pg@8.11.3__nbwcw7yzxgjrpbcbytvl7a4bfy/node_modules/typeorm/data-source/DataSource.js:345:42)
    at PostgresQueryRunner.query (/srv/node_modules/.pnpm/typeorm@0.3.17_better-sqlite3@9.4.0_ioredis@5.3.2_pg-query-stream@4.5.3_pg@8.11.3__pg@8.11.3__nbwcw7yzxgjrpbcbytvl7a4bfy/node_modules/typeorm/driver/postgres/PostgresQueryRunner.js:174:47)
    at PostgresQueryRunner.connect (/srv/node_modules/.pnpm/typeorm@0.3.17_better-sqlite3@9.4.0_ioredis@5.3.2_pg-query-stream@4.5.3_pg@8.11.3__pg@8.11.3__nbwcw7yzxgjrpbcbytvl7a4bfy/node_modules/typeorm/driver/postgres/PostgresQueryRunner.js:68:18)
    at PostgresDriver.obtainMasterConnection (/srv/node_modules/.pnpm/typeorm@0.3.17_better-sqlite3@9.4.0_ioredis@5.3.2_pg-query-stream@4.5.3_pg@8.11.3__pg@8.11.3__nbwcw7yzxgjrpbcbytvl7a4bfy/node_modules/typeorm/driver/postgres/PostgresDriver.js:870:16)
    at new Promise (<anonymous>)
    at /srv/node_modules/.pnpm/typeorm@0.3.17_better-sqlite3@9.4.0_ioredis@5.3.2_pg-query-stream@4.5.3_pg@8.11.3__pg@8.11.3__nbwcw7yzxgjrpbcbytvl7a4bfy/node_modules/typeorm/driver/postgres/PostgresDriver.js:871:25
    at BoundPool.connect (/srv/node_modules/.pnpm/@opentelemetry+instrumentation-pg@0.45.1_@opentelemetry+api@1.9.0/node_modules/@opentelemetry/instrumentation-pg/build/src/instrumentation.js:306:22)
    at BoundPool.addListener (node:events:611:10)
    at _addListener (node:events:593:17)
    at wrappedFn (node:internal/errors:538:14)
    at genericNodeError (node:internal/errors:984:15)
```
Same issue; our database calls are now very slow as well. Downgrading to auto-instrumentations-node v0.48.0 solved this issue for us.
As a workaround, downgrading `@opentelemetry/auto-instrumentations-node` to 0.50.0 solves it, since that version references `"@opentelemetry/instrumentation-pg": "^0.44.0"`, which doesn't have this issue.
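Since the stack traces in this thread show a pnpm-based install, one way to apply this workaround without touching direct dependencies is a pnpm override that pins the transitive package. This is a sketch of a `package.json` fragment; the exact version to pin depends on your dependency tree.

```json
{
  "pnpm": {
    "overrides": {
      "@opentelemetry/instrumentation-pg": "0.44.0"
    }
  }
}
```

After adding the override, re-run `pnpm install` so the lockfile picks up the pinned version.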
FYI @maryliag (component owner)
Hi everyone, thank you for raising this issue! Indeed, I made a mistake and was adding several event listeners to the same pool. Although this is obviously not ideal, I want to clarify this comment:

> duplicated event listeners are attached each time, which would probably also lead to wrong metrics data.
This (wrong metric data) is not happening, because I did notice the updater was getting called a few times, so I made sure it always sends the correct value, even if called multiple times. I just didn't realize why it was getting called multiple times 😓
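To illustrate why duplicate invocations don't corrupt the metric: an updater that reports the pool's current absolute value is idempotent, whereas one that increments a counter would drift. This is a hypothetical sketch, not the actual instrumentation code; `totalCount` and the `metric` shape are invented for illustration.

```javascript
// Sketch: an updater that records the pool's absolute size is safe to
// call multiple times; duplicate listeners just repeat the same write.
function makeUpdater(pool, metric) {
  return () => {
    metric.value = pool.totalCount; // absolute value, not an increment
  };
}

const pool = { totalCount: 7 };   // stand-in for a pg pool
const metric = { value: 0 };      // stand-in for a gauge instrument
const update = makeUpdater(pool, metric);

update();
update(); // duplicate invocation from a duplicated listener: same result
console.log(metric.value); // → 7
```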
Thanks to the input from this issue, I was able to create a fix for it, along with some minor performance improvements, in https://github.com/open-telemetry/opentelemetry-js-contrib/pull/2484
If you continue seeing this issue on a version with this fix, please let me know!
Thanks @maryliag !
We've been seeing some serious performance issues that coincidentally began right after we updated our opentelemetry deps.
Specifically, over the course of several hours, `db-pool.connect` durations would get slower and slower, eventually growing from a few ms to 2+ seconds, until the whole process would come to a halt waiting for db connections to open. Rolling back `@opentelemetry/auto-instrumentations-node` from `^0.50.1` to `0.48.0` fixed the problem for now.
Hi everyone! I was facing the same issue, so for now I downgraded `@opentelemetry/auto-instrumentations-node` to version 0.49.0. This temporarily fixed the problem until a hotfix is released:
"@opentelemetry/auto-instrumentations-node": "0.49.0"
> durations would get slower and slower, eventually growing from a few ms to 2+ seconds until eventually the whole process would come to a halt
This seems to align with new listeners being added on every new connection, which is what the PR I created fixes, so hopefully it will solve your problem. I appreciate everybody's patience until this gets merged and added to a new release 🙏 And I apologize for any problems this might have caused.
Also, hope the new metrics related to pool will be helpful 😄
Experienced this too; had to roll back to 0.44.0 of `instrumentation-pg`. It caused a significant performance impact on one of our systems.
To add: when using event listeners, there should be a matching remove to avoid this.

It would also be nice if the metrics could be opt-in.
> they should be a matching remove to avoid this

When do you suggest the removal should occur?
> It would be nice if the metrics could be opt-in

I updated the metrics generation to sit behind the feature flag OTEL_SEMCONV_STABILITY_OPT_IN in the PR I mentioned above; the flag will need to be set to `database` or `database/dup`.
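A sketch of how such a comma-separated opt-in flag is typically parsed (illustrative; not the actual instrumentation-pg code, and the variable names are invented):

```javascript
// Simulate the operator setting the flag, e.g. in the container env.
process.env.OTEL_SEMCONV_STABILITY_OPT_IN = 'database/dup';

// Parse the comma-separated opt-in list and gate the pool metrics on it.
const optIn = (process.env.OTEL_SEMCONV_STABILITY_OPT_IN || '')
  .split(',')
  .map((entry) => entry.trim());
const poolMetricsEnabled =
  optIn.includes('database') || optIn.includes('database/dup');

console.log(poolMetricsEnabled); // → true
```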
What version of OpenTelemetry are you using?
@opentelemetry/instrumentation-pg: 0.45.1
What version of Node are you using?
v20.15.1
What did you do?
Used `@opentelemetry/instrumentation-pg` with TypeORM and pg-pool. Received a warning about a possible memory leak, which happens because listeners are constantly attached for each query. Partial stack trace:

What did you expect to see?
No memory leak.
Additional context
This issue was introduced in this commit: https://github.com/open-telemetry/opentelemetry-js-contrib/commit/12adb4354f09ade438cd96340bdfd1f715b5fed3#diff-1360de1aed58c38c5cbd3a1759a8af761049ea2c562a477964623fb2358b2730R379. Since `pg.Pool.connect` is called for each query to obtain a connection, duplicated event listeners are attached each time, which would probably also lead to wrong metrics data.
It seems that, in order to fix it, an additional check is needed to make sure event listeners are added only on the first `connect` call.