open-telemetry / opentelemetry-js-contrib

OpenTelemetry instrumentation for JavaScript modules
https://opentelemetry.io
Apache License 2.0
697 stars 513 forks source link

Mongodb hierarchy bug #849

Closed seblaz closed 3 months ago

seblaz commented 2 years ago

What version of OpenTelemetry are you using?

1.0.4

What version of Node are you using?

14.18.3

What did you do?

I set up an express app that calls a mongodb backend on every request. I also prepared a way to reproduce the issue with instructions in this repo.

What did you expect to see?

I expected that every express span had a mongo span associated.

What did you see instead?

Some mongo spans shared the same parentId, correlating them with an incorrect express span.

Additional context

Please have a look at this repo where a complete example can be found reproducing the issue. Notice that this happens only with a bit of load (more than 15 concurrent requests in my case).

seblaz commented 2 years ago

Hi @Flarna @vmarchaud sorry to bother you, could you reproduce the issue?

Flarna commented 2 years ago

Sorry, not time yet to look into this and I can't promise any date.

jscherer92 commented 2 years ago

I'll get this setup locally in the next few days to reproduce.

jscherer92 commented 2 years ago

@seblaz I have reproduced this on my local machine. Will take a look what is going on.

jscherer92 commented 2 years ago

@Flarna @seblaz I have found the root cause of this bug, but I do not know how we want to proceed. The MongoDB instrumentation goes to the low level protocol of MongoDB. But, the actual calls being made to these low level calls happen in an asynchronous context.

So what is happening is that the underlying wire protocol calls may be getting executed in a different context. By the time they are called, the active context has already changed. I was looking at different ways of pinning the context, but with the current implementation there is no direct way of doing this.

Any ideas would be appreciated. There are ways around this bug that can be done by the user, but it looks like we may have to look into an adapted version of the current MongoDB implementation that doesn't just wrap the low level wire protocol.

Flarna commented 2 years ago

That's usually not simple to fix. I can give some general hints but as I'm not familiar with the mongodb internals I can't provide a concrete solution. The toolbox available has api.context.bind() to wrap a function or an EventEmitter. This is helpful if there is some async API called with the right context but the callback is running on the wrong context for whatever reason. By monkey patching this API and replace the given callback by api.context.bind(api.context.active(), callback) on every call this context loss is solved. If it's not a classic callback API but an EventEmitter emitting with wrong context it's similar.

Another tool is api.context.with() to execute a function with a specific context active. It's used by api.context.bind() behind the scenes (active context is stored at wrap time and with is used to activate it on callback call time). But you can use it also without bind at whatever place you want to set a specific context as active. If some async API is not pure callback nor EventEmitter based (e.g. some hand crafted callback based iterator or similar) this is usually the way to go.

But in all cases the main task is to find the place where context gets lost and find out if this is patchable.

jscherer92 commented 2 years ago

Thank you for giving a bunch of suggestions. So far all the root locations didn't get solved by any of these methods. I am thinking another file may need to be wrapped.

As soon as the method runs and I can get a context it is the wrong context which tells me the API needs to wrapped higher up.

jscherer92 commented 2 years ago

@Flarna I think the best move forward is actually a rewrite instead of utilizing the wire protocol, we instead intercept the calls on the collection. This should give the users what they want, and it should also cover both mongo versions, those less than 3 and above 4. This is a big rewrite, but it seems to be the correct option moving forward.

@seblaz have you used the v4 client for mongo and if you have does that work better? I am trying to figure out temporary work arounds to help with this other than creating your own spans around the Mongo calls for right now.

Been looking off and on at this for a while now and this seems the be the way to make sure the calls are actually captured since this is cursor based.

Flarna commented 2 years ago

@vmarchaud any opinion on this? As far as I remember you did the main work to get the initial mongodb instrumentation.

Flarna commented 2 years ago

@jscherer92 By the way an alternative approach for issues like this would be to create PRs on mongodb driver side using AsyncResource to ensure that transactional context is preserved.

jscherer92 commented 2 years ago

I'll put an MR over to them and see if they accept. I'll link the issue

vmarchaud commented 2 years ago

Actually the description of the issue make me think of another issue that we had back in the day where the context was mixed up when the load was above a specific threshold. @jscherer92 Do you know if the issue happens with a low load (<15req/s like @seblaz mentionned) ?

jscherer92 commented 2 years ago

I got this to happen as low as 3 requests. The second mongo call got the context of the third http call because the cursor didn't get called until that context took over.

vmarchaud commented 2 years ago

@jscherer92 FYI someone already tried to implement the async resources mechanism in the past (https://github.com/mongodb/node-mongodb-native/pull/1846) but i remember another discussion that ended up saying something like "we have an APM API for this, it might be enough" (https://mongodb.github.io/node-mongodb-native/2.2/reference/management/apm/ or https://www.mongodb.com/community/forums/t/what-is-the-status-of-apm-in-mongo-node-js-native-driver/98770/2) but i might be mistaken

jscherer92 commented 2 years ago

this looks correct. So do we want to wrap what they have, or do we want to close our implementation? It looks like they are stating if you want to hook this into another telemetry solution you do it in a customized fashion. We could always make a custom wrapper around this.

I'm open to whichever we want to do. I was playing with different ways of potentially solving this issue w/o touching their source code, but it does seem that their cursor implementation is the problem and I was looking through their bugs and there have been a couple related to the context being lost but there is no traction on it.

jscherer92 commented 2 years ago

what I am going to do is create a separate MongoDB implementation wrapping their APM implementation. That way this breaks no one else and we can have one that will hold the context.

jscherer92 commented 2 years ago

Sorry for the long lead time on this, work has me working quite a bit. I am still thinking this won't fix the issue but I'm still building it. I should have more time this week to finish it up.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

dyladan commented 2 years ago

not stale. @jscherer92 any update?

jscherer92 commented 2 years ago

Sadly no. I'm not in between four projects so I can't really work on this anymore but either it needs to be rebuilt from the ground up, or someone needs to talk with the Mongo team and get it wrapped in AsyncResource

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

jscherer92 commented 2 years ago

Just want to comment, not stale. Really two options: 1) new plugin that wraps higher level methods. 2) get Mongo team to use async resources so the wire protocol file is not completely detached from the async stack

dyladan commented 2 years ago

Yeah this should never be stale. I think the only way to reliably fix is to fix it on the mongodb side.

vmarchaud commented 2 years ago

They refactored their client anyway in mongo 4 (so i don't think we'll get anything in old releases), i'm not sure if the issue has been fixed with it ?

dyladan commented 2 years ago

@seblaz can you confirm if @vmarchaud is correct that this might have been fixed in mongo 4?

david-luna commented 12 months ago

Hi @seblaz @dyladan @pichlermarc

I think I've fixed this in a different issue https://github.com/open-telemetry/opentelemetry-js-contrib/issues/1688

Basically the connection pool was queueing the queries until a connection was available. Hence the context loss.

If you confirm is the same issue you can proceed to close it :)

david-luna commented 6 months ago

@seblaz

Could you confirm if the issue was fixed?

david-luna commented 5 months ago

Hi @seblaz

the fix mentioned in my previous comment applies to mongodb@4 and above. I've tested in your app by updatating OTEL dependencies and also mongodb package

--- a/app/package.json
+++ b/app/package.json
@@ -5,17 +5,16 @@
   "repository": "git@github.com:seblaz/opentelemetry-mongodb-bug.git",
   "author": "seblaz <sebastian.blazquez96@gmail.com>",
   "license": "MIT",
-  "private": false,
   "dependencies": {
-    "@opentelemetry/api": "^1.0.4",
-    "@opentelemetry/instrumentation": "^0.27.0",
-    "@opentelemetry/instrumentation-express": "^0.27.0",
-    "@opentelemetry/instrumentation-http": "^0.27.0",
-    "@opentelemetry/instrumentation-mongodb": "^0.27.0",
-    "@opentelemetry/sdk-trace-base": "^1.0.1",
-    "@opentelemetry/sdk-trace-node": "^1.0.1",
+    "@opentelemetry/api": "^1.8.0",
+    "@opentelemetry/instrumentation": "^0.51.0",
+    "@opentelemetry/instrumentation-express": "^0.38.0",
+    "@opentelemetry/instrumentation-http": "^0.51.0",
+    "@opentelemetry/instrumentation-mongodb": "^0.43.0",
+    "@opentelemetry/sdk-trace-base": "^1.24.0",
+    "@opentelemetry/sdk-trace-node": "^1.24.0",
     "express": "^4.17.2",
-    "mongodb": "3.7.3"
+    "mongodb": "4.0.0"
   },

This makes the parentIds to be diferent for each mongodb.find operation

2024-05-02 16:54:53   parentId: 'e64884eedf666df2',
2024-05-02 16:54:53   parentId: '5ac647daab2e51ee',
2024-05-02 16:54:53   parentId: '319d41c0b76d7695',
2024-05-02 16:54:53   parentId: '9860b516c8977cab',
2024-05-02 16:54:53   parentId: '39b934b0ce64f7ee',
2024-05-02 16:54:53   parentId: 'ba1628ba44c5a743',
2024-05-02 16:54:53   parentId: '9ade2ccf7fa82e47',
2024-05-02 16:54:53   parentId: 'd9569c018ec996d6',
2024-05-02 16:54:53   parentId: 'c95af5377ca3508f',
2024-05-02 16:54:53   parentId: '44ce6fae426bdc59',
2024-05-02 16:54:53   parentId: '69f6702f1fb4cc46',
2024-05-02 16:54:53   parentId: 'c449070d9b4e48c9',
2024-05-02 16:54:53   parentId: 'b3519a0b0ad64329',
2024-05-02 16:54:53   parentId: '2bd2bb4bcb5db079',
2024-05-02 16:54:53   parentId: '8cb7756941db84f5',

So I suggest to update the dependencies. Let me know if for some reason you cannot update mongodb

pichlermarc commented 3 months ago

@david-luna thanks for providing the fix for this. :slightly_smiling_face:
@seblaz if this is still a problem with a more current release please file a new issue so that we can investigate.