mongodb / mongo-php-library

The Official MongoDB PHP library
https://mongodb.com/docs/php-library/current/
Apache License 2.0
1.6k stars 262 forks source link

Is it always safe to ignore errors starting implicit change stream sessions? #611

Closed timwhitlock closed 5 years ago

timwhitlock commented 5 years ago

This is just a question really. Regarding the ignoring of this exception when starting an implicit change stream session.

I was experimenting with change streams on a hidden replica set member and an exception with code 11002 ("Server does not support sessions") is caught and ignored here.

So I suppose the question is whether all exceptions can be ignored. The note says "We can ignore the exception ... and there is no risk of a mismatch". Is that the case in my example, or should it object to what I'm doing? My understanding of sessions in this context isn't sufficient to know if I'm doing something wrong.

For the record I've no particular reason for connecting to a hidden member, other than I didn't want to mess around on an active server while testing.

jmikola commented 5 years ago

I was able to reproduce this with a simple two-node replica set (one primary and one hidden secondary) and I believe you're hitting an edge case that can safely be ignored. Read on for more details.

The exception you're seeing originates from _mongoc_topology_pop_server_session() in libmongoc. Before constructing a session, we check that the topology (i.e. server(s) to which we're connected) supports sessions by examining the value of a cached session_timeout_minutes field on the topology's struct. If this field is unset, libmongoc makes an additional attempt to select a server before giving up. This handles the case where the application is attempting to create a session immediately after constructing a client but before performing any other operation that would initialize sockets (noting that Manager and Client constructor perform no IO on their own).

The SDAM machinery responsible for monitoring the servers/topology checks for a logicalSessionTimeoutMinutes field in each server's isMaster response to determine if the sessions are supported. If such a field is encountered, mongoc_server_description_handle_ismaster() updates session_timeout_minutes for the server description; however, the topology record is not updated yet.

Even though the hidden replica set member's response does include this field, the SDAM spec requires that we ignore it because it is not a data-bearing server type. Data-bearing nodes are those that the application can read/write to under normal operation (i.e. mongos, standalone, primary, or secondary). We can see this implemented in _mongoc_topology_description_update_session_timeout() where the iteration skips servers where _is_data_node() is false.

Therefore, session_timeout_minutes on the corresponding topology struct is never updated and libmongoc is left to assume that sessions are not supported by the topology at runtime. This leads to the exception you're seeing.

Under normal circumstances, applications would never communicate with a hidden member. A hidden member in a URI can be used for initial discovery (e.g. finding a primary and other secondaries), but it will thereafter be removed from the list of known servers. The edge case here is that you're connecting to the node as if it were a standalone and actually querying it. I can propose that we amend the SDAM spec to reconsider a hidden replica set member a data-bearing node iff the driver is connecting to it as a standalone, although this is likely to be a low-priority fix even if I can convince others that this is worthwhile. In any event, I'll create an internal ticket (our JIRA for spec issues is private) and follow up once there's something to report.

So I suppose the question is whether all exceptions can be ignored. The note says "We can ignore the exception ... and there is no risk of a mismatch". Is that the case in my example, or should it object to what I'm doing? My understanding of sessions in this context isn't sufficient to know if I'm doing something wrong.

I can confirm that the comment still applies here. The whole point of "implicit from the user's perspective" in PHPLIB-342 is that because of the disconnect between libmongoc and PHPLIB, change streams cannot simply rely on libmongoc to provide an implicit session in the absence of an explicit session provided by the application. Therefore, PHPLIB creates its own to ensure we can keep the same session for the first attempt and any subsequent resume ops.

Since libmongoc has determined that sessions are not supported, it will never provide its own implicit session in the absence of the one PHPLIB creates in the Watch operation. Therefore, we don't have to worry about a different session getting used for a resume op. The only downside to all of this is that none of the operations executed on a hidden member (change streams or otherwise) will ever use a session.

timwhitlock commented 5 years ago

Thank you very much for yet another thorough reply.

I appreciate that it's probably not normal to connect to a hidden member as a standalone. Although I had no pertinent reason in this example, it may be worth noting a possible use case I've been considering.

My interest in Change Streams is related to an auditing system I'm adding to an existing application. This would require a permanently running script logging all relevant data changes. As this could be resource intensive I was thinking of connecting to a hidden member to do it.

jmikola commented 5 years ago

Initial internal response to the proposed change for the SDAM spec was positive, so I think there's a good chance it (and drivers) ultimately get changed to support this use case.

As this could be resource intensive I was thinking of connecting to a hidden member to do it.

For the time being (and even down the line), I think a good solution would be to replace the hidden node with a tagged secondary. You could also keep the secondary's priority at zero to ensure it is never elected. This would allow your monitoring script to use the same URI as your application and simply select the desired node with a combination of "secondary" and the appropriate tag set (see: ReadPreference::__construct()).

If you only plan on trailing a single change stream (presumably watching cluster or database-wide changes), I think it'd be reasonable to start with this approach and keep an eye on performance. SERVER-32946 comes to mind as a recent ticket opened about change stream performance where the actual culprit wasn't the change stream API itself but rather a particular driver's (not PHP) connection/socket behavior. That ticket also involved an application opening many change streams.

timwhitlock commented 5 years ago

Thanks for the tip regarding tags. I'll look into that.

I was planning on trailing a single, cluster-wide stream, but will split into DB streams if I encounter issues. I'll be ramping up this new feature very slowly indeed.

jmikola commented 5 years ago

Please feel free to open another issue down the line if you have any questions. I think everything in this thread has been addressed so I'll close it out.

Any future updates to the SDAM spec will show up as a PR in the mongodb/specifications repo.