Refactor Session ID handling

This PR hardens the implementation to use random values instead of the request_id for the OIDC Nonce and client session access token. The major changes are:

Adds generateID() function from https://github.com/nginxinc/nginx-saml/blob/c3baf97d2c90e8a597803efd819095353881da83/saml_sp.js#L551 without modifications
Use generateID() to generate the token request Nonce value
Use generateID() to generate the client session token
Use a HMAC hash of the client session token as the keyval keys and log IDs to help protect against session theft.
Rotate the client session token each refresh

Rotating the client session token each refresh is the only real behavior change introduced. This will result in a Cookie set header being added to non-OIDC requests when the session token is refreshed. This change was made as a general security improvement, but mostly because it simplifies the implementation allowing for one keyval store update codepath for both new sessions and refreshes.

Other than the rotation this change should be a no-op behavior wise. I have made an effort to verify the major behaviors and it passes my testing.

Hi @ag-TJNII and thank you for your contribution to our project and for paying attention to security-related aspects. However, I must decline this approach for the following reasons:

Performance. Your solution significantly impacts performance by shifting from a model where NJS is only called for the OIDC flow to a model where NJS is invoked for every new request. This approach is slow. It's important to understand that NGINX creates and destroys an NJS VM for each HTTP request, which leads to a performance degradation of up to 50% even for the simplest scripts (like r.return).
You have shifted to using random values in NJS instead of using NGINX request_id. However, it is not clear why the request_id is inadequate or unsafe for this purpose.
Coding style and comments. Please adhere to the existing coding style. We don't use DocBlock comments and function descriptions should be minimal and only added where absolutely necessary.

You have shifted to using random values in NJS instead of using NGINX request_id. However, it is not clear why the request_id is inadequate or unsafe for this purpose.

This change is the core of this proposed PR. The main reason why I'm considering request_id to be unsafe is because the current implementation uses it as a client session token. In this implementation the request_id used in the initial OIDC request flow is the client session token, stored in the auth cookie, which is redeemable for a session.

While the request_id is random, it is a documented core variable which likely will be used outside the scope of this module. It's reasonable for client config to use the request ID for other purposes, like a request trace ID. Once the variable is used for other purposes it is exposed, which is a risk as the ID is directly redeemable for a client's session.

A decent example of this is https://github.com/nginxinc/nginx-openid-connect/blob/39334b6616690b652bad9fa5f0d3c72df8759ece/openid_connect.js#L193 in this repo, where we log the request ID as part of the OIDC flow. This means the logs contain valid user session tokens directly redeemable for client sessions. That is logging secrets and is dangerous. If a client's Nginx config uses the request ID in the same way it will have the same result.

OWASP recommendation against this behavior: https://cheatsheetseries.owasp.org/cheatsheets/Session_Management_Cheat_Sheet.html#logging-sessions-life-cycle-monitoring-creation-usage-and-destruction-of-session-ids

CWE on logging secrets: https://cwe.mitre.org/data/definitions/532.html

Before I address the other concerns I think we need agreement on this, as it is core to the proposed change.

Yes, I asked that question to ensure that we are on the same page regarding the understanding and nuances of security issues related to implementation.

In this implementation the request_id used in the initial OIDC request flow is the client session token, stored in the auth cookie, which is redeemable for a session.

This is correct, and it's worth clarifying that:

Using the request_id as a session identifier does meet the minimal security requirements from a cryptographic perspective, as it provides 128 bits of entropy. This is because we generate 16 bytes of cryptographically secure random data using OpenSSL's RAND_bytes() function.
It only pose a potential threat if the request_id is used elsewhere (as you rightly pointed out).

Regarding the logging issue, it's somewhat separate and does not directly relate to the method of generation or use of session cookies. Technically, we could log any confidential or potentially vulnerable data, such as the cookie_auth_token. It is up to the admin to understand the potential risks, and NGINX does not intervene in this. However, I agree that we can consider logging different information associated with a specific session.

Key points to consider:

The current NJS-based solution does not provide convenient troubleshooting methods. It is always essential to keep in mind how a customer can correlate a specific client with a session and logs for troubleshooting or debugging purposes.
As far as I remember, we cannot clear the key-val without using N+ API, meaning that significant changes to how session information is generated and stored in keyval could impact the memory allocated for the shared zone => after updating, clients may need to adjust their limits.
The implementation details of zone_sync in N+ impose some constraints on the shm zone. It's necessary to avoid frequent and multiple changes in the key-val store unless absolutely necessary.

Technically, we could log any confidential or potentially vulnerable data, such as the cookie_auth_token. It is up to the admin to understand the potential risks, and NGINX does not intervene in this.

Technically true, however in practice we shouldn't. This repo is advertised as a reference implementation, so I think the majority of admins expect this to be reasonably secure by default. I think logging valid session tokens which can be redeemed for a user's session with a simple cookie header would take the majority of admins by surprise. I know I was surprised when I discovered this, and my immediate thought wasn't "As an admin I need to secure this log", instead it was "if they're getting the security fundamentals of 'don't log passwords' wrong, what else does this repo get wrong". Not confidence inspiring. This can be mitigated by clearly documenting the logging behavior, but looking over the README I don't believe the session token logging behavior is documented. Correct me if I'm wrong.

The current NJS-based solution does not provide convenient troubleshooting methods. It is always essential to keep in mind how a customer can correlate a specific client with a session and logs for troubleshooting or debugging purposes.

This is a good point. I don't think changing from the request ID to a random value alone has a major changes on this topic, but the one-way function does. I'd like to table this and come back to it once we discuss the one-way cookie to keyval id function implementation, which also matters for the concern about using the njs js_set function in each request. Before we dig into that I'd like to wrap up the discussion on the change from request_id to a random value.

As far as I remember, we cannot clear the key-val without using N+ API, meaning that significant changes to how session information is generated and stored in keyval could impact the memory allocated for the shared zone => after updating, clients may need to adjust their limits. The implementation details of zone_sync in N+ impose some constraints on the shm zone. It's necessary to avoid frequent and multiple changes in the key-val store unless absolutely necessary.

So this is a good point, the session rotation behavior will result in increased zone memory usage as each rotation will result in a new key. The implementation will flush out the values, so using the 1h session / 8h refresh default TTLs, I wouldn't expect the refresh token zone memory usage to increase 8x overall due to the flushed values, but the memory usage for the keys by themselves will increase 8x.

I was on the fence about rotating the keys, I think you've just convinced me not to. I'll redo that change.

For general memory usage I think the simplest answer is just make the new random value the same length as the old key. What is your opinion on making that a configuration option with a reverse compatible default, so that admins are not surprised on upgrade but advanced users can increase their session ID entropy if they desire/require?

Most old-school admins understand the NGINX security model quite well, which is based on the Unix approach to rights separation. In simple terms, there are two entities - the admin and everyone else. The admin can do anything; everyone else, nothing. NGINX's model lacks RBAC, and as an admin, you should understand this and make appropriate adjustments in your security lifecycle. By default, keyvals are writable and readable only by the NGINX user, and the error.log can only be read by the NGINX user and group. This njs-based implementation is based on the NGINX approach. If you've come to the conclusion that "As an admin I need to secure this log," then yes, you have bigger problems not just with OIDC.

Let me reiterate:

We don't log passwords; however, we do log session cookies.
The error.log file, where this message is written, by default should only be readable by users with exclusive rights.
I'm not trying to say that this improvement is meaningless; however, in the proposed approach, you must consider such an important aspect as troubleshooting.

Before we dig into that I'd like to wrap up the discussion on the change from request_id to a random value.

Yes, I see no harm in this improvement.

What is your opinion on making that a configuration option with a reverse compatible default, so that admins are not surprised on upgrade but advanced users can increase their session ID entropy if they desire/require?

This is a good question. We planned to incorporate such a possibility in the SAML implementation but never made it a separate option. Personally, I think it's a good idea because it offers flexibility. However, I'd only include it in the documentation and not expose it in the openid_connect_configuration.conf.

We don't log passwords; however, we do log session cookies.

Yes, and this is the crux of my concern. The session cookie isn't a password, but for as long as the session is valid it is functionally equal to a password. If a user sets up a valid session then the contents of that cookie can be used to access the protected service as easily as if the attacker had their password. Even easier, frankly, as modern username/passwords should also need MFA, but the session cookie has already gotten past that flow. If I have a user's session cookie then impersonating them is as simple as adding a Cookie header to my HTTP requests.

Not only are these cookies logged, they're also available via the Nginx API and on disk. I did not set a umask on the process (as I assume most won't) and I see the file is world-readable:

# ls -l
total 12
drwx------ 3 nginx root    15 Apr 16 18:43 jwk_cache
-rw-r--r-- 1 nginx nginx 2426 Apr 17 10:03 oidc_access_tokens.json
-rw-r--r-- 1 nginx nginx 1716 Apr 17 10:03 oidc_id_tokens.json
-rw-r--r-- 1 nginx nginx 1228 Apr 17 10:03 refresh_tokens.json
# cat oidc_id_tokens.json
{"HyIna4J_f-YpdkuBH6bqnVvpq9powq1GOq2pPEEcgJ0":{"value":"[Snip real token]","expire":1713351823183},"Gw0XZRXEtCZ3vm4RoYOO7Ppg5eWSCsDhzASEU2p2Gw4":{"value":"-","expire":1713351823183}}

So if an attacker can read this file with the current master branch code they also have values they can immediately use to impersonate a user, with no safeguards. The current README doesn't document that the keyval store spool directory needs to be only accessible by the Nginx user, and the default config puts it in the conf directory. Should the admin figure this out? Arguably yes, but we should provide a default config that doesn't rely on the deploying engineer being on-the-ball and fully understanding the nuance and security of the underlying implementation. "Secure out of the box" is a reasonable expectation nowadays.

The error.log file, where this message is written, by default should only be readable by users with exclusive rights.

It's common practice now to run application logs into log aggregation systems for analysis. These logs will commonly be accessible by security, engineering, and development teams. This has been my experience as a professional Linux operations engineer for over 10 years now. The days of the error log being a single protected file on the host, honestly was never a thing as syslog based log aggregation has been a standard since before I started my career. It's a reasonable expectation that logs will not contain secrets or credentials, especially for a web server like Nginx. Asserting that the error log must be protected so that we can log valid session cookies into it is, frankly, not reasonable and one that I think would surprise the majority of users.

And I'm not just being paranoid, this kind of attack is gaining popularity:

The 2023 Okta break was done by stealing session cookies out of HAR files: https://www.techtarget.com/searchsecurity/news/366558292/Okta-breach-led-to-hijacked-sessions-for-5-customers
The 2023 Microsoft breach involved session token theft: https://www.cisa.gov/sites/default/files/2024-04/CSRB_Review_of_the_Summer_2023_MEO_Intrusion_Final_508c.pdf (pg 7)

The CWE on this topic flatly says "Do not write secrets into the log files." Not protect the log, not limit access to the log, "Do not." https://cwe.mitre.org/data/definitions/532.html

Also, currently the only safeguard available in this implementation against session hijacking sending the logs to a SIEM for analysis to detect usage pattern. This data stream should arguably include the error log, as that's where this implementation logs that it's created new tokens and accepted logins. We shouldn't assume the error log is secure, there are many common and valid cases where it will be centrally stored and available for analysis.

I'm not trying to say that this improvement is meaningless; however, in the proposed approach, you must consider such an important aspect as troubleshooting.

I assert that logging the session cookie in the name of debugging is like cutting a hole in the side of the building in the name of accessibility. Yes, it will be very easy to get in and out, but there are a lot of negatives that come with it that outweigh the positives. It's an option, but there are others that are likely better.

And to be clear I'm absolutely not dismissing the troubleshooting concern. That is valid and will need to be addressed. I don't think logging the actual cookie contents is the correct pattern, however. But I'd like to come back to that once we agree on how the session will be handled, as how to troubleshoot is implementation specific. I have some ideas on how to mitigate/resolve this concern.

So with that this is actually a good transition into discussing the cookie value hashing I implemented in https://github.com/nginxinc/nginx-openid-connect/pull/89/files#diff-1dfe8897de6fda3d50975bb6eb32ef0b776015cb9ed38f058ed05c3f3a70d90aR281-R293, as it's highly related to these points. The core assertions I made when implementing this is:

The cookie value is a user's credential into the system, and is even more sensitive than their password as it has already passed any MFA and behavior checks.
The user's cookie value must never be logged to avoid session theft.
The user's cookie value must not be accessible via APIs or HTTP endpoints.
The user's cookie value must not be stored or replicated to avoid session theft.

That last point is a hard one, for obvious reasons. In the current implementation we store the user's session cookie and use it as a keyval key for the keyval zone stores. This is a clean, fast, and simple solution, but now the user's session tokens are stored to disk and transferred over the wire if sync is enabled. If a operator sets up a sync server block without TLS, which is clearly a misconfiguration but one easy to do, now the user's session tokens are going over the wire in the clear. If the admin enables the Nginx API, which is on by default in the example, the user's session cookies are available via a simple HTTP GET. This is the problem the generateKeyValID() function is meant to resolve. Rather than use the session cookie value directly for the keyval store keys, which opens up all these potential attack vectors, I ran it through generateKeyValID, which is a one-way cryptographic function (SHA256 HMAC), and then used that as the keyval store key. This implementation changes the attack surface as such:

The session cookie value is only accessible in the user's request context, it isn't persisted
The session cookie value isn't stored to disk or available via the Nginx API
The keyValID value is generated uniquely from the session value, but the keyValID cannot be reversed back to the session cookie valie
The keyValID value is not redeemable for a session, and as such doesn't need to be treated as sensitive.
The keyValID value can be logged, can be exposed via the API, can be persisted to disk, and can be sent over the network. These things all should still be protected, but now if the server is misconfigured and they aren't valid session tokens aren't in the clear.

That's the other, and honestly larger, crux of this change. By using a one-way function on the cookie value and using that as the keyval key we mitigate most of these threats by using a key that's not redeemable for a session.

This implementation requires the use of js_set for each call. I don't think there is a way around that. If there is a one-way function available in directly in Nginx config let me know, when I went looking I didn't find one. I haven't done any benchmarking so I cannot speak to performance implications.

This implementation requires the use of js_set for each call. I don't think there is a way around that. If there is a one-way function available in directly in Nginx config let me know, when I went looking I didn't find one. I haven't done any benchmarking so I cannot speak to performance implications.

That is why this implementation path was chosen initially. I don’t know if all the listed improvements can be implemented without njs or additional modules.

I changed the default paths for the state file to align with the approach we use for SAML. (see PR #90)

I changed the default paths for the state file to align with the approach we use for SAML. (see PR #90)

Great, that's the same path I settled on myself so glad to see I picked right. I checked the permissions on that directory yesterday in auditing my config and it was 0o750 from the installer, so that path should be good.

I don’t know if all the listed improvements can be implemented without njs or additional modules.

So I haven't benchmarked, I guess that's the next step as we're concerned about the njs performance. I'm currently of the opinion that the security benefits are worth the potential performance issues, but the current app I'm protecting isn't performance sensitive. Since I rolled back the client session cookie rotation the internal store sets are now basically unchanged, and the njs_set is in the openid_connect_configuration.conf file which is config. I'm thinking it would be best to make the new pattern or the old pattern selectable with good documentation. If users prefer the security proposed here they can use the njs_set method. If they prefer the performance and accept the risks, they can use the cookie value directly. Implementation wise I'm thinking this would entail a couple more config maps to select the hash or the bare client session ID, but I haven't tested yet.

At this point I see two paths forward on the client session ID handling/hashing proposal:

1) We accept the js_set performance loss in the name of security. This is my preference but I openly admit to being ignorant on how bad of a performance hit this will be, I have been testing this in a real deploy but that deploy has other performance bottlenecks unrelated to the OIDC implementation. I'm willing to benchmark and come back with real numbers but that will take a couple days. 2) We make it selectable, via config maps. I think this is a good compromise, documents that we've thought about this, and leaves it up to the user. If we're open to this option I'll work on implementing it. This will also take a couple days.

Let me know how you're leaning so I don't put effort into the wrong direction. Also if there is a config native way to do a one-way function let me know and I'll pivot to that, I'm working under the impression js_set is the only available implementation option. Thanks.

nginxinc / nginx-openid-connect

Refactor Session ID handling #89