Closed redbaron closed 11 months ago
I had a look at memguard
and it made me question value it provides. My understanding is:
memguard
as ismemguard
encrypts stored values in memoryLockedBuffer
, which is placed on mlock
ed page to prevent it leaking to swap and mprotect
s pages around it with no access to protect reading/writing it via buffer overrunsLockedBuffer
, resolves (EDIT: isn't this branch a dead code because by the time plugin calls Get()
value is fully resolved?) any refrerences to secret stores and mlock
s fully resolved plaintext value.If above is correct, then I'd say use of memguard
makes it no better comparing to simply storing plaintext values in memory, because:
memguard
?memguard
after linking.
memguard
enclave, but config TOML whenever it is coming from is freely available to the telegraf process and it contains secret in the plaintext already, so scrambling it in memory adds no value.mlock
of the page with the resolved secret value to prevent it from going to swap on disk.
output.influxdb_v2
, inputs.posgresql/postgresql_extensible
and outputs.amqp
, all take plaintext resolved secret []byte
, make new string
value and store it eslewhere for future use. I suspect majority of the plugins do the same, because they are built on top of various protocol libraries, often third party, which can't take LockBuffer
as an input and plugin authors have no choice, but to provide simple string
with secret data in it. As a result fully resolved plaintext data is constantly present in memory anyway.@redbaron,
Thanks for the issues around secret store. I want to sync with Sven and our security team next week. I think your improvements will certainly make for a better overall experience, but want to get us all on the same page.
@redbaron first of all, thank you for the comments and your interest in the topic! Let me try to address your questions/comments. Please feel free to ask follow-up questions or ask about points I did not address!
If above is correct, then I'd say use of memguard makes it no better comparing to simply storing plaintext values in memory, because:
- Secret data is stored in memguard after linking. Argument can be made that const string without any references can be a secret by itself and it is worth sealing it in a memguard enclave, but config TOML whenever it is coming from is freely available to the telegraf process and it contains secret in the plaintext already, so scrambling it in memory adds no value.
That is not the complete picture. The constant-string secret can result from a string in the TOML config option (as you mention), but it can also result from an environment variable as well as a secret-store, so it is not necessarily readable from the config file.
Furthermore, your comment (at least to me) implies that we want to protect against a malicious part within the Telegraf process. This is not the thread-model nor the goal nor do we claim that somewhere. The goal is to protect sensitive data "as-good-as possible" against other processes or users apart from the telegraf user and root. One part is to not store secrets on-disk in an unprotected fashion. This is what memguard does, it encrypts data marked as sensitive (i.e. having type config.Secret
) in memory "at-rest" to prevent read-out when being swapped out. When accessing the secret, we try to protect it "as-long-as possible" by locking the memory thus preventing it from being swapped out unencrypted.
Another part is to be able to access external authentication systems to allow a central management of credentials (e.g. oauth2 tokens etc). This is covered by the secret-store plugins which, as a goodie, now can also provide "dynamic" secrets that can change over time (as necessary by e.g. tokens). This not directly has something to do with memguard
but those features closely link to the config.Secret
implementation.
As you can see by the formulation, this is a best-effort approach, i.e. we try to protect secrets until they enter plugins. Now the plugin has the possibility to minimize the leak-surface (e.g. look at the radius
input plugin). For many plugins this is not possible as used external packages pass credentials as string
instead of using a byte-slice. However, there is only little Telegraf can do. But I do not agree with the idea to not do the best we can.
I'm also aware that the current state is far from being perfect:
memguard
is currently making excessive use of locked pages. I did put up a PR on memguard to mitigate this and greatly reduce (e.g. by a factor of >50x for typical secrets) the use of locked pages, but I did not receive feedback yet.Despite those points, I still think offering additional means to protect sensitive data (i.e. kicking the can further down the street) is worth the effort.
- only protection whole secret subsystem offers at the end is mlock of the page with the resolved secret value to prevent it from going to swap on disk. container orchestrators like Kubernetes, Nomad or Amazon ECS run with no swap enabled by default environments with swap enabled and additional security requirements must be encrypting their disks anyway, including disk where swap is. Cloud environments provide block storage encryption at rest by default.
I think you miss the fact that memguard also protects the sensitive data in memory. The plain secret is only accessible in a small time-span between Get
and Release
of the secret. The remaining time (which should be "most of the time") the secret is stored in an encrypted fashion. Yes, if you get hold of the enclave key you can decrypt that memory portion, but I hope you agree that this at least raises the hurdle.
Furthermore, Telegraf is not only installed in the cloud environments you mention! Asking users to disable swap for every machine hosting a Telegraf instance is probably not a good idea, wouldn't you agree?
Eventually secret data leaves secret subsystem and enters plugin. I didn't check all, but spot checked output.influxdb_v2, inputs.posgresql/postgresql_extensible and outputs.amqp, all take plaintext resolved secret []byte, make new string value and store it eslewhere for future use. I suspect majority of the plugins do the same, because they are built on top of various protocol libraries, often third party, which can't take LockBuffer as an input and plugin authors have no choice, but to provide simple string with secret data in it. As a result fully resolved plaintext data is constantly present in memory anyway.
As I wrote before, I don't think this is a argument against offering the possibility for plugins to do better! The radius input plugin for example uses a byte-slice and thus minimizes the potential leakage (there can still be copy operations inside the underlying lib). So I think the "hey everybody is insecure, so why bother" argumentation is not something we should follow.
However, for me a valid point would be the use of Telegraf in environments that either do not offer memory locking (I don't think there is any non-exotic) or the use in enviroments where you cannot control the locked-page limit. As you outlined this is the case for at-least Kubernetes (btw, how much is your limit and how many "secrets" do you have typically?) and thus I do see your point. This being said, we should think of some way to allow the user to bypass memguard explicitly (e.g. with a --insecure
or --unprotected
command-line option) instead of removing it! To achieve this, we can abstract the memguard mechanisms/interface and add a non-locked memory alternative. Would that be acceptable for you @redbaron?
Anyway, I want to discuss this within the team as it would potentially increase the maintenance effort...
@srebhan ,
Thanks for detailed response. Let me state that my main goal is to run upstream version of Telegraf in our environment and whatever approach takes us there is fine by me. --unprotected[-secret-store]
command line flag works around lack of Kubernetes ulimit
controls nicely as does your patches to memguard to use locked pages for multiple secrets. Currently our biggest Telegraf setup is ~200 secrets with just 64KB of lockable pages GKE/EKS give us.
Having said that , if threat model is to protect in-memory data from external non-root access, then Linux already solves it as-is:
CAP_SYS_PTRACE
capability enabled for the reading process. This include "plaintext secrets" which are just memory region where all environment variables are (**environ
libc pointer) as well as "resolvable" secrets after they were resolved.mprotect
ing guardian pages wont help against memory scans, because signal handlers are controlled by tracing process and SIGSEGV
generated when accessing protected page can be ignored. It has some value, as it can leave some "breadcrumbs" when unosphisticated/unprepared tracer hits it first and causes crash.mlock
can add some value.As a consequence of the above, I still maintain view that scrambling adds no value for the threats you protecting from, mprotect
is slighly more and mlock
is most valuable.
The way I understandmemguard
s main strengh is scrambling, rest of it is are higly inefficient and wastefull, so maybe dropping memguard
completely and use your own allocator to store secrets packed on mlock
ed pages surrounded by mprotect
ed guards is an option? This way you'll drop third party dependency, take feature ownerhsip entirely in-house and simplify code overall. You've done most of the work already in that memguard
PR.
As for "plaintext secrets" (#13807), all process environment variables are stored in a single continuous memory location which is swappable, mlock
ing copy of them is elsewhere has no benefit IMHO and secrets without any references to secret store should be treated as simple strings.
Interesting, it seems to be possible to cast []byte
to string
without copying:
https://cs.opensource.google/go/go/+/refs/tags/go1.21.0:src/strings/builder.go;l=49
so if new GetString()
was added to config.Secret
, then plugin could use mlock
ed value
This was a nice conversation, thanks!
As mentioned elsewhere, the use of mlock is really about hardening rather than meant as a full, security access control. I tend to think of it mostly in terms of a way to minimize swapping out the decrypted credential since the user went through all the trouble of setting up a secret store to avoid storing it on disk in the clear. The least we can do is honor that and do best effort to prevent it hitting the disk in the clear during runtime (not all swap is encrypted). It's understood that the current implementation is imperfect (which is why I think of it as "hardening for" as opposed to "protection against") and that some plugins are more effectively hardened than others. We've talked about how to improve the situation for plugins where the hardening is less effective, but in some cases that requires somewhat deep architectural changes in how plugins are invoked and other times, vendoring libraries and modifying them to handle the secrets better. Both have (sometimes significant) tradeoffs. In general, I've advocated for being transparent about the limitations and perhaps part of this issue is that we weren't transparent enough about the hardening....
All that said, we do want reasonable defaults in telegraf and defaulting to additional hardening makes sense for the general case, but your use case of a constrained environment without swap is a clear use case where the default may not be desirable. We could try to address this in several ways: a. vendor the library and apply @srebhan's changes b. interrogate the system to determine if it has enough resources and disable mlock if not (with logging) c. interrogate the system and see if swap is enabled and disable mlock if not (with logging) d. as suggested, add a config option to disable it
We've discussed 'a' before and decided that we wanted upstream comment before applying. We've discussed 'b' before and found it undesirable since the hardening measure might be removed without the user realizing it. 'c' is interesting to think about and would solve this particular case, but it feels a bit like we're being too opinionated.
I think 'd' is the right choice here so long as we make it clear what the option is doing. I suggest the following:
disable_mlock
, defaulting to false
. If it is an explicit command line option, perhaps --disable-secretstore-mlock
. Feel free to adjust for telegraf conventions<option name here>
option disables this, which might be useful in, eg, constrained environments without swap." Wordsmith as desired.@jdstrand , I agree with adding a switch to disable mlock.
There is one thing you didn't cover in your respnonse though. You said:
I tend to think of it mostly in terms of a way to minimize swapping out the decrypted credential since the user went through all the trouble of setting up a secret store to avoid storing it on disk in the clear.
Do you agree, that if config option doesn't contain references to a secret store, then whole value can be treated as "not a secret" and therefore bypass any hardening? This is what https://github.com/influxdata/telegraf/pull/13812 implements, where it detects values with no references and store them as simple go []byte
array.
Do you agree, that if config option doesn't contain references to a secret store, then whole value can be treated as "not a secret" and therefore bypass any hardening? This is what https://github.com/influxdata/telegraf/pull/13812 implements, where it detects values with no references and store them as simple go []byte array.
I'll answer more generally: conceptually, sure, if it isn't a secret then we don't need to mlock it. However @srebhan said this: "That is not the complete picture. The constant-string secret can result from a string in the TOML config option (as you mention), but it can also result from an environment variable as well as a secret-store, so it is not necessarily readable from the config file". ISTR it was an active decision to harden environment variable handling too. I defer to @srebhan on the historic details.
but it can also result from an environment variable
Right and all environment variables for any Linux process are in swappable memory already, by protecting copy of it in Go heap we don't prevent leakage through swap.
as well as a secret-store
I don't understand, constant string I am talking about is value specified in TOML without any references to secret stores.
@redbaron my suggestion is that we look into how much effort it is to support both a locked-page setup (default) as well as a non-locked page setup (enabled via explicit option). Let me be clear: the switch will be explicit, e.g. via command-line option, and not dependent on if the "secret" links to a secret-store or not. If we can support both without too much effort and branching, I'm inclined to give this a try.
However, I would need your help to test the setup thoroughly. Are you willing to help me there?
@redbaron any comment to my suggestion above?
@srebhan , yes I can help with testing. I do think that if locked (non swappable pages) is all what you want for secrets, then using mprotect
is overkill and will slow down implementing this feature though.
Use Case
memguard
as a backend for secret store requires lockable memory. Kubernetes doesn't have native way to configure RLIMIT_MEMLOCK and for that reason it is hard to run Telegraf with many "secret-capable" config options, even if they don't contain references to the secret store.I didn't dig deep what value
memguard
provides, but I assume locked memory is used so that plaintext value is not leaked to swap. Kubernetes nodes run in swap-less mode and if "secret-capable" config option doesn't contain any secret references then using locked memory has no benefit.Consider adding a switch to disable use of locked memory so that it can be enabled in environments where it is safe to do so.
Expected behavior
add agent level option
secretstore_memlock
with default valuetrue
. When set tofalse
disable use of locked memory bymemguard
Actual behavior
Locked memory requirement is enforced on Linux even if on other OSes
memguard
can run without it.Additional info
No response