microsoft / Windows-Containers

Welcome to our Windows Containers GitHub community! Ask questions, report bugs, and suggest features -- let's work together.
MIT License
395 stars 60 forks source link

Key Storage Provider registration leak in process isolation #263

Closed fschmied closed 11 months ago

fschmied commented 2 years ago

Describe the bug When registering a Key Storage Provider (KSP) via BCryptRegisterProvider inside a Windows container based on the Windows Server Core LTSC 2019 image in process isolation (on a Windows Server 2019 host machine), the KSP registration leaks:

To Reproduce

  1. Create a Windows Server 2019 VM (e.g., an Azure VM with the 2019-datacenter-gensecond image).
  2. Install the Mirantis Container Runtime and reboot the VM.
  3. Build a container image that contains a custom KSP DLL and uses BCryptRegisterProvider to register it.
  4. List KSPs on the host machine (e.g., using certutil -csplist) and notice that the KSP in question is not registered.
  5. Run a container based on the built image in process isolation. The BCryptRegisterProvider call will succeed. Remove the container.
  6. Run a second container (also in process isolation). The BCryptRegisterProvider call will fail with error code 0xC0000035 (-1073741771, STATUS_OBJECT_NAME_COLLISION). Remove the container.
  7. List KSPs on the host machine again and notice that the KSP in question is now registered on the host machine (but the provider DLL cannot be found).

I can provide a full repro sample (i.e., a custom KSP and a Dockerfile to build the container), but cannot post it here. Ping me if you need it.

Expected behavior I'd expect process isolation to isolate the BCryptRegisterProvider call in step 5. Therefore, the second container should not have the KSP registration (in step 6). The host should not have it either (in step 7).

Configuration:

Additional context A hint that process isolation is broken has already come up in https://github.com/microsoft/Windows-Containers/issues/89#issuecomment-771115708. The whole issue #89 might be related as it's also about KSP configuration not playing well with containers.

We'd need this to be fixed on Windows Server 2019 LTSC as our customers are using that OS version.

michbern-ms commented 2 years ago

@fschmied Thank you for the clear repro! I agree with your thought that this is related to #89. We already have engineering work in progress to remedy #89 , so once that work is complete, we can ascertain whether it also fixes this issue or whether this issue is now.

For my education: what is the consequence of the leaking KSP? On #89, if the KSP doesn't stay registered, it clearly causes a problem for the app that wants to use it. In this case, we have an extra KSP in container #2. Is there a way to write the registration logic to say, "if the KSP is not already registered, register it?"

fschmied commented 2 years ago

@michbern-ms It's great to hear the team have already started working on #89, so hopefully, they will be able to address this issue here as well. Have you any indication of how long it might take for them to know if they've also got our problem here covered? We were considering also reporting it via Microsoft PSS, but if you're already actively working on it, that would probably not be necessary, I guess?

For my education: what is the consequence of the leaking KSP? On https://github.com/microsoft/Windows-Containers/issues/89, if the KSP doesn't stay registered, it clearly causes a problem for the app that wants to use it. In this case, we have an extra KSP in container https://github.com/microsoft/Windows-Containers/issues/2. Is there a way to write the registration logic to say, "if the KSP is not already registered, register it?"

So far, we've seen two consequences:

  1. The KSP registration leaks into other containers running on the same host as well as to the host itself.
  2. The KSP registration process seems to also influence other KSPs on the host.

For the first item, we've so far been able to work around in a similar way as you've suggested: if already registered, ignore the error code. That seems to work if all the registration details (KSP aliases, provider DLL name) are the same.

However, we've not been able to work around the second item above as easily. In that case, a different KSP suddenly stopped working correctly on the host after we'd installed our KSP inside the container. In particular, CSP to KSP "lightup" stopped working.

Lighup is a process where a reference to a CSP (Crypto Service Provider) is "translated" into a reference to a compatible KSP by means of KSP aliases (see, for example, this link for an implementation). So, after we'd installed our own KSP inside a container, we saw that lightup with the SafeNet KSP stopped working on the host until that host was restarted. (We weren't able to investigate more deeply; it's possible that the SafeNet KSP stopped working entirely, not just with regards to lightup.)

So far, the only way we've found to work around this was by switching to Hyper-V isolation. However, this won't be possible if our container images are run in a Kubernetes setup, for example, where only process isolation is supported.

michbern-ms commented 2 years ago

@fschmied Thank you for the additional detail, that is very helpful. I just communicated that detail to the engineering team. I always hesitate to estimate on behalf of other engineering teams, but I'm hoping to hear back from them in a couple of weeks.

ghost commented 1 year ago

This issue has been open for 30 days with no updates. , please provide an update or close this issue.

fschmied commented 1 year ago

@michbern-ms I guess it's still too early for an update, right?

michbern-ms commented 1 year ago

@fschmied Never hurts to ask. :-) The good news is, the engineering team is confident that your issue is covered as well as #89. The bad news is, the fix is a large amount of enlightenment for the whole crypto configuration system and that work is taking some time to stabilize. I sync with that team every week. It's not very satisfying to come back and say, "we're working on it,", but well, that's the truth. I appreciate your following up to see how the work is going.

ghost commented 1 year ago

This issue has been open for 30 days with no updates. , please provide an update or close this issue.

fschmied commented 1 year ago

Just replying to tell the bot we're still interested in this issue :) .

fady-azmy-msft commented 1 year ago

Commenting for the bot; we're still working on this and don't have an update to share yet.

fschmied commented 1 year ago

@fady-azmy-msft Thanks for the update - on behalf of the bot and of myself 🙂.

ghost commented 1 year ago

This issue has been open for 30 days with no updates. , please provide an update or close this issue.

fschmied commented 1 year ago

@fady-azmy-msft @michbern-ms Any news to share? 🙂

michbern-ms commented 1 year ago

I checked the issue status this morning and one of the key crypto developers completed a PR into a feature branch for this issue. It is indeed a high impact PR. The next step is to confirm that the scenario documented here is working properly in pre-release builds.

fschmied commented 1 year ago

Cool, thanks for the update! We keep running into this, so I'm happy to hear about the PR 🙂 .

microsoft-github-policy-service[bot] commented 1 year ago

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

microsoft-github-policy-service[bot] commented 1 year ago

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

fschmied commented 1 year ago

@michbern-ms Anything new?

michbern-ms commented 1 year ago

@fschmied Good question. Due to the size of the code change, we've been doing a bunch of internal flighting at Microsoft of the changes, now running across hundreds of internal Windows users. I think we're just about done with that, so the next step is to port the work to the Windows Server 2022 servicing branches.

I'm often inclined to apologize for how long this takes, but with >1 billion Windows customers out there, it does make sense for us to take appropriate care with big changes like this one. So, please stay with us. I review the status on this issue every single week!

fschmied commented 1 year ago

Thanks for the update! Do you know whether the change will also be ported to Windows Server 2019?

michbern-ms commented 1 year ago

I am not sure - we're still assessing the risk there. The older the OS, the more challenging it is to adapt newer code changes to it.

microsoft-github-policy-service[bot] commented 1 year ago

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

microsoft-github-policy-service[bot] commented 1 year ago

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

microsoft-github-policy-service[bot] commented 1 year ago

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

microsoft-github-policy-service[bot] commented 1 year ago

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

michbern-ms commented 1 year ago

@fschmied I was hoping to be able to reply with a concrete date, but the servicing teams have been slammed with security issues and backports. Still, I did confirm that we're backporting the change to both Server 2019 and Server 2022. Thanks for your patience: I promise, I check on the status of this every week.

fschmied commented 1 year ago

Thanks for the update, the backport for 2019 is very good news for us!

microsoft-github-policy-service[bot] commented 1 year ago

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

ulrichb commented 1 year ago

Dear beloved bot,

please keep this ticket open as it's still in progress ...

michbern-ms commented 1 year ago

Yeah, the beloved bot is rubbing salt in the wound. I can confirm that the backport is still in plan for Server 2019 and Server 2022; we're waiting for our "launch window", so to speak. I really do poke this every week; I will be as happy as anyone to get this fix done!

microsoft-github-policy-service[bot] commented 1 year ago

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

l3hgs-brent commented 12 months ago

We're experiencing an issue which is likely related to this with the DigiCert(R) KSP in a Docker container using process isolation. Specifically, the registered KSPs do not persist.

Our unattended code-signing process uses Windows 10 for both the Docker host and container. I only see Windows Server mentioned both here and in issue 89. Are there any plans to also provide updates for Windows 10?

fschmied commented 11 months ago

@michbern-ms I guess https://github.com/microsoft/Windows-Containers/issues/89#issuecomment-1716555803 also applies to this issue here and the KSP registration leak is fixed in the servicing releases from September 12, 2023?

The comment says there is a problem with the Venafi KSP, is there a risk that the same problem will manifest with other KSPs too once we install the servicing releases? I.e., might our containers stop working if they install a KSP? (With Azure Kubernetes Services, we don't control the installation of servicing releases on the nodes, so this might pose a big risk for us.)

michbern-ms commented 11 months ago

@fschmied Yes, I expect the KSP registration leak to be fixed by the servicing releases from 9/12/2023.

The issue that manifests with the Venafi KSP is independent of the crypto-config separation work. (i.e. we can reproduce the issue both before and after the September servicing releases.) So I don't have any reason to think the new servicing release will have any negative impact on a KSP that is already functioning properly.

Thanks, Michael

fschmied commented 11 months ago

Okay, that's a relief. Thanks for the transparent communication!

fschmied commented 11 months ago

We've now tested the behavior after applying the mentioned Windows updates on Windows Server 2019 and have not been able to reproduce the effects I reported in this issue. I'm thus closing it. Thank you for fixing the problem!

michbern-ms commented 11 months ago

Fantastic - thank you for confirming. And thanks for your patience on this -- it really was signfiicant work for the platform and I'm glad we were able to get it backported to Server 2019 and Server 2022.