ubuntu / WSL

Ubuntu WSL image build and launcher code.
MIT License
151 stars 45 forks source link

feat: Check init tasks #478

Closed CarlosNihelton closed 4 months ago

CarlosNihelton commented 5 months ago

This affects only the registration time. Subsequent boots don't check for cloud-init or depend on it in any way.

If the distro launcher is used to register a WSL instance, we check if there is a cloud-init command in PATH, and if so we launch cloud-init status --wait right after registration is complete. This will wait until cloud-init completes what's doing or report error/disabled states. By having a Win32 process mapped to that command seems enough to make WSL consider the boot is complete, thus preventing WSL boot time watchdog to kick in.


EDIT: Cloud-init could have created Linux user accounts. Users could have been created in many other ways, even pre-created during image build. We expect the desired default user to be set via /etc/wsl.conf. That would work even without the distro launcher, so it's reasonable to assume the desired default to be present there.

We "parse" that file and, if there is the default user entry, then we set it via WSL API (which is faster than terminanting and restarting the instance). In order to leverage Windows primitives for parsing INI, we have to make a temporary copy of that file into the Windows filesystem.

If we don't find a default user set, we check if the current default user is still root, in which case we look for any non-system user in the NSS passwd database (by launching getent), i.e. any user with the lowest UID greather than or equals to 1000 and have a real login shell (no /bin/sync,/bin/false or /usr/bin/nologin).

Finally, we fallback to the interative user creation, i.e. the default upstream prompting.


I originally intended to put this implementation inside DistributionInfo.h/cpp to ensure less files being touched (thus less diff compared to upstream) but that doesn't play nice with CI 'Refresh releases and assets' workflow.

So I placed the code in its own component, and inside a subdirectory, so if there comes a time we need to revert that change (it becomes obsolete or so) it remains easy to drop. Also, being inside a subdirectory allows me to drop a .clang-format in there without worrying of reformatting upstream code :)

Lessons learned from the launcher's old times:

I didn't add too many test cases because I was afraid of taking too long to register/unregister in between tests. I also avoided adding stuff that could make the tests brittle such as installing huge apt packages (in fact I'm installing nothing at all at this moment), attaching to Ubuntu Pro, enrolling into Landscape and what not. Up for discussion on what makes sense to add here. I also didn't play with cloud-init user-data file selection. Although I could, I believe that's better tested as a unit test of the datasource in the upstream repository itself. Up for discussion as well.

Finally we cannot use private directories to behave as %USERPROFILE%, we need to put stuff inside the real user profile directory on Windows, which is sad. We can set %USERPROFILE% in the Windows shell appropriately and subprocesses of the launcher (either Linux or Windows processes) find the correct value, but systemd services won't. They will still match up the global %USERPROFILE%, one reason why I struggled to implement those tests.


UDENG-2768

CarlosNihelton commented 5 months ago

Before going on with the PR, I think we should clarify this:

If we don't find a default user set, than we allow fallback to the interative user creation, i.e. the default upstream prompting.

I think we said that if we don’t find anything in wsl.conf, we still list all users and take the first one with UID >= 1000? I can imagine people using cloud-init use the standard user creation creation but don’t know about wsl.conf and we want to support them.

My suggestion would be, once the launcher waited on cloud-init as you did in that PR:

  • look at wsl.conf, if there is a value there, ensure the user exists on the system (with getent passwd), update the windows registry
  • if nothing, getent passwd list all users with UID >= 1000. If found, update wsl.conf and the windows registry
  • otherwise, fallback to interactive user creation.

Does this make sense?

While it does make sense, I've been resisting to move towards that direction because it becomes harder (IMHO) to explain to users why certain cloud-init user-data behaves in one way when the distro is set up via the launcher and differently when through wsl --import for example. To date I still struggle to find a clear way to present this difference in behavior as a feature of the distro launcher so users won't just assume coming straight from cloud-init and being surprised when it doesn't happen, other than relying on Ubuntu WSL documentation. I think that can be confusing. Do you still believe it's worth investing in doing this augmentation of cloud-init in the distro launcher?

didrocks commented 5 months ago

It does not depend on cloud-init directly as we don’t really care the way the user was created on the machine. I can be a pre-created image with a default user and they didn’t create/know about wsl.conf.

We don’t want those users to grow that knowledge and the launcher should do the right thing silently.

@jibel opinions?

jibel commented 4 months ago

It does not depend on cloud-init directly as we don’t really care the way the user was created on the machine. I can be a pre-created image with a default user and they didn’t create/know about wsl.conf.

We don’t want those users to grow that knowledge and the launcher should do the right thing silently.

@jibel opinions?

Looking at the order of priority to consider a default user for WSL it's: wsl.conf > registry wsl considers nothing else.

So we should check:

Then create an entry in wsl.conf based on this discovery if there is none so wsl -d starts with the default user

CarlosNihelton commented 4 months ago

I changed the PR description to match the current implementation after our discussion. The changes in the description are marked as shown below:


EDIT: ...


CarlosNihelton commented 4 months ago

Let me interject on this:

We told that the order is (from what wsl command is considering): wsl.conf key in registry If I am correct, we don’t check the registry here, only set it. I think similarly than wsl.conf, we should consider it as "a default has been set".

The function bool enforceDefaultUser(WslApiLoader& api) does the following order:

  1. stores all user entries from getent (used for name/uid matching in the subsequent steps)
  2. reads /etc/wsl.conf
  3. checks if there is one default set via WSL API (which is the same as checking the registry -- via if (auto uid = DistributionInfo::QueryUid(L""); uid != 0) {
  4. finally searches for a non-system user in the collection stored in 1.

Step 3 needs a comment in the code to clarify that behavior, but in fact it's checking for the registry by its effects. Under the hood DistributionInfo::QueryUid(L"") runs id -u (without specifying the username), effectively checking the uid of the current user running that command, i.e. the current default user.

If one manually sets the relevant registry key or runs WslConfigureDistribution(uid, flags), the effect is immediate, i.e. the next WslLaunch or WslLaunchInteractive will run commands as that user. /etc/wsl.conf is more problematic. Changes in /etc/wsl.conf only become effective upon instance restart (not the VM, but the particular instance).

So, consider that a rootfs comes with boot.systemd=true out of the shelf. The first command we'd launch after registering an instance from that rootfs will have systemd running, because during boot WSL already sees that definition. Default user by then is root (nothing set the default user yet). Let's assume the first command is then: cloud-init status --wait and, by the time cloud-init finishes, it writes user.default=ubuntu into /etc/wsl.conf. The default user will remain root, because the instance was not rebooted yet. That's why when we read /etc/wsl.conf right after cloud-init finishes its job we set the default user via WSL API (or reboot the instance - not the VM, just the instance).

I'll clarify that in the code, but otherwise I don't believe we need to read the registry directly. Even because that would be more prone to races than what we're doing now.

CarlosNihelton commented 4 months ago

About us writing into /etc/wsl.conf it was more for consistency, as proposed by jibel on this comment:

So we should check:

wsl.conf registry first user >= 1000 (possibly created by cloud-init or a wsl import or something else we don't know) interactive mode Then create an entry in wsl.conf based on this discovery if there is none so wsl -d starts with the default user

I think we can even skip writing that file, because if we set the default user via WSL API (which is the same as writing into the registry) then wsl -d runs with the default user set unless specified otherwise by the u user option.

By avoiding writing it we avoid "fixing" it at the same time. What do you guys think? @didrocks @jibel

didrocks commented 4 months ago

Ack on reading the API indirectly from the registry (yes, a comment on this will help to clarify it) :)

I think we can even skip writing that file, because if we set the default user via WSL API (which is the same as writing into the registry) then wsl -d runs with the default user set unless specified otherwise by the u user option.

If I am correct, this is still checking the user exists and fix the consistency issue no by changing the registry? Again, I’m not close to the code, but I think basically the source of truth should be:

  1. if wsl.conf is there and wrong (by wrong, I mean the user doesn’t exist), so be it, let’s the already set value in it
  2. if the registry is there and wrong, same paradigm, we don’t do any consistency check and don’t enforce/overwrite ourself.

Basically, we only update one or the two of them, depending on what you told before, if both of them are unset. This is my 2 cents.

CarlosNihelton commented 4 months ago

Ack on reading the API indirectly from the registry (yes, a comment on this will help to clarify it) :)

I think we can even skip writing that file, because if we set the default user via WSL API (which is the same as writing into the registry) then wsl -d runs with the default user set unless specified otherwise by the u user option.

If I am correct, this is still checking the user exists and fix the consistency issue no by changing the registry? Again, I’m not close to the code, but I think basically the source of truth should be:

  1. if wsl.conf is there and wrong (by wrong, I mean the user doesn’t exist), so be it, let’s the already set value in it
  2. if the registry is there and wrong, same paradigm, we don’t do any consistency check and don’t enforce/overwrite ourself.

Basically, we only update one or the two of them, depending on what you told before, if both of them are unset. This is my 2 cents.

I totally agree on error out if /etc/wsl.conf is wrong. We still have an issue if that file is correct: we'll start the shell as root, because the instance didn't reboot. Setting the UID matching the name in /etc/wsl.conf would prevent that behavior. Otherwise we must restart the instance, which is slower, but allows any change in that conf file to take effect (which might be more than just setting the default user in the end).

jibel commented 4 months ago

About us writing into /etc/wsl.conf it was more for consistency, as proposed by jibel on this comment:

So we should check: wsl.conf registry first user >= 1000 (possibly created by cloud-init or a wsl import or something else we don't know) interactive mode Then create an entry in wsl.conf based on this discovery if there is none so wsl -d starts with the default user

I think we can even skip writing that file, because if we set the default user via WSL API (which is the same as writing into the registry) then wsl -d runs with the default user set unless specified otherwise by the u user option.

By avoiding writing it we avoid "fixing" it at the same time. What do you guys think? @didrocks @jibel

I think it is te correct behaviour. If one user is created by cloud-init and at the same time wsl.conf is created from cloud-init but with a different user, we would have the first user in the registry, but wsl.conf would take the precedence and use another user. This is the behaviour we want.

What happends if several users are created by cloud-init? Which one would be in the registry?

CarlosNihelton commented 4 months ago

What happends if several users are created by cloud-init? Which one would be in the registry?

Assuming nothing else sets the default user by then, the launcher will iterate over the list of users collected from the passwd database and set as default the lowest UID greater than 1000, which matches the first user written in the cloud-config user-data, per my observation when writing tests (assuming only cloud-init would create non-system users in that scenario).

CarlosNihelton commented 4 months ago

I think some condition are coming from the previous implementation

Replied inline, the condition has meaning to the current test cases leveraging it.

Also, I feel that maybe we can augment the tests with pre-shipped /etc/passwd, where we have users with uid, but no shell, and various corner cases like this.

I added a few more test cases in which cloud-init plays a double-agent role, doing its expected job but also deliverying broken /etc/passwd files that could mess with our logic.

Note: I don’t think the edit of the description is up to date though. You are still saying: "user in the NSS passwd database (by launching getent), i.e. any user with UID between 1000 and 65534 (the nobody)."

I updated the PR description.

I think this covers all the remaining concerns. I'm still being careful with the amount of times we register/unregister WSL instances to avoid too long tests.

CarlosNihelton commented 4 months ago

Looks like I need to rebase due doc changes. I'm marking the new commits as usual.

CarlosNihelton commented 4 months ago

Looks like I need to rebase due doc changes. I'm marking the new commits as usual.

Rebasing was not exactly needed nor was that the fix, but rather telling bash.exe about handling symlinks on WIndows. image