rstudio / helm

Helm Resources for RStudio Products

Issues deploying "Git-backed" content (off-host execution) #394

Closed pat-s closed 1 year ago

pat-s commented 1 year ago

Disclaimer: we have been using off-host execution for almost a year now without any issues so far. It is unclear right now whether this is an application issue or a config/image issue on our side. We are seeing this on both AKS and EKS deployments, so the k8s flavor shouldn't matter here.

Example repo: https://github.com/adamjdeacon/example-shiny.git

When using 'import from git', we see

06/28 12:10:10.378 (GMT)
origin
Fetching git repo at /var/lib/rstudio-connect/git/github.com/adamjdeacon-example-shiny.git
06/28 12:10:11.337 (GMT)
6774f042023eaa500908026528b2e72865dfcccf refs/heads/master
06/28 12:10:11.480 (GMT)
warning: unable to access '/tmp/connect-workspaces/connectworkspace3945610775/.config/git/attributes': Permission denied
06/28 12:10:11.484 (GMT)
warning: unable to access '/tmp/connect-workspaces/connectworkspace3945610775/.config/git/attributes': Permission denied

with a final error window of

image

This is most likely because the bundle gets created with root:root instead of the rstudio-connect user:

image

I can't see why the permissions are root:root or what I could set to change these for git-backed content. I would assume that they are set "correct" by default?

Posit Connect 2023.05.0

aronatkins commented 1 year ago

The Git operations and creation of the bundle happens on the Connect node; the subsequent environment construction occurs in a separately launched pod. This error happens on the Connect node.

I am seeing the /tmp/connect-workspaces warnings in some local testing but not the bundle write error.

The lchown operation is attempting to assign root ownership for that app-82-622.tar.gz file. Effectively, the operations we run look like:

  1. Create an empty temporary git-bundle.*.tar.gz within the TMPDIR, owned by rstudio-connect:rstudio-connect
  2. Run git archive to populate that .tar.gz
  3. Copy the TMPDIR/git-bundle.*.tar.gz to /var/lib/rstudio-connect/bundles/app-82-622.tar.gz
  4. Set ownership to root:root with 0600.
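
The four steps above can be sketched roughly as follows. This is only an illustrative Python sketch, not Connect's actual implementation: the function names `archive_repo` and `install_bundle` and their parameters are hypothetical. It mainly shows where the ownership change sits in the sequence, which is the call that would fail with "operation not permitted" if the storage layer refuses it.

```python
import os
import shutil
import subprocess
import tempfile


def archive_repo(repo_dir: str, ref: str, tmp_dir: str) -> str:
    """Steps 1-2: create an empty temporary .tar.gz, then fill it via `git archive`."""
    fd, tmp_path = tempfile.mkstemp(prefix="git-bundle.", suffix=".tar.gz", dir=tmp_dir)
    os.close(fd)
    subprocess.run(
        ["git", "-C", repo_dir, "archive", "--format=tar.gz", "-o", tmp_path, ref],
        check=True,
    )
    return tmp_path


def install_bundle(tmp_path: str, dest_path: str, uid: int, gid: int, mode: int = 0o600) -> None:
    """Steps 3-4: copy the archive into the bundles directory, then set owner and mode."""
    shutil.copyfile(tmp_path, dest_path)
    # If the underlying storage refuses ownership changes, this call raises
    # PermissionError ("operation not permitted").
    os.lchown(dest_path, uid, gid)
    os.chmod(dest_path, mode)
```

In the workflow described above, `uid` and `gid` would both be 0 (root) and the destination would be a path such as /var/lib/rstudio-connect/bundles/app-82-622.tar.gz.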

The file you've outlined appears to have appropriate ownership; was that file created using Git or traditional upload?

Could you compare the on-disk permissions for the file created by this Git workflow to one created through a normal uploading deploy?

Did you use Git-backed deployment before 2023.05.0? What Connect version saw success?

pat-s commented 1 year ago

Thanks Aron.

So to summarize for myself (and please correct me if I got it wrong):

The latter could originate from file system permissions (we recently moved from azureblob to azurefiles in the AKS environment) or from some other AKS-related policy restriction (maybe gatekeeper again?). OTOH we see the same on EKS, so I guess it's still more likely to be an image/Connect version thingy.

> Could you compare the on-disk permissions for the file created by this Git workflow to one created through a normal uploading deploy?

Normal deployments all have rstudio-connect:rstudio-connect, as it can be seen in the screenshot.

> Did you use Git-backed deployment before 2023.05.0? What Connect version saw success?

No, first time in these environments at least. In others we certainly have but not all of them are on 2023.05 yet.

I've built a fresh image with 2023.06 but am now seeing a new issue there for the shiny-application pods:

rs-launcher-container flag provided but not defined: -t
rs-launcher-container Usage: /opt/rstudio-connect/ext/rsc-session <options> <program> <program-args>
rs-launcher-container
rs-launcher-container Posit Connect process session.

because the initContainer is still running on 2023.05 and the main process can't handle the removed flag anymore. Fixed by using the preview init container image.

So I can confirm that I am facing the same issue on 2023.06. Investigating more...

pat-s commented 1 year ago

Given the fact that it's about permissions and chown on a shared NFS filesystem: could root_squash be the culprit here?

pat-s commented 1 year ago

And to understand better: why is "git-backed" content treated differently in terms of permissions and ownership than other content? Isn't it possible to also use rstudio-connect:rstudio-connect for these deployments? This would maybe make the lchown call obsolete?

aronatkins commented 1 year ago

I am quite surprised that the uploaded bundles are owned by rstudio-connect:rstudio-connect. Those files are written directly by the Connect server.

Git-backed content is treated differently because we run the git archive operation as a separate process, as the rstudio-connect user. Once that archiving creates a .tar.gz, the chown (and subsequent chmod) adjust Git bundle permissions to be in line with the uploaded bundle permissions.

Squashing could be relevant, yes. Are you permitted to chown root:root against one of the files that are currently owned by rstudio-connect:rstudio-connect?
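
One way to answer that question without touching an existing bundle is to probe with a scratch file. The following is a minimal Python sketch under that assumption; the helper name `can_chown` is hypothetical, and it would need to run as root inside the Connect pod to be meaningful.

```python
import os
import tempfile


def can_chown(directory: str, uid: int, gid: int) -> bool:
    """Create a scratch file in `directory` and try to chown it to uid:gid."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        os.chown(path, uid, gid)
        return True
    except PermissionError:
        # Covers EPERM ("operation not permitted") as well as EACCES.
        return False
    finally:
        os.remove(path)
```

For example, `can_chown("/var/lib/rstudio-connect/bundles", 0, 0)` returning `False` for a root process would suggest that the storage layer, not Connect, is rejecting the ownership change.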

pat-s commented 1 year ago

> Are you permitted to chown root:root against one of the files that are currently owned by rstudio-connect:rstudio-connect?

Yes, when exec'ing into the connect server pods as the root user.

So maybe the service account used does not have permissions then? (we're using the namespace's default one).

Apps appear to run fine after changing the bundle permissions to root:root and 600. FYI: The differing permissions are most likely due to a file system change some weeks ago. But this does not yet explain/solve the problem when publishing 🤔

The error we see during publishing

[Connect] 2023/06/29 16:06:37.922666814   Unable to find packrat.lock in the project.
[Connect] 2023/06/29 16:06:37.922698192 Execution halted

is then most likely just a follow-on effect of an earlier permission issue, where manifest.json could not be properly translated into a packrat.lock file?

pat-s commented 1 year ago

I am meanwhile speculating that we're dealing with two distinct issues here:

  1. time="2023-06-29T17:33:40.751Z" level=error msg="error creating git bundle: lchown /var/lib/rstudio-connect/bundles/app-46-2981.tar.gz: operation not permitted" bundle_id=0 content_guid=d72a95af-9338-4dc7-9d02-1b25d58e111a content_id=46 correlation_id=0010e778-3be7-4c5c-847a-f2c2d3507315 -> I see this in 2/3 different Connect environments. In these, I can't do chown root:root from within the pod, not even as root, which lets me assume that this is a root_squash issue from the underlying EFS. Also the mount point of the shared storage is owned by rstudio-connect here and not root.
  2. The second issue is that we face the packrat.lock not found. This issue appears much later on than for (1) and in the respective environment (with azurefiles-nfs as the file system and different from the ones in (1)) I can run chown root:root on the bundles and we have no_root_squash enabled (just checked again)

And it's a k8s-only issue as I could successfully deploy an example app using the same Connect image in a vanilla VM deployment.
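
To tell the two failure modes apart across several environments, the key=value log line quoted in (1) can be filtered mechanically. A small Python sketch, assuming the log lines keep that key=value shape; `parse_logfmt` and `is_lchown_denied` are hypothetical helper names, not part of Connect.

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Split a key=value log line into a dict, honoring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def is_lchown_denied(fields: dict) -> bool:
    """True if the message reports an lchown rejected by the file system."""
    msg = fields.get("msg", "")
    return "lchown" in msg and "operation not permitted" in msg
```

Lines matching `is_lchown_denied` point at issue (1); environments that only show the packrat.lock error would fall under issue (2).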

aronatkins commented 1 year ago

Hm. The packrat.lock problem could be read-after-write latency. Connect sees that content only has a manifest.json and creates a packrat.lock so we can restore content. It then launches a pod that reads from the directory containing both files.
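
If read-after-write latency were the cause, a reader on the shared volume would see the file only after a short delay. A minimal Python sketch of how one might test that hypothesis from a second pod; the helper `wait_for_file` and its timeout values are assumptions for illustration, not something Connect does.

```python
import os
import time


def wait_for_file(path: str, timeout: float = 10.0, interval: float = 0.25) -> bool:
    """Poll until `path` exists with non-zero size, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if os.stat(path).st_size > 0:
                return True
        except FileNotFoundError:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

If the file shows up only after several polls, latency is a plausible explanation; if it never shows up, the reader is more likely looking at a different volume or path.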

I think your hunch about the service account being involved might be accurate; @dbkegley mentioned to me that he saw another recent case where we had problems caused by effective permissions introduced by the service account. I'll let him chime in with details.

pat-s commented 1 year ago

Thanks. I did a bit more digging:

  1. I found that EFS has by default no_root_squash enabled. However, the access point we have provisioned has permissions set to rstudio-connect:rstudio-connect. Hence, every bundle created is first owned by rstudio-connect and somehow the chown root:root is not allowed at the moment. Interestingly, this only applies to "git backed content", i.e. normal direct deployments don't care about this and work just fine. I am not entirely sure but at the moment I am suspecting that I have to migrate to a new EFS access point with 0:0 permissions instead of 999:999 to get beyond this point? :(
  2. For the Azure environment in question, permissions of the NFS share are actually root:root by default, so no issue with (1). However here we are facing the packrat/manifest issue which was not present in 2023.03 (as we currently cannot deploy anything anymore without facing this issue in this environment). I am currently trying to deploy 2023.06, just to make sure I am aligning images between the environments to be able to further narrow down the individual components.

(Sorry for the long (support-like) issue but I didn't anticipate that this will evolve into something like it is now)

dbkegley commented 1 year ago

Can you share the set of mountOptions that are being set on the PVC/storageClass for Connect's data volume? We have seen some permissions issues with AzureFiles in the past but I'm not sure whether that's the case here.

In another environment we found that setting uid/gid in the mountOptions caused some permission denied errors if they are not mapped correctly to the uid/gid of Connect's RunAs user (I believe the default value in our image is 999 for both).

The pod.securityContext settings in the rstudio-connect helm chart's values can have a similar effect if not mapped properly to Connect's RunAs user's uid/gid
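
A quick way to spot such a mapping problem is to list everything on the data volume whose owner differs from the expected uid/gid. The following Python sketch assumes the 999/999 default mentioned above, which should be confirmed against the image in use; `ownership_mismatches` is a hypothetical helper.

```python
import os


def ownership_mismatches(root: str, uid: int, gid: int) -> list:
    """Return paths under `root` whose owner differs from uid:gid."""
    mismatched = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if (st.st_uid, st.st_gid) != (uid, gid):
                mismatched.append(path)
    return mismatched
```

For example, `ownership_mismatches("/var/lib/rstudio-connect/apps", 999, 999)` would list content files that are not owned by the assumed RunAs user.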

pat-s commented 1 year ago

Thanks David.

> In another environment we found that setting uid/gid in the mountOptions caused some permission denied errors if they are not mapped correctly to the uid/gid of Connect's RunAs user (I believe the default value in our image is 999 for both).
>
> The pod.securityContext settings in the rstudio-connect helm chart's values can have a similar effect if not mapped properly to Connect's RunAs user's uid/gid

We've been running the whole beta phase without issues and after debugging this for the whole day, I realize that some "bigger" changes must have happened in 2023.05 than I initially realized from the changelog items alone. I cannot prove that "git-backed" content worked for us in 2023.03, but we didn't have the issues in Azure that we now face in 2023.05. It could still be that it's due to our own images, but going back is not (so easily) possible due to the DB schema changes...

dbkegley commented 1 year ago

@pat-s Would you mind filing a support request for this? It might be helpful to hop on a call to see this happening in real time. We haven't been able to re-create this in any of our dev/staging environments so far.

pat-s commented 1 year ago

> Would you mind filing a support request for this? It might be helpful to hop on a call to see this happening in real time. We haven't been able to re-create this in any of our dev/staging environments so far.

Not right now unfortunately as I don't have resources and time to go through many iterations with first level support until after some days/weeks there's potentially helpful progress. Had that too often already...

Any help from you guys is appreciated, but I don't expect it and am not waiting for it (just to be clear) - I am aware this is not an official way to get help for potentially environment-specific issues 🙃

Logging here what I've tried:

- [ ] Tried with a fresh instance (new PVC, new DB) with 2023.06 (same issue) (Azure)
- [ ] Tried with `azureblob` instead of `azurefiles`
- [ ] Tried with `azurefiles` but without mount options
- [ ] Tried with `ghcr.io/rstudio/content-base:r4.2.2-py3.11.3-ubuntu2204` instead of our image
- [ ] Tried with 2023.03

I've then tried to compare what happens exactly in a working environment and a failing one by trying to deploy the git-backed content from https://github.com/adamjdeacon/example-shiny.git:

Working deployment (fresh Connect instance, EKS 1.27, EFS without access point, static provisioning, Connect 2023.05)

```
root@connect-554cbbc95d-6lpb6:/# ls -la /var/lib/rstudio-connect/apps/2/2/
total 100
drwxr-x---. 3 rstudio-connect rstudio-connect  6144 Jul  1 09:02 .
drwx------. 3 root            root             6144 Jul  1 09:01 ..
-rw-r-----. 1 rstudio-connect rstudio-connect  1241 Feb 24  2021 app.R
-rw-r-----. 1 rstudio-connect rstudio-connect   205 Feb 24  2021 example-shiny.Rproj
-rw-r-----. 1 rstudio-connect rstudio-connect    40 Feb 24  2021 .gitignore
-rw-r-----. 1 rstudio-connect rstudio-connect     0 Jul  1 09:02 .here
-rw-r-----. 1 rstudio-connect rstudio-connect 67988 Feb 24  2021 manifest.json
drwxr-x---. 4 rstudio-connect rstudio-connect  6144 Jul  1 09:02 packrat
-rw-r-----. 1 rstudio-connect rstudio-connect    16 Feb 24  2021 README.md
```

Failing deployment (Azure with azurefiles/blob, AKS 1.26, dynamic provisioning, Connect 2023.05)

```
root@connect-7b7666c8b6-85b2f:/# ls -la /var/lib/rstudio-connect/apps/8/9/
total 70
drwxr-x--- 2 rstudio-connect rstudio-connect     0 Jul  1 09:09 .
drwx------ 2 root            root                0 Jul  1 09:09 ..
-rw-r----- 1 rstudio-connect rstudio-connect  1241 Feb 24  2021 app.R
-rw-r----- 1 rstudio-connect rstudio-connect   205 Feb 24  2021 example-shiny.Rproj
-rw-r----- 1 rstudio-connect rstudio-connect    40 Feb 24  2021 .gitignore
-rw-r----- 1 rstudio-connect rstudio-connect 67988 Feb 24  2021 manifest.json
drwxr-x--- 2 rstudio-connect rstudio-connect     0 Jul  1 09:09 packrat
-rw-r----- 1 rstudio-connect rstudio-connect    16 Feb 24  2021 README.md
```

Note the size of `0` for the `packrat` dir in the failing example. I think this is why the error is thrown WRT to `packrat.lock` not found. In fact, `packrat/packrat.lock` exists

```
root@connect-7b7666c8b6-85b2f:/# ls -la /var/lib/rstudio-connect/apps/8/9/packrat/
total 2
drwxr-x--- 2 rstudio-connect rstudio-connect    0 Jul  1 09:09 .
drwxr-x--- 2 rstudio-connect rstudio-connect    0 Jul  1 09:09 ..
drwxr-x--- 2 rstudio-connect rstudio-connect    0 Jul  1 09:09 desc
-rw-r----- 1 rstudio-connect rstudio-connect    0 Jul  1 09:09 .manrat
-rw-r----- 1 rstudio-connect rstudio-connect 1984 Jul  1 09:09 packrat.lock
```

but the files in the dir all have size `0` besides `packrat.lock`. Permissions seem to be correct, i.e. all the same in both examples.

I think that the files are already "faulty" before they go into the deployment pod. The stats from above come from the shared PV accessed from the server pod.

The Q is now: what causes the files under `apps/packrat` to be created with a file size of 0? Owner and permissions are the same. But `packrat` is the only content which is newly created, the other files come from the remote source.

pat-s commented 1 year ago

😮‍💨😮‍💨😮‍💨 Turns out that I had a mismatch between sharedStorage.name and config.Launcher.DataDirPVCName. This has been, presumably, causing the uploaded bundles from the server to not appear in the deployment pods of the launcher...

It's always something like this when things "go wrong" on this level 😅

One action item that could possibly resolve this for the future: Have only one key which specifies the name of the shared storage PVC in the chart. I am not sure if two distinct ones can work together at all?
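
Until the chart offers a single key, the two settings can be compared before deploying. A minimal Python sketch over an already-parsed values mapping; the function name `pvc_name_mismatch` is hypothetical, and treating an unset `config.Launcher.DataDirPVCName` as fine is an assumption based on the maintainer's reply later in this thread.

```python
def pvc_name_mismatch(values: dict):
    """Return (sharedStorage.name, DataDirPVCName) if both are set and differ, else None."""
    shared = values.get("sharedStorage", {}).get("name")
    launcher = values.get("config", {}).get("Launcher", {}).get("DataDirPVCName")
    # Assumption: an unset DataDirPVCName is filled in by the chart, so it is not a mismatch.
    if launcher is None or shared == launcher:
        return None
    return (shared, launcher)
```

A non-`None` result means the launcher's content pods would mount a different PVC than the one the Connect server writes bundles to.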

Thanks to everybody who tried to troubleshoot here!

colearendt commented 1 year ago

You can unset the LauncherPvcName. We set that for you if it is not set 😄

pat-s commented 1 year ago

🤦 oh how embarrassing. Means all of this could have been avoided in the first place...

Wasn't there a time during the beta when setting this was mandatory? Anyhow...my fault 😓

WRT the EFS issue and lchown: turns out this is in fact an issue when EFS is used with access points which have uid/gid enforcements. We've had our EFS configured like that (with uid/gid set to 999 for connect). This works fine for direct deployments but not for "git-backed" content, as the chown from rstudio-connect to root then seems not to be allowed.

So we deployed a statically provisioned mount (as dynamic ones always use access points), moved the data and now everything works.