Closed: pat-s closed this issue 1 year ago.
The Git operations and creation of the bundle happen on the Connect node; the subsequent environment construction occurs in a separately launched pod. This error happens on the Connect node.
I am seeing the /tmp/connect-workspaces warnings in some local testing, but not the bundle write error.
The lchown operation is attempting to assign root ownership to that app-82-622.tar.gz file. Effectively, the operations we run look like:
1. Create git-bundle.*.tar.gz within the TMPDIR, owned by rstudio-connect:rstudio-connect.
2. Run git archive to populate that .tar.gz.
3. Move TMPDIR/git-bundle.*.tar.gz to /var/lib/rstudio-connect/bundles/app-82-622.tar.gz.
4. chown the bundle to root:root and chmod it to 0600.
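As a rough shell sketch of that sequence (illustrative only; the repo path, temp names, and exact commands are assumptions, not Connect's actual implementation):

```bash
# Rough equivalent of the steps above (illustrative; not Connect's actual code).
# Steps 1-3 are performed as the rstudio-connect user, step 4 as root.
TMPDIR=$(mktemp -d)                        # temp workspace, e.g. under /tmp/connect-workspaces
BUNDLE="$TMPDIR/git-bundle.12345.tar.gz"   # name pattern from the list above; suffix is made up

# populate the .tar.gz from the checked-out Git ref (repo path is a placeholder)
git -C /path/to/git/checkout archive --format=tar.gz -o "$BUNDLE" HEAD

# move it into Connect's bundle store
mv "$BUNDLE" /var/lib/rstudio-connect/bundles/app-82-622.tar.gz

# align ownership/permissions with uploaded bundles; this is the lchown that fails here
chown root:root /var/lib/rstudio-connect/bundles/app-82-622.tar.gz
chmod 0600 /var/lib/rstudio-connect/bundles/app-82-622.tar.gz
```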
The file you've outlined appears to have appropriate ownership; was that file created using Git or traditional upload?
Could you compare the on-disk permissions for the file created by this Git workflow to one created through a normal uploading deploy?
Did you use Git-backed deployment before 2023.05.0? What Connect version saw success?
Thanks Aron.
So to summarize for myself (and please correct me if I got it wrong):
- I can ignore the /tmp/connect-workspaces/connectworkspace3945610775/.config/git/attributes warning, as it's not causing any trouble here.
- root:root ownership is expected for content deployed passively via "Import from git"? And yes, these bundles came from "git backed" content, i.e. passive publishing.
- The lchown operation is denied? The latter could originate from file system permissions (we recently moved to azurefiles, coming from azureblob, in the AKS environment) or some other AKS-related policy restriction (maybe Gatekeeper again?). OTOH, we see the same on EKS, so I guess it's still more likely to be an image/Connect version thing.
> Could you compare the on-disk permissions for the file created by this Git workflow to one created through a normal uploading deploy?

Normal deployments all have rstudio-connect:rstudio-connect, as can be seen in the screenshot.
> Did you use Git-backed deployment before 2023.05.0? What Connect version saw success?

No, first time in these environments at least. In others we certainly have, but not all of them are on 2023.05 yet.
I've built a fresh image with 2023.06 but am seeing a new issue there for the shiny-application pods now:
rs-launcher-container flag provided but not defined: -t
rs-launcher-container Usage: /opt/rstudio-connect/ext/rsc-session <options> <program> <program-args>
rs-launcher-container
rs-launcher-container Posit Connect process session.
This is because the initContainer is still running on 2023.05 and the main process can't handle the removed flag anymore. Fixed by using the preview init container image.
So can confirm I am facing the same issue on 2023.06. Investigating more...
Given the fact that it's about permissions and chown on a shared NFS filesystem: could root_squash be the culprit here?
And to understand better: why is "git-backed" content treated differently in terms of permissions and ownership than other content? Isn't it possible to also use rstudio-connect:rstudio-connect for these deployments? That would maybe make the lchown call obsolete?
I am quite surprised that the uploaded bundles are owned by rstudio-connect:rstudio-connect. Those files are written directly by the Connect server.
Git-backed content is treated differently because we run the git archive operation as a separate process, as the rstudio-connect user. Once that archiving creates a .tar.gz, the chown (and subsequent chmod) adjust the Git bundle permissions to be in line with the uploaded bundle permissions.
Squashing could be relevant, yes. Are you permitted to chown root:root against one of the files that are currently owned by rstudio-connect:rstudio-connect?
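For example, a minimal check (assuming you can open a root shell in the Connect pod; the bundle name is just the one mentioned earlier in this thread):

```bash
# From a root shell inside the Connect server pod, against one of the existing bundles:
ls -l /var/lib/rstudio-connect/bundles/app-82-622.tar.gz
chown root:root /var/lib/rstudio-connect/bundles/app-82-622.tar.gz
# "Operation not permitted" here, even as root, points at root squashing (or an
# access-point/export policy) on the underlying share rather than at Connect itself.
```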
> Are you permitted to chown root:root against one of the files that are currently owned by rstudio-connect:rstudio-connect?

Yes, when exec'ing into the Connect server pods as the root user.
So maybe the service account used does not have permissions then? (We're using the namespace's default one.)
Apps appear to run fine after changing the bundle permissions to root:root and 600. FYI: the differing permissions are most likely due to a file system change some weeks ago. But this does not yet explain/solve the problem when publishing 🤔
The error we see during publishing,
[Connect] 2023/06/29 16:06:37.922666814 Unable to find packrat.lock in the project.
[Connect] 2023/06/29 16:06:37.922698192 Execution halted
is then most likely also just a downstream result of an earlier permission issue, in which manifest.json could not be properly translated into a packrat.lock file?
I am meanwhile speculating that we're dealing with two distinct issues here:
time="2023-06-29T17:33:40.751Z" level=error msg="error creating git bundle: lchown /var/lib/rstudio-connect/bundles/app-46-2981.tar.gz: operation not permitted" bundle_id=0 content_guid=d72a95af-9338-4dc7-9d02-1b25d58e111a content_id=46 correlation_id=0010e778-3be7-4c5c-847a-f2c2d3507315
-> I see this in 2/3 different Connect environments. In these, I can't do chown root:root
from within the pod, not even as root, which let's me assume that this is a root_squash
issue from the underlying EFS. Also the mount point of the shared storage is owned by rstudio-connect
here and not root
.packrat.lock
not found. This issue appears much later on than for (1) and in the respective environment (with azurefiles-nfs
as the file system and different from the ones in (1)) I can run chown root:root
on the bundles and we have no_root_squash
enabled (just checked again)And it's a k8s-only issue as I could successfully deploy an example app using the same Connect image in a vanilla VM deployment.
Hm. The packrat.lock problem could be read-after-write latency. Connect sees that the content only has a manifest.json and creates a packrat.lock so we can restore the content. It then launches a pod that reads from the directory containing both files.
I think your hunch about the service account being involved might be accurate; @dbkegley mentioned to me that he saw another recent case where we had problems caused by effective permissions introduced by the service account. I'll let him chime in with details.
Thanks. I did a bit more digging:
1. In the EFS environment(s), no_root_squash is enabled. However, the access point we have provisioned has permissions set to rstudio-connect:rstudio-connect. Hence, every bundle created is first owned by rstudio-connect, and somehow the chown root:root is not allowed at the moment. Interestingly, this only applies to "git backed" content, i.e. normal direct deployments don't care about this and work just fine. I am not entirely sure, but at the moment I suspect that I have to migrate to a new EFS access point with 0:0 permissions instead of 999:999 to get beyond this point? :(
2. In the other environment, bundles are owned root:root by default, so no issue with (1). However, here we are facing the packrat/manifest issue, which was not present in 2023.03 (we currently cannot deploy anything anymore without facing this issue in this environment). I am currently trying to deploy 2023.06, just to make sure I am aligning images between the environments so I can further narrow down the individual components.

(Sorry for the long, support-like issue, but I didn't anticipate that this would evolve into what it is now.)
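To illustrate the enforcement suspected in (1) (hedged; the filesystem ID and path below are placeholders): an EFS access point created with a POSIX user makes all NFS operations through it run as that uid/gid, which is consistent with chown root:root being refused even for root inside the pod. Roughly:

```bash
# Example only: an EFS access point that enforces uid/gid 999 (the rstudio-connect user).
# With a POSIX user set, all NFS operations through the access point run as 999:999,
# so a chown to root:root is refused even for root inside the pod.
aws efs create-access-point \
  --file-system-id fs-0123456789abcdef0 \
  --posix-user Uid=999,Gid=999 \
  --root-directory 'Path=/connect,CreationInfo={OwnerUid=999,OwnerGid=999,Permissions=0750}'
```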
Can you share the set of mountOptions that are being set on the PVC/storageClass for Connect's data volume?
We have seen some permissions issues with AzureFiles in the past, but I'm not sure whether that's the case here.
In another environment we found that setting uid/gid in the mountOptions caused some permission-denied errors if they are not mapped correctly to the uid/gid of Connect's RunAs user (I believe the default value in our image is 999 for both).
The pod.securityContext settings in the rstudio-connect helm chart's values can have a similar effect if not mapped properly to Connect's RunAs user's uid/gid.
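A hedged sketch of what that alignment can look like (the storage class name, provisioner, and value layout are assumptions for illustration; only the uid/gid of 999 and the pod.securityContext key come from this thread, and uid/gid mount options apply to SMB-style Azure Files mounts rather than NFS):

```bash
# Example only: keep the volume's uid/gid and the pod securityContext aligned with
# Connect's RunAs user (999 in the default image, per the comment above).

# (a) uid/gid mount options on an (SMB-style) Azure Files storage class:
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: connect-azurefile            # hypothetical name
provisioner: file.csi.azure.com
mountOptions:
  - uid=999                          # must match Connect's RunAs uid
  - gid=999                          # must match Connect's RunAs gid
EOF

# (b) matching pod.securityContext in the rstudio-connect chart values (sketch):
cat > values-securitycontext.yaml <<'EOF'
pod:
  securityContext:
    runAsUser: 999
    runAsGroup: 999
    fsGroup: 999
EOF
```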
Thanks David.
mount_options = ["nconnect=4", "rsize=1048576", "wsize=1048576", "hard", "timeo=600", "retrans=2", "noresvport"]
tls
> In another environment we found that setting uid/gid in the mountOptions caused some permission denied errors if they are not mapped correctly to the uid/gid of Connect's RunAs user (I believe the default value in our image is 999 for both).

The uid/gid here maps to the rstudio-connect user. NB: we have been running with this for more than a year without issues so far. Not setting anything here will still result in a value, i.e. AWS will set it to 50000:50000, AFAIR, for the access point in question.

> The pod.securityContext settings in the rstudio-connect helm chart's values can have a similar effect if not mapped properly to Connect's RunAs user's uid/gid

We set securityContext: null, as AKS complains about privileged containers otherwise.
We've been running the whole beta phase without issues, and after debugging this for the whole day I realize that some "bigger" changes must have happened in 2023.05 than I initially gathered from the changelog items alone. I cannot prove that "git-backed" content has been working in 2023.03 for us, but we didn't have the issues in Azure that we are now facing on 2023.05. It could still be that it's due to our own images, but going back is not (so easily) possible due to the DB schema changes...
@pat-s Would you mind filing a support request for this? It might be helpful to hop on a call to see this happening in real time. We haven't been able to re-create this in any of our dev/staging environments so far.
> Would you mind filing a support request for this? It might be helpful to hop on a call to see this happening in real time. We haven't been able to re-create this in any of our dev/staging environments so far.
Not right now, unfortunately, as I don't have the resources and time to go through many iterations with first-level support until, after some days/weeks, there is potentially helpful progress. I've had that too often already...
Any help from you guys is appreciated, but I don't expect it nor am I waiting for it (just to be clear). I am aware this is not an official way to get help for potentially environment-specific issues 🙃
😮💨😮💨😮💨 Turns out that I had a mismatch between sharedStorage.name and config.Launcher.DataDirPVCName. This has, presumably, been causing the uploaded bundles from the server to not appear in the deployment pods of the launcher...
It's always something like this when things "go wrong" on this level 😅
One action item that could possibly resolve this for the future: Have only one key which specifies the name of the shared storage PVC in the chart. I am not sure if two distinct ones can work together at all?
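For reference, a hedged sketch of how the two keys relate in the chart values (the PVC name is a placeholder, and the exact nesting is an assumption based on the dotted key names used above):

```bash
# Illustrative chart values: both keys should point at the same shared-storage PVC
# (or, per the reply below, config.Launcher.DataDirPVCName can simply be left unset).
cat > values.yaml <<'EOF'
sharedStorage:
  name: rstudio-connect-shared                # placeholder PVC name
config:
  Launcher:
    DataDirPVCName: rstudio-connect-shared    # must match sharedStorage.name if set at all
EOF
```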
Thanks to everybody who tried to troubleshoot here!
You can unset the LauncherPvcName. We set that for you if it is not set 😄
🤦 oh how embarrassing. Means all of this could have been avoided in the first place...
Wasn't there a time during the beta when setting this was mandatory? Anyhow...my fault 😓
WRT the EFS issue and lchown: it turns out this is in fact an issue when EFS is used with access points that have uid/gid enforcement. We had our EFS configured like that (with uid/gid set to 999 for Connect). This works fine for active deployments but not for "git-backed" content, as then the chown from rstudio-connect to root seems not to be allowed.
So we deployed a statically provisioned mount (as dynamic ones always use access points), moved the data, and now everything works.
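A hedged sketch of that workaround, assuming the AWS EFS CSI driver (the filesystem ID, PV name, and capacity are placeholders): statically provisioning the PersistentVolume against the bare filesystem ID avoids the access point and, with it, the uid/gid enforcement.

```bash
# Example only: a statically provisioned EFS PersistentVolume without an access point,
# so no POSIX-user enforcement applies and Connect's chown to root:root can succeed.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: rstudio-connect-data               # placeholder
spec:
  capacity:
    storage: 100Gi                          # required field; EFS does not enforce it
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-0123456789abcdef0      # bare filesystem ID, no ::fsap-... access point suffix
EOF
```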
Disclaimer: we've been using off-host execution for almost a year now without any issues so far. It is unclear right now whether this is an application issue or a config/image issue on our side. We are seeing this on both AKS and EKS deployments, so the k8s flavor shouldn't matter here.
Example repo: https://github.com/adamjdeacon/example-shiny.git
When using 'import from git', we see
[screenshot]
with a final error window of
[screenshot]
This is most likely because the bundle gets created with root:root instead of the rstudio-connect user:
[screenshot]
I can't see why the permissions are root:root or what I could set to change these for git-backed content. I would assume that they are set "correct" by default?
Posit Connect 2023.05.0
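As a starting point for the permission comparison discussed later in the thread, a minimal check (run from a shell inside the Connect server pod; the path is the default bundle directory mentioned above):

```bash
# Compare on-disk ownership of bundles produced by Git-backed deployments vs. regular uploads.
ls -l /var/lib/rstudio-connect/bundles/
```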