lemeurherve closed this issue 6 months ago
Current references to `blobxfer` (excluding some updatecli manifests and co):

Corresponding storage accounts:

* `prodpluginsite`, File Share: `pluginsite`
* `prodjenkinsio`, File Share: `jenkinsio` (and also `cnjenkinsio` & `zhjenkinsio`)
* `prodjavadoc`, File Share: `javadoc`
* `prodjenkinsreleases`, File Shares: `mirrorbits` & `website`

To replace `blobxfer` by `azcopy` for manipulating these file shares, SAS tokens need to be generated and stored as credentials (blobxfer uses the access key of the storage accounts, which is much less fine-grained).
To do so, these storage accounts and file shares have to be imported as code in https://github.com/jenkins-infra/azure first.
The plan is to:

* Import as code the following storage accounts and file shares:
  * `prodpluginsite`, File Share: `pluginsite`
  * `prodjenkinsio`, File Share: `jenkinsio`
  * `prodjavadoc`, File Share: `javadoc`
  * `prodjenkinsreleases`, File Shares: `mirrorbits` & `website`
* Replace `blobxfer` by `azcopy` (cf refs above)
* Remove `blobxfer` from our builder image and VMs.

Note: if time permits, we should also replace `az storage` by `azcopy` in `infra.publishReports`: https://github.com/jenkins-infra/pipeline-library/blob/350e1b561b633ae204265ce89bca20edfce3c02f/vars/publishReports.groovy#L55-L59
> To replace blobxfer by azcopy for manipulating these file shares, SAS tokens need to be generated and stored as credentials (blobxfer uses the access key of the storage accounts, which is much less fine-grained). To do so, these storage accounts and file shares have to be imported as code in jenkins-infra/azure first.
Ideally a service principal / managed identity / workload identity should be used instead. SAS tokens are irrevocable and are worse than account keys in some ways.
Oh... I'll look into that instead then, thanks for the info @timja
This plan looks good and exhaustive. The only (blocking) point will be the credential: we need to document the exact kind of token required and the process to revoke it.
> Ideally a service principal / managed identity / workload identity should be used instead. SAS tokens are irrevocable and are worse than account keys in some ways
Unfortunately azcopy only supports SAS tokens for File Shares: https://learn.microsoft.com/en-gb/azure/storage/common/storage-use-azcopy-v10#authorize-azcopy
According to https://learn.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage?tabs=azure-portal#manually-rotate-access-keys and https://learn.microsoft.com/en-us/rest/api/storageservices/create-service-sas#revoke-a-sas, it's possible to revoke SAS tokens by rotating or regenerating the storage account access key.
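A sketch of that revocation path with the Azure CLI (the resource group and account names are placeholders I introduce here, not values from this thread):

```shell
# Revoking service SAS tokens by rotating the account key they were signed
# with. The actual az call is commented out since it needs real credentials:
#
#   az storage account keys renew \
#     --resource-group <resource-group> \
#     --account-name <storage-account> \
#     --key primary
#
# Every SAS signed with the old key becomes invalid immediately.
KEY_SLOT="primary"   # the key to regenerate; "secondary" is the other slot
echo "$KEY_SLOT"
```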
Depending on how the file share is being used you can still use a service principal to generate a SAS on demand: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-user-delegation-sas-create-cli
No progress yet, WIP on Azure Service Principal to authenticate with the Azure File Storage (as hinted in https://github.com/jenkins-infra/helpdesk/issues/3414#issuecomment-1856196680, SP can be used to generate short-term SAS token allowing access to SA)
If it's not possible then we have to confirm that SAS token can be revoked through expiration date with terraform.
Update:

I managed to generate a File Share SAS token with `az` authenticated via the service principal created in https://github.com/jenkins-infra/shared-tools/pull/131 and instantiated in https://github.com/jenkins-infra/azure/pull/557:

```shell
az storage share generate-sas --name updates-jenkins-io --account-name updatesjenkinsio --https-only --permissions dlrw --expiry 2024-01-19T00:00Z
```

Then I was able to use it with `azcopy` to query the `updates-jenkins-io` File Share from its URL https://updatesjenkinsio.file.core.windows.net/updates-jenkins-io/

I need to change the terraform module output to expose the service principal client id instead of the service principal id, as the client id is needed to authenticate `az` with the service principal. Then I'll be able to modify the scripts to generate SAS tokens with a short expiration date to use with `azcopy` instead of `blobxfer`.
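The token is then appended to the share URL as a query string when calling azcopy. A minimal sketch (the helper name `build_share_url` is mine, not from the actual scripts):

```shell
# Hypothetical helper: build the File Share URL azcopy expects from the
# storage account name, the share name and a pre-generated SAS query string.
build_share_url() {
  local account="$1" share="$2" sas="$3"
  printf 'https://%s.file.core.windows.net/%s/?%s' "$account" "$share" "$sas"
}

# Usage sketch, with the SAS coming from the az command shown above
# (run after `az login --service-principal`):
#   SAS="$(az storage share generate-sas --name updates-jenkins-io \
#     --account-name updatesjenkinsio --https-only --permissions dlrw \
#     --expiry 2024-01-19T00:00Z --output tsv)"
#   azcopy list "$(build_share_url updatesjenkinsio updates-jenkins-io "$SAS")"
build_share_url updatesjenkinsio updates-jenkins-io 'sig=REDACTED'
```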
As contributors.jenkins.io is also using a File Share SAS token, I'll use it as the first candidate as it's less risky/important than jenkins.io, the plugin site, javadoc or the mirror scripts.
For the record, per https://github.com/jenkins-infra/azure/pull/591#issuecomment-1904459009, SAS token revocation via its expiry date doesn't work.
Opened https://github.com/jenkins-infra/azure/pull/595 to create a stored access policy for the contributors.jenkins.io storage account that we can then pass as a parameter to `az` to create the SAS token, and easily revoke it via this policy if needed.
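A sketch of the stored-access-policy approach (the share and policy names below are illustrative assumptions, not the actual values from the PR):

```shell
# A stored access policy is attached to the file share; SAS tokens generated
# against it inherit its permissions/expiry, and deleting the policy revokes
# all of them at once, without rotating the whole account key.
SHARE="contributors-jenkins-io"   # assumed share name
POLICY="contributors-rw"          # assumed policy name

# Create the policy (commented out; requires az credentials):
#   az storage share policy create --share-name "$SHARE" --name "$POLICY" \
#     --permissions dlrw --expiry 2024-03-01T00:00Z
# Generate a SAS bound to the policy instead of inline permissions/expiry:
#   az storage share generate-sas --name "$SHARE" --policy-name "$POLICY" -o tsv
# Revoke everything issued against the policy:
#   az storage share policy delete --share-name "$SHARE" --name "$POLICY"
echo "$SHARE/$POLICY"
```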
Ref:
I'm struggling to use a service principal to interact with an Azure File Share.

After logging in with `az login --service-principal`, I can generate a SAS token for the File Share with `az storage share generate-sas`.

I ensured that the service principal has the correct role assignments on the File Share.

But when I then try to use this token with `azcopy`, it fails with the following error:

```
Failed to perform Auto-login: AzureCLICredential: WARNING: Could not retrieve credential from local cache for service principal 80b9a38c-7d8e-4d7f-a8ca-4ef1193d6693 under tenant common. Trying credential under tenant 4c45fef2-1ba7-4120-80a0-9e2d03e9c2b6, assuming that is an app credential. ERROR: AADSTS700016: Application with identifier '80b9a38c-7d8e-4d7f-a8ca-4ef1193d6693' was not found in the directory 'Microsoft'. This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant. You may have sent your authentication request to the wrong tenant. Trace ID: 6c4a5419-bae0-456c-a876-7e8a234be900 Correlation ID: 7a45204f-79d0-43ce-ab70-e85752baef1b Timestamp: 2024-01-17 18:02:14Z Interactive authentication is needed. Please run: az login
```
It seems we need to activate "Microsoft Entra Domain Services" to use service principal on file shares: https://learn.microsoft.com/en-us/azure/storage/files/storage-files-identity-auth-domain-services-enable?tabs=azure-portal
that's a shame 😢
Good news, it was a PEBCAK issue 😅
I successfully generated a SAS token from a service principal and listed the content of a File Share with it.
I now need to reproduce it as code.
As shown in the `az login --service-principal` output ("<...> no credentials provided in your command and environment, we will query for account key for your storage account."), the service principal needs to have the role "Storage Account Contributor".

As seen in its description, it "lets you manage storage accounts, including accessing storage account keys which provide full access to storage account data."

I tried several less privileged roles but `az storage share generate-sas` fails with any of them.
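For reference, a hedged sketch of granting that role (the scope and IDs are placeholders, not values from this thread):

```shell
# "Storage Account Contributor" lets the service principal read the account
# keys, which `az storage share generate-sas` uses to sign the token.
ROLE="Storage Account Contributor"

# Commented out; requires az credentials and real IDs:
#   az role assignment create \
#     --assignee "<service-principal-client-id>" \
#     --role "$ROLE" \
#     --scope "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
echo "$ROLE"
```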
yeah it'll be using the storage account key I expect
Update: contributors.jenkins.io replacement ready,
contributors.jenkins.io synchronized with success 🥳
I created a shared pipeline library function to encapsulate the SAS token generation:
Example of usage:
Update:

* `azcopy` and associated token generation logic (in puppet code)

Update on `azcopy` installation on puppet-managed systems:

* `azcopy` ensured on `pkg.origin.jenkins.io` by https://github.com/jenkins-infra/jenkins-infra/pull/3295 (was already managed on the `agent.trusted.ci.jenkins.io`)
* `azcopy` version is now tracked by `updatecli` and verified with a bump from `10.21.0` to `10.23.0`
New storage account and file share with its trusted.ci.jenkins.io service principal created in https://github.com/jenkins-infra/azure/pull/623
Looking at existing jenkins.io storage account, we noticed that most of its $70/month cost came from transactions. Using a Premium storage account instead with no transaction cost should allow us to decrease the monthly charge.
Compared to the existing "prodjenkinsio" storage account, this PR changes the new storage account replication type from "GRS" to "ZRS", sufficient for our use case.
Currently creating an initial copy manually from the existing file share to this new one.
Next step: replace blobxfer synchro by an azcopy one in https://github.com/jenkins-infra/jenkins.io/blob/master/Jenkinsfile#L89-L95
> Next step: replace blobxfer synchro by an azcopy one in https://github.com/jenkins-infra/jenkins.io/blob/master/Jenkinsfile#L89-L95
In parallel :) Until the webservice is migrated
> Looking at existing jenkins.io storage account, we noticed that most of its $70/month cost came from transactions. Using a Premium storage account instead with no transaction cost should allow us to decrease the monthly charge. Compared to the existing "prodjenkinsio" storage account, this PR changes the new storage account replication type from "GRS" to "ZRS", sufficient for our use case.
Unfortunately, using a Premium storage account implies a minimum size of 100 GiB for file shares, corresponding to a cost of $20/month ($0.20 * 100 GiB), cf https://azure.microsoft.com/en-us/pricing/details/storage/files/#pricing
Given this new information, we'll proceed with the migration of a file share to Premium only if its cost is above $20 per month.
Also, it seems that GRS isn't available in Premium.
> New storage account and file share with its trusted.ci.jenkins.io service principal created in jenkins-infra/azure#623
Corresponding AzureServiceprincipal credentials created on trusted.ci.jenkins.io
I've updated the jenkins.io pipeline to upload the generated content to the new `jenkinsio` storage account with azcopy, in parallel of blobxfer still uploading to the `prodjenkinsio` storage account.

After updating the jenkins.io helm chart to allow specifying different storage accounts depending on the version (international or Chinese) and to fix some resource name references, then putting the `jenkinsio` storage account credentials in charts-secrets (private repo), I've migrated jenkins.io to its new `jenkinsio` storage account without any service disruption.
I'll do a cleanup at the end of this help desk.
Upload time went from 3 minutes 10 seconds with blobxfer to 1 minute 55 seconds with azcopy 🎉
Its current storage account `prodpluginsite` costs around $70/month (similar to jenkins.io previous storage account cost):

* `pluginsite` storage account, File Share and service principal (infra.ci.jenkins.io)
* `azcopy` to upload generated content to the new storage in parallel
* `pluginsjenkinsio` storage account added in https://github.com/jenkins-infra/charts-secrets/commit/781ac38655f3ff2e4dc14863a0a7451eb63b2314 (private repo)
* Cleanup of `blobxfer` related elements

Upload time went from 10~11 minutes with blobxfer to less than 40~50 seconds with azcopy 🤯 🎉
Same case as above, javadoc current storage account is costing around $50/month and can benefit from a migration to a Premium storage account:

* `javadoc` storage account, File Share and service principal (trusted.ci.jenkins.io)
* `azcopy` to upload generated content to the new storage in parallel
* `javadoc` release to use the newly populated storage without service interruption
* `javadocjenkinsio` storage account added in https://github.com/jenkins-infra/charts-secrets/commit/1848e3807b694d57b6b9ebe0537aeaad6b150883 (private repo)
* Cleanup of `blobxfer` related elements

Upload time went from between 40 minutes and 1 hour 10 minutes with blobxfer to 11~13 minutes with azcopy 🥳
As noted in https://github.com/jenkins-infra/helpdesk/issues/3968#issuecomment-1973814916 we also need to replace any current long-lived SAS tokens by short-lived ones:
Addressed in https://github.com/jenkins-infra/crawler/pull/144
The last remaining repository using blobxfer is https://github.com/jenkins-infra/mirror-scripts/.
This repo contains scripts cloned in pkg.origin.jenkins.io VM and used to synchronize update-center data and Jenkins mirrors.
Opening a status to update these scripts this afternoon.
Update about the operation to replace `blobxfer` on the `pkg.origin.jenkins.io` VM:

* `azcopy` flags issue, fixed with https://github.com/jenkins-infra/mirror-scripts/commit/a11ce239f3cc87c7bb13563d61a0d60c44222fe0
* The `www-data` user running the `sync*.sh` scripts did not have the correct `archives` SSH private key in its `$HOME/.ssh` directory. Fixed temporarily by adding the proper SSH key for the `www-data` from the `mirrorbrain` user.

We had a bad surprise during the operation: as the azcopy MD5 mode with "HiddenFiles" (`--local-hash-storage-mode=HiddenFiles`) creates 1 hidden file for each file, we started to drown the VM under I/Os (and the mirror transfers, as more than 12k files were created and had to be transferred through either `rsync` or `azcopy`).
Trying to switch to `user_xattr` (as the disk is a standard ext4 with `xattr` support) to get better performance, we realized that the jenkins-infra/mirror-scripts can be run by distinct users on the VM (`www-data` from any trusted.ci jobs and `mirrorsync` for the full sync), which creates a LOT of concurrent accesses => this made the xattr fail as the owner of files changes over time.
We decided to unify the permissions and ownership on the VM (tested manually):

* `mirrorbrain` with read and write access
* `mirrorbrain` as SSH and Rsync user: https://github.com/jenkins-infra/update-center2/pull/770
* `www-data` with read-only access, to allow `httpd` (apache2) to serve files without any risk of tampering data

Required to change `rsync` calls (removing the `-g` and `-o` options from the `-a` and explicitly setting ownership with `--chown`, as we use rsync 3.1.x+ everywhere):
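The adjusted rsync invocation can be sketched like this (the paths and host are illustrative; the real calls live in the repositories linked above):

```shell
# -a expands to -rlptgoD: dropping -g and -o stops preserving the source
# owner/group, and --chown (rsync >= 3.1.0) forces ownership on the target.
RSYNC_OPTS="-rlptD --chown=mirrorbrain:www-data"

# Illustrative call (commented out; host and paths are placeholders):
#   rsync $RSYNC_OPTS /srv/releases/jenkins/ mirror-host:/srv/releases/jenkins/
echo "$RSYNC_OPTS"
```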
Then, we set up the permissions on the pkg
VM:
```shell
cd /srv/releases/jenkins
# -h to avoid dereferencing links
# Usage of 'find' is more efficient than recursive chown/chmod
find . -exec chown -h mirrorbrain:www-data {} \;
find . -type f -exec chmod 640 {} \;
find . -type l -exec chmod 640 {} \;
find . -type d -exec chmod 750 {} \;
# Repeat on /var/www
```
=> It did work BUT it broke OSUOSL: success in sending data to them, BUT their Apache2 and Rsync services always answered "403 Forbidden". => Which also broke archives.jenkins.io content updates (service up, serving data prior to the operation but not updated)
Update:
This change was correct regarding ownership, but OSUOSL support team told us (as per our request by mail) that the permissions for "others" should be read-only (instead of "no access" in the setup above) to allow apache2 and rsync to read files (and serve them).
=> We could think about such a setup (and let mirrorbrain own everything) but for now, setting file/link permissions to `644` and directories to `755` on both the scripts and the `pkg` VM solved the issue (along with the HUGE help from OSUOSL, who fixed all permissions on their systems before we even understood the issue: thanks!).
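The final scheme can be reproduced on a scratch directory (the real paths were /srv/releases/jenkins and /var/www; a temp dir is used here so the sketch is harmless to run):

```shell
# Apply the world-readable scheme described above: files 644, directories 755.
demo="$(mktemp -d)"
mkdir -p "$demo/plugins"
touch "$demo/plugins/jenkins.war"
find "$demo" -type f -exec chmod 644 {} \;
find "$demo" -type d -exec chmod 755 {} \;
stat -c '%a %n' "$demo/plugins/jenkins.war"
```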
However we see really bad performance with `azcopy` and storage, and we have some config-as-code persistence to do:

* azcopy generates a LOT of logs in `/srv/releases/.azcopy`: we need to purge this data one way or another (fewer logs? clean up?)

Additionally:
Update:

* One of the numerous `azcopy` processes we are running seems to flatten the directory hierarchy, as per https://github.com/jenkins-infra/helpdesk/issues/4013

Update:
Everything should be working as usual now. We keep the status.jenkins.io issue opened for the next 24h to ensure we don't have unexpected behaviors.
What did we do:
update-center uses it with "time based" comparison and the whole build is around 5 min 15 s, which is way above what we expect.

Switching to `azcopy copy` with time-based comparison decreased the impact of the copy to get.jenkins.io: the jenkins-infra/update-center2 execution baseline is now under the 5 min threshold (4min15s to 4min45s on the past builds).
The "full" sync takes more than 30 min, which ends in the Azure token expiring. It is using MD5 with xattr.

Switching to `azcopy copy` with time-based comparison had the same performance effect: we see the "full sync" taking 2 min up to 5 min. Drastic improvement.
> azcopy generates a LOT of logs in /srv/releases/.azcopy: we need to purge this data one way or another (fewer logs? clean up?)
Solved by decreasing the scanning logs produced by `azcopy` operations (see https://github.com/jenkins-infra/mirror-scripts/pull/22 and https://github.com/jenkins-infra/mirror-scripts/pull/23), along with introducing a regular cleanup (https://github.com/jenkins-infra/mirror-scripts/pull/19).
> Persist permission sets in Puppet

* `www-data` in both `/srv/releases/jenkins` and `/var/www` => only the `www-data` `.ssh` directory is found as expected.

> We must search and check how the core/package process handles the copy to the remote pkg (jenkins-infra/release and jenkinsci/packaging)

Checked: all the "Core Release" and "Core package" processes were already using the `mirrorbrain` user to connect to the `pkg` VM. As seen above, no permission issues after today's weekly Core release 2.451.
> One of the numerous azcopy processes we are running seems to flatten the directory hierarchy as per https://github.com/jenkins-infra/helpdesk/issues/4013
Fixed by the change to `azcopy copy` along with adding a trailing `/*` on the source (not working with `azcopy sync`). See https://github.com/jenkins-infra/mirror-scripts/pull/23
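A sketch of the fixed invocation (the account, share and SAS are placeholders, and `--overwrite=ifSourceNewer` is one possible value I assume here, not necessarily the one used in the PR):

```shell
# `azcopy copy` with a trailing /* copies the *contents* of the source
# directory, preserving the hierarchy below it at the destination
# (the previous invocation flattened it, see helpdesk issue 4013 above).
SRC='/srv/releases/jenkins/*'

# Commented out; requires azcopy and a valid SAS:
#   azcopy copy "$SRC" \
#     "https://<account>.file.core.windows.net/<share>/?<SAS>" \
#     --recursive --overwrite=ifSourceNewer
echo "$SRC"
```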
> OSUOSL is back (Apache/rsync) but still not visible on the get.jenkins.io mirrors: we have to check the scanning messages (and enable it if needed)
Fixed with the following set of events:

* Restarted the `mirrorbits` pods of `get.jenkins.io` one after the other, to ensure any stuck scanning routines are killed.

```shell
# kubectl exec to a mirrorbits pod of get.jenkins.io
$ mirrorbits list # Retrieve the mirror ID
$ mirrorbits disable <mirror ID>
$ mirrorbits scan -enable <mirror ID>
```
=> after 3-4 minutes, the new core release 2.451 (which was already present since 1 hour in OSUOSL servers) was detected by mirrorbits on get.jenkins.io: https://get.jenkins.io/windows/2.451/jenkins.msi?mirrorlist
Next steps:

* Remove `blobxfer` from everywhere (including puppet)
* `ssh` configuration for the `www-data` user

Update:
More errors were reported around the update center builds: https://github.com/jenkins-infra/helpdesk/issues/4016. Plugins were available on `updates.jenkins.io` but never delivered to the `get.jenkins.io` mirroring system.

* `azcopy` breaks shell stdin, causing the loop over freshly released plugins to not run as expected. Fixed with a shell hack (see issue for details).
* `sync-recent-release.sh` script: https://github.com/jenkins-infra/mirror-scripts/commit/e04e9a0fdb1185c4856f03b3b370a827d1168aa8
* In the point above, the flag `--overwrite=` was removed, causing slower `azcopy` commands. We've added it back in https://github.com/jenkins-infra/mirror-scripts/commit/a99b89cb820ea9808f5786db800670cdcd133c61: update center builds are now averaging ~4min30s (instead of 5min).
The TTL of the Azure Storage account SAS token used by `sync.sh` has been rolled back to 30 minutes - https://github.com/jenkins-infra/mirror-scripts/pull/24

* The `sync.sh` script is running in less than 30 minutes. However the benefits are low: it serves its purpose (TTL) so let's keep it like this.
* The script `sync.sh` was failing to execute on the `pkg` VM when run by the crontab on the VM:
  * the crontab has a more restrictive `PATH` variable than shell sessions: a hotfix to include `/usr/local/bin` in https://github.com/jenkins-infra/mirror-scripts/commit/674697affcd4bb91173ef170d7b93cfc3a43af7b fixed the issue
  * `azcopy` hack in `sync.sh`: https://github.com/jenkins-infra/mirror-scripts/commit/59cb5fbbc4550fe0422af26bf75733669695c09e and https://github.com/jenkins-infra/mirror-scripts/commit/9f53853755334b0d70da0431615a9dca0b4c3a4e

It's been 24 hours without any errors reported around plugin releases, mirror synchronization (script `sync.sh`) or update_center2 updates => we're going to close the status.jenkins.io message.
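The crontab PATH pitfall above can be sketched as follows (the exact hotfix is in the commit linked above; this just illustrates the mechanism):

```shell
# cron runs jobs with a minimal PATH, so a binary installed in /usr/local/bin
# (like azcopy) is not found even though interactive shells resolve it fine.
# Prepending the directory inside the script fixes it:
export PATH="/usr/local/bin:${PATH}"
echo "PATH now starts with: ${PATH%%:*}"
```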
Next Steps (updated after syncing with the team):

The following Puppet changes are needed to close this issue:

* Manage `/srv/releases/.azure-storage-env` (with the storage credentials, e.g. sensitive values) as it is not managed today

The following Puppet changes should be done after this issue:

* Manage the `mirrorbrain` user again - https://github.com/jenkins-infra/jenkins-infra/issues/3351 (and https://github.com/jenkins-infra/helpdesk/issues/2970)
* `www-data` user cleanup - https://github.com/jenkins-infra/jenkins-infra/issues/3353
* Cleanup of SA tokens
PRs for cleanup of `blobxfer` on the `pkg` VM:

* [cleanup(pkgrepo) remove last remnants of blobxfer jenkins-infra#3354](https://github.com/jenkins-infra/jenkins-infra/pull/3354)
* [cleanup: remove last remnants of blobxfer mirror-scripts#25](https://github.com/jenkins-infra/mirror-scripts/pull/25)

Deployed with success and ran the cleanup process. Watching update center and `sync.sh`
Looks good!
Managing the `mirrorbrain` user (crontab, user, ssh keys, etc.) and the `.azure-storage-env` file in https://github.com/jenkins-infra/jenkins-infra/pull/3357
Update: https://github.com/jenkins-infra/jenkins-infra/pull/3357#issuecomment-2026766564 was successfully deployed.
@lemeurherve I'm handing over to you for the SA token cleanup you mentioned as I have no idea what to cleanup and this is the last step before closing this issue.
Here are the last cleanup tasks remaining before we can close this issue:

* `azurerm_storage_account_sas.get_jenkins_io` data & corresponding outputs: https://github.com/jenkins-infra/azure/blob/751ac5d58f4008b81ebc76f36f41611388e5bca5/get.jenkins.io.tf#L53-L95
* `contributors_jenkins_io_share_url` output: https://github.com/jenkins-infra/azure/blob/751ac5d58f4008b81ebc76f36f41611388e5bca5/contributors.jenkins.io.tf#L39-L41
* `CONTRIBUTORS_JENKINS_IO_FILESHARE_SAS_QUERYSTRING` infra.ci.jenkins.io secrets in jenkins-infra/charts-secrets (private repo)
* `jenkinsio` file share of the `prodjenkinsio` storage account (there are two other file shares `zhjenkinsio` & `cnjenkinsio` that we must keep for now until https://github.com/jenkins-infra/helpdesk/issues/3379 is started), replaced by the `jenkins-io` file share in the `jenkinsio` storage account
* `BLOBXFER_STORAGEACCOUNTKEY` infra.ci.jenkins.io secrets in jenkins-infra/charts-secrets (private repo)
* `BLOBXFER_STORAGEACCOUNTKEY` credentials in trusted.ci.jenkins.io
* `prodpluginsite` storage account and its `pluginsite` file share, replaced by the `pluginsjenkinsio` storage account
* `PLUGINSITE_STORAGEACCOUNTKEY` infra.ci.jenkins.io secrets in jenkins-infra/charts-secrets (private repo)
* `PLUGINSITE_STORAGEACCOUNTKEY` credentials in trusted.ci.jenkins.io
* `prodjavadoc` storage account and its `javadoc` file share, replaced by the `javadocjenkinsio` storage account
* `JAVADOC_STORAGEACCOUNTKEY` credentials in trusted.ci.jenkins.io
* `updates-jenkins-io-file-share-sas-token-query-string` credentials in trusted.ci.jenkins.io
* `AZURE_STORAGE_ACCOUNT` replaced by `STORAGE_NAME` in jenkins-infra/jenkins-infra
blobxfer completely replaced by azcopy, cleanup done, all concerned jobs green, closing this issue.
As we've encountered some issues with blobxfer recently (#3411), and as its last release is quite dated, check if we could replace it with an `az-cli` command like what's done in the pipeline library: https://github.com/jenkins-infra/pipeline-library/blob/93b13be5d876d90d8cd145b11c9f9fe457239db9/vars/publishReports.groovy#L55-L59

Related:

* #3034
* #3100
* #3411