jenkins-infra / helpdesk

Open your Infrastructure-related issues here for the Jenkins project
https://github.com/jenkins-infra/helpdesk/issues/new/choose

Check if we could replace `blobxfer` by `azcopy` #3414

Closed lemeurherve closed 6 months ago

lemeurherve commented 1 year ago

As we've encountered some issues with blobxfer recently (#3411), and as its last release is quite dated, check if we could replace it with an az-cli command like what's done in the pipeline library: https://github.com/jenkins-infra/pipeline-library/blob/93b13be5d876d90d8cd145b11c9f9fe457239db9/vars/publishReports.groovy#L55-L59

Related:

lemeurherve commented 10 months ago

We'll use https://github.com/Azure/azure-storage-azcopy.

lemeurherve commented 10 months ago

Current references to `blobxfer` (excluding some updatecli manifests and the like):

Corresponding storage accounts:

To replace `blobxfer` with `azcopy` for manipulating these file shares, SAS tokens need to be generated and stored as credentials (`blobxfer` uses the access key of the storage accounts, which is much less fine-grained). To do so, these storage accounts and file shares first have to be imported as code in https://github.com/jenkins-infra/azure.
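
For illustration, a minimal sketch of what such an import could look like from the CLI (the Terraform resource addresses and IDs below are placeholders, not the actual jenkins-infra/azure names):

```bash
# Hypothetical import of an existing storage account and its file share
# into Terraform state; replace addresses and IDs with the real ones.
terraform import azurerm_storage_account.updatesjenkinsio \
  "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/updatesjenkinsio"
terraform import azurerm_storage_share.updates_jenkins_io \
  "https://updatesjenkinsio.file.core.windows.net/updates-jenkins-io"
```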

The plan is to:

Note: if time permits, we should also replace `az storage` with `azcopy` in `infra.publishReports`: https://github.com/jenkins-infra/pipeline-library/blob/350e1b561b633ae204265ce89bca20edfce3c02f/vars/publishReports.groovy#L55-L59

timja commented 10 months ago

To replace `blobxfer` with `azcopy` for manipulating these file shares, SAS tokens need to be generated and stored as credentials (`blobxfer` uses the access key of the storage accounts, which is much less fine-grained). To do so, these storage accounts and file shares first have to be imported as code in jenkins-infra/azure.

Ideally a service principal / managed identity / workload identity should be used instead. SAS tokens are irrevocable and are worse than account keys in some ways.

lemeurherve commented 10 months ago

Oh... I'll look into that instead then, thanks for the info @timja

dduportal commented 10 months ago

This plan looks good and exhaustive. The only (blocking) point will be the credential: we need to document the exact kind of token required and the process to revoke it.

lemeurherve commented 10 months ago

Ideally a service principal / managed identity / workload identity should be used instead. SAS tokens are irrevocable and are worse than account keys in some ways.

Unfortunately `azcopy` supports only SAS tokens for File Shares: https://learn.microsoft.com/en-gb/azure/storage/common/storage-use-azcopy-v10#authorize-azcopy

According to https://learn.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage?tabs=azure-portal#manually-rotate-access-keys and https://learn.microsoft.com/en-us/rest/api/storageservices/create-service-sas#revoke-a-sas, it's possible to revoke SAS tokens by rotating or regenerating the storage account access key.
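
For reference, a sketch of the rotation that revokes such tokens (names are placeholders); every SAS signed with the regenerated key becomes invalid:

```bash
# Regenerate the primary access key of the storage account, invalidating
# all SAS tokens that were signed with it.
az storage account keys renew \
  --resource-group <resource-group> \
  --account-name <storage-account> \
  --key primary
```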

timja commented 10 months ago

Depending on how the file share is being used you can still use a service principal to generate a SaS on demand: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-user-delegation-sas-create-cli
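
A minimal sketch of the approach from that doc, assuming a service principal that is allowed to request a user delegation key (note this path works for Blob Storage, not File Shares):

```bash
# Log in as the service principal (env var names are assumptions)...
az login --service-principal \
  --username "$AZURE_CLIENT_ID" \
  --password "$AZURE_CLIENT_SECRET" \
  --tenant "$AZURE_TENANT_ID"

# ...then mint a short-lived user delegation SAS for a blob container.
az storage container generate-sas \
  --account-name <storage-account> \
  --name <container> \
  --permissions lr \
  --expiry "$(date -u -d '+1 hour' +%Y-%m-%dT%H:%MZ)" \
  --auth-mode login \
  --as-user \
  --output tsv
```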

lemeurherve commented 9 months ago

No progress yet; WIP on an Azure service principal to authenticate with the Azure File Storage (as hinted in https://github.com/jenkins-infra/helpdesk/issues/3414#issuecomment-1856196680, a service principal can be used to generate short-lived SAS tokens granting access to the storage account).

If that's not possible, then we have to confirm that SAS tokens can be revoked through their expiration date with Terraform.

lemeurherve commented 9 months ago

Update:

I managed to generate a File Share SAS token with `az` authenticated via the service principal created in https://github.com/jenkins-infra/shared-tools/pull/131 and instantiated in https://github.com/jenkins-infra/azure/pull/557: `az storage share generate-sas --name updates-jenkins-io --account-name updatesjenkinsio --https-only --permissions dlrw --expiry 2024-01-19T00:00Z`

Then I was able to use it with azcopy to query the updates-jenkins-io File Share from its URL https://updatesjenkinsio.file.core.windows.net/updates-jenkins-io/

I need to change the Terraform module output to expose the service principal client ID instead of the service principal ID, as the client ID is needed to authenticate `az` with the service principal.

Then I'll be able to modify the scripts to generate SAS tokens with a short expiration date to use with `azcopy` instead of `blobxfer`, along the lines of the sketch below.
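
Roughly the flow those scripts should end up with (env var names and the local path are assumptions):

```bash
# Authenticate az as the service principal.
az login --service-principal \
  --username "$AZURE_CLIENT_ID" \
  --password "$AZURE_CLIENT_SECRET" \
  --tenant "$AZURE_TENANT_ID"

# Mint a SAS token valid for one hour only.
SAS="$(az storage share generate-sas \
  --name updates-jenkins-io \
  --account-name updatesjenkinsio \
  --https-only \
  --permissions dlrw \
  --expiry "$(date -u -d '+1 hour' +%Y-%m-%dT%H:%MZ)" \
  --output tsv)"

# Hand the short-lived token to azcopy.
azcopy sync ./content \
  "https://updatesjenkinsio.file.core.windows.net/updates-jenkins-io/?${SAS}"
```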

lemeurherve commented 9 months ago

As contributors.jenkins.io also uses a File Share SAS token, I'll use it as the first candidate since it's less risky/critical than jenkins.io, the plugin site, javadoc, or the mirror scripts.

lemeurherve commented 8 months ago

For the record, https://github.com/jenkins-infra/azure/pull/591#issuecomment-1904459009

SAS token revocation via its expiry date doesn't work.

lemeurherve commented 8 months ago

Opened https://github.com/jenkins-infra/azure/pull/595 to create a stored access policy for the contributors.jenkins.io storage account that we can then pass as a parameter to `az` when creating the SAS token, and easily revoke the token via this policy if needed.
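
A sketch of that pattern (policy, share, and account names are placeholders): the SAS references the policy, so deleting or expiring the policy revokes every token issued against it.

```bash
# Create a revocable stored access policy on the file share...
az storage share policy create \
  --account-name <storage-account> \
  --share-name <file-share> \
  --name <policy-name> \
  --permissions dlrw \
  --expiry 2024-03-01T00:00Z

# ...and issue SAS tokens against the policy instead of inline permissions.
az storage share generate-sas \
  --account-name <storage-account> \
  --name <file-share> \
  --policy-name <policy-name> \
  --https-only \
  --output tsv

# Revocation: delete the policy, and all SAS tokens issued from it die with it.
az storage share policy delete \
  --account-name <storage-account> \
  --share-name <file-share> \
  --name <policy-name>
```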

Ref:

lemeurherve commented 8 months ago

Related:

lemeurherve commented 8 months ago

I'm struggling to use a service principal to interact with an Azure File Share.

After logging in with `az login --service-principal`, I can generate a SAS token for the File Share with `az storage share generate-sas`. I ensured that the service principal has the correct role assignments on the File Share, but when I then try to use this token with `azcopy`, it fails with the following error:

```
Failed to perform Auto-login: AzureCLICredential: WARNING: Could not retrieve credential from local cache for service principal 80b9a38c-7d8e-4d7f-a8ca-4ef1193d6693 under tenant common. Trying credential under tenant 4c45fef2-1ba7-4120-80a0-9e2d03e9c2b6, assuming that is an app credential.
ERROR: AADSTS700016: Application with identifier '80b9a38c-7d8e-4d7f-a8ca-4ef1193d6693' was not found in the directory 'Microsoft'. This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant. You may have sent your authentication request to the wrong tenant.
Trace ID: 6c4a5419-bae0-456c-a876-7e8a234be900
Correlation ID: 7a45204f-79d0-43ce-ab70-e85752baef1b
Timestamp: 2024-01-17 18:02:14Z
Interactive authentication is needed. Please run:
az login
```

It seems we need to activate "Microsoft Entra Domain Services" to use a service principal on file shares: https://learn.microsoft.com/en-us/azure/storage/files/storage-files-identity-auth-domain-services-enable?tabs=azure-portal

timja commented 8 months ago

that's a shame 😢

lemeurherve commented 8 months ago

Good news, it was a PEBCAK issue 😅

I successfully generated a SAS token from a service principal and listed the content of a File Share with it.

lemeurherve commented 8 months ago

Now I need to reproduce it as code.

lemeurherve commented 8 months ago

As shown in the `az login --service-principal` output ("<...> no credentials provided in your command and environment, we will query for account key for your storage account."), the service principal needs to have the "Storage Account Contributor" role.

As its description says, this role "lets you manage storage accounts, including accessing storage account keys which provide full access to storage account data."
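
For the record, granting that role would look something like this (IDs are placeholders):

```bash
# Grant the service principal the role that lets az fetch the account key,
# which `az storage share generate-sas` relies on here.
az role assignment create \
  --assignee "$AZURE_CLIENT_ID" \
  --role "Storage Account Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
```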

lemeurherve commented 8 months ago

I tried several less privileged roles, but `az storage share generate-sas` fails with all of them.

timja commented 8 months ago

yeah it'll be using the storage account key I expect

lemeurherve commented 8 months ago

Update: contributors.jenkins.io replacement ready.

lemeurherve commented 8 months ago

contributors.jenkins.io synchronized with success 🥳

lemeurherve commented 8 months ago

I created a shared pipeline library function to encapsulate the SAS token generation:

Example of usage:

dduportal commented 8 months ago

Update:

dduportal commented 8 months ago

Update on azcopy installation on puppet-managed systems:

lemeurherve commented 7 months ago

New storage account and file share with its trusted.ci.jenkins.io service principal created in https://github.com/jenkins-infra/azure/pull/623

Looking at the existing jenkins.io storage account, we noticed that most of its $70/month cost came from transactions. Using a Premium storage account instead, which has no transaction cost, should allow us to decrease the monthly charge.

Compared to the existing "prodjenkinsio" storage account, this PR changes the replication type of the new storage account from GRS to ZRS, which is sufficient for our use case.

Currently creating an initial copy manually from the existing file share to this new one.

Next step: replace the `blobxfer` synchronization with an `azcopy` one in https://github.com/jenkins-infra/jenkins.io/blob/master/Jenkinsfile#L89-L95

dduportal commented 7 months ago

Next step: replace the `blobxfer` synchronization with an `azcopy` one in https://github.com/jenkins-infra/jenkins.io/blob/master/Jenkinsfile#L89-L95

In parallel :) Until the webservice is migrated

lemeurherve commented 7 months ago

Looking at the existing jenkins.io storage account, we noticed that most of its $70/month cost came from transactions. Using a Premium storage account instead, which has no transaction cost, should allow us to decrease the monthly charge. Compared to the existing "prodjenkinsio" storage account, this PR changes the replication type of the new storage account from GRS to ZRS, which is sufficient for our use case.

Unfortunately, using a Premium storage account implies a minimum size of 100 GiB per file share, corresponding to a cost of $20/month ($0.20 * 100 GiB), cf. https://azure.microsoft.com/en-us/pricing/details/storage/files/#pricing


Given this new information, we'll proceed with migrating a file share to Premium only if its cost is above $20 per month.

lemeurherve commented 7 months ago

Also, it seems that GRS isn't available for Premium storage accounts.

lemeurherve commented 7 months ago

New storage account and file share with its trusted.ci.jenkins.io service principal created in jenkins-infra/azure#623

Corresponding Azure Service Principal credentials created on trusted.ci.jenkins.io.

lemeurherve commented 7 months ago

I've updated the jenkins.io pipeline to upload the generated content to the new jenkinsio storage account with `azcopy`, in parallel with `blobxfer` still uploading to the prodjenkinsio storage account.

After updating the jenkins.io Helm chart to allow specifying different storage accounts depending on the version (international or Chinese) and to fix some resource name references, then putting the jenkinsio storage account credentials in charts-secrets (private repo), I've migrated jenkins.io to its new jenkinsio storage account without any service disruption.

I'll do a cleanup at the end of this helpdesk issue.

Upload time went from 3 minutes 10 seconds with blobxfer to 1 minute 55 seconds with azcopy 🎉

lemeurherve commented 7 months ago

plugin-site

Its current storage account, prodpluginsite, costs around $70/month (similar to the previous jenkins.io storage account cost).


Upload time went from 10-11 minutes with `blobxfer` to less than a minute (40-50 seconds) with `azcopy` 🤯 🎉

lemeurherve commented 7 months ago

javadoc

Same case as above: javadoc's current storage account costs around $50/month and can benefit from a migration to a Premium storage account.


Upload time went from between 40 minutes and 1 hour 10 minutes with `blobxfer` to 11-13 minutes with `azcopy` 🥳

lemeurherve commented 7 months ago

As noted in https://github.com/jenkins-infra/helpdesk/issues/3968#issuecomment-1973814916, we also need to replace any current long-lived SAS tokens with short-lived ones:

lemeurherve commented 7 months ago

As noted in #3968 (comment), we also need to replace any current long-lived SAS tokens with short-lived ones:

Addressed in https://github.com/jenkins-infra/crawler/pull/144

lemeurherve commented 6 months ago

The last remaining repository using blobxfer is https://github.com/jenkins-infra/mirror-scripts/.

This repo contains the scripts cloned onto the pkg.origin.jenkins.io VM and used to synchronize update-center data and the Jenkins mirrors.

Opening a status.jenkins.io issue to track updating these scripts this afternoon.

dduportal commented 6 months ago

Update about the operation to replace blobxfer on the pkg.origin.jenkins.io VM:


We had a bad surprise during the operation: since `azcopy`'s MD5 mode with "HiddenFiles" (`--local-hash-storage-mode=HiddenFiles`) creates one hidden file per file, we started to drown the VM under I/Os (and slowed the mirror transfers, as more than 12k extra files were created and had to be transferred through either rsync or azcopy).

Trying to switch to user_xattr (the disk is a standard ext4 with xattr support) to get better performance, we realized that the jenkins-infra/mirror-scripts can be run by distinct users on the VM (www-data from any trusted.ci job and mirrorsync for the full sync), which creates a LOT of concurrent access => this made the xattr approach fail, as the ownership of files changes over time.
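
For context, the two hash-storage behaviours discussed above map to azcopy flags along these lines (URLs are placeholders; treat the exact flag values as assumptions based on the azcopy v10 docs):

```bash
# MD5 comparison storing one hidden companion file per synced file:
# this is what drowned the VM in I/O.
azcopy sync /srv/releases/jenkins \
  "https://<account>.file.core.windows.net/<share>?<sas>" \
  --compare-hash=MD5 --local-hash-storage-mode=HiddenFiles

# MD5 comparison storing hashes in extended attributes instead; this
# requires stable file ownership, which the multi-user setup broke.
azcopy sync /srv/releases/jenkins \
  "https://<account>.file.core.windows.net/<share>?<sas>" \
  --compare-hash=MD5 --local-hash-storage-mode=XAttr
```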


We decided to unify the permissions and ownership on the VM (tested manually)

This required changing the rsync calls (removing the -g and -o behaviors implied by -a, and explicitly setting ownership with `--chown`, as we use rsync 3.1.x+ everywhere), as in the sketch below:
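
A sketch of the adjusted call (host and paths are placeholders):

```bash
# -rlptD is -a without its -g (group) and -o (owner) parts; --chown
# (rsync >= 3.1.0) forces ownership on the receiving side instead.
rsync -rlptD --chown=mirrorbrain:www-data \
  /srv/releases/jenkins/ <mirror-host>:/srv/releases/jenkins/
```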

Then, we set up the permissions on the pkg VM:

```bash
cd /srv/releases/jenkins
# -h to avoid dereferencing links
# Usage of 'find' is more efficient than a recursive chown/chmod
find . -exec chown -h mirrorbrain:www-data {} \;
find . -type f -exec chmod 640 {} \;
find . -type l -exec chmod 640 {} \;
find . -type d -exec chmod 750 {} \;

# Repeat on /var/www
```

=> It did work BUT it broke OSUOSL: we could send data to them successfully, BUT their Apache2 and rsync services always answered "403 Forbidden". => This also broke archives.jenkins.io content updates (service up, serving data from before the operation, but not updated).

dduportal commented 6 months ago

Update:

```bash
cd /srv/releases/jenkins
# -h to avoid dereferencing links
# Usage of 'find' is more efficient than a recursive chown/chmod
find . -exec chown -h mirrorbrain:www-data {} \;
find . -type f -exec chmod 640 {} \;
find . -type l -exec chmod 640 {} \;
find . -type d -exec chmod 750 {} \;

# Repeat on /var/www
```

=> It did work BUT it broke OSUOSL: we could send data to them successfully, BUT their Apache2 and rsync services always answered "403 Forbidden". => This also broke archives.jenkins.io content updates (service up, serving data from before the operation, but not updated).

This change was correct regarding ownership, but the OSUOSL support team told us (in response to our email) that the permissions for "others" should be read-only (instead of "no access" in the setup above) to allow Apache2 and rsync to read the files (and serve them).

=> We could think about such a setup (and let mirrorbrain own everything), but for now, setting file/link permissions to 644 and directories to 755 on both the scripts and the pkg VM solved the issue (along with HUGE help from OSUOSL, who fixed all permissions on their systems before we even understood the issue: thanks!).
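
Concretely, the permission pass that satisfied OSUOSL looks like this (same find pattern as above, run from /srv/releases/jenkins and /var/www):

```bash
# World-readable files and links, world-traversable directories.
find . -type f -exec chmod 644 {} \;
find . -type l -exec chmod 644 {} \;
find . -type d -exec chmod 755 {} \;
```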

However, we still see really bad performance with azcopy and the storage, and we have some config-as-code persistence work to do:

dduportal commented 6 months ago

Additionally:

dduportal commented 6 months ago

Update:

dduportal commented 6 months ago

Update:

Everything should be working as usual now. We'll keep the status.jenkins.io issue open for the next 24h to ensure we don't see unexpected behaviors.

What we did:

update-center uses it with "time based" comparison, and the whole build takes around 5 min 15 s, which is way above what we expect.

Switching to `azcopy copy` with time-based comparison decreased the impact of the copy to get.jenkins.io: the jenkins-infra/update-center2 execution baseline is now under the 5 min threshold (4 min 15 s to 4 min 45 s over the past builds).

The "full" sync takes more than 30 min which ends in the Azure token expiring. It is using MD5 with xattr.

Switching to `azcopy copy` with time-based comparison had the same performance effect: we now see the "full sync" taking 2 to 5 min. A drastic improvement.

azcopy generates a LOT of logs in /srv/releases/.azcopy: we need to purge this data one way or another (fewer logs? cleanup?)

Solved by decreasing the scanning logs produced by azcopy operations (see https://github.com/jenkins-infra/mirror-scripts/pull/22 and https://github.com/jenkins-infra/mirror-scripts/pull/23), along with introducing a regular cleanup (https://github.com/jenkins-infra/mirror-scripts/pull/19)
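
The shape of that mitigation, with the exact flag values being assumptions based on the azcopy v10 docs:

```bash
# Quieter logging for routine runs...
azcopy copy <source> <destination> --log-level=ERROR

# ...plus a periodic cleanup of logs and plan files from finished jobs.
azcopy jobs clean --with-status=Completed
```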

Persist the permission sets in Puppet

We must check how the core/package process handles the copy to the remote pkg VM (jenkins-infra/release and jenkinsci/packaging).

Checked: all the "Core Release" and "Core package" processes were already using the mirrorbrain user to connect to the pkg VM. As seen above, no permission issues after today's weekly Core release 2.451.

One of the numerous azcopy invocations we run seems to flatten the directory hierarchy, as per https://github.com/jenkins-infra/helpdesk/issues/4013

Fixed by the change to `azcopy copy` along with adding a trailing /* to the source (this doesn't work with `azcopy sync`). See https://github.com/jenkins-infra/mirror-scripts/pull/23 and the sketch below.
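
The resulting invocation looks roughly like this (URL is a placeholder):

```bash
# The trailing /* keeps the source directory layout under the destination;
# --overwrite=ifSourceNewer provides the time-based comparison.
azcopy copy "/srv/releases/jenkins/*" \
  "https://<account>.file.core.windows.net/<share>/?<sas>" \
  --recursive --overwrite=ifSourceNewer
```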

OSUOSL is back (Apache/rsync) but still not visible among the get.jenkins.io mirrors: we have to check the scanning messages (and enable scanning if needed)

Fixed with the following set of events:

```bash
# kubectl exec into a mirrorbits pod of get.jenkins.io
$ mirrorbits list                 # retrieve the mirror ID
$ mirrorbits disable <mirror ID>
$ mirrorbits scan -enable <mirror ID>
```

=> After 3-4 minutes, the new core release 2.451 (which had already been present on the OSUOSL servers for an hour) was detected by mirrorbits on get.jenkins.io: https://get.jenkins.io/windows/2.451/jenkins.msi?mirrorlist


Next steps:

dduportal commented 6 months ago

Update:

Next Steps (updated after syncing with the team):

dduportal commented 6 months ago

PRs for cleanup of blobxfer on the pkg VM:

dduportal commented 6 months ago

PRs for cleanup of blobxfer on the pkg VM:

* [cleanup(pkgrepo) remove last remnants of blobxfer jenkins-infra#3354](https://github.com/jenkins-infra/jenkins-infra/pull/3354)

* [cleanup: remove last remnants of blobxfer mirror-scripts#25](https://github.com/jenkins-infra/mirror-scripts/pull/25)

Deployed successfully and ran the cleanup process. Watching update-center and sync.sh.

dduportal commented 6 months ago

PRs for cleanup of blobxfer on the pkg VM:

* [cleanup(pkgrepo) remove last remnants of blobxfer jenkins-infra#3354](https://github.com/jenkins-infra/jenkins-infra/pull/3354)

* [cleanup: remove last remnants of blobxfer mirror-scripts#25](https://github.com/jenkins-infra/mirror-scripts/pull/25)

Deployed successfully and ran the cleanup process. Watching update-center and sync.sh.

Looks good!

dduportal commented 6 months ago

Managing the mirrorbrain user (crontab, SSH keys, etc.) and the .azure-storage-env file in https://github.com/jenkins-infra/jenkins-infra/pull/3357

dduportal commented 6 months ago

Update: https://github.com/jenkins-infra/jenkins-infra/pull/3357#issuecomment-2026766564 was successfully deployed.

@lemeurherve I'm handing over to you for the SAS token cleanup you mentioned, as I have no idea what to clean up, and this is the last step before closing this issue.

lemeurherve commented 6 months ago

@lemeurherve I'm handing over to you for the SAS token cleanup you mentioned, as I have no idea what to clean up, and this is the last step before closing this issue.

Here are the last cleanup tasks remaining before we can close this issue:

lemeurherve commented 6 months ago

blobxfer completely replaced by azcopy, cleanup done, all concerned jobs green, closing this issue.