Open alexjba opened 3 months ago
Adding both ui-team and backend-team labels. I guess any team could take this
@noeliaSD @jrainville Feel free to update the tags if needed
@Seitseman @alaibe You're the only Windows devs I know of. Do you also experience this slow build process?
Adding @Cuteivist
I know there are some ugly hacks on Windows where we need to clean everything before each run, because otherwise subsequent jobs would fail. @jakubgs probably knows more.
Indeed we do wipe the whole workspace after every build due to various issues: https://github.com/status-im/status-desktop/blob/146a6e85018bd7e8cf9a68735296a8ca2220faa8/ci/Jenkinsfile.windows#L121-L124
But checkout takes around 3 minutes only:
What takes about 8 minutes is `make update`:
07:04:53 + make update
07:08:18 Building: Nim compiler
07:12:13 + make deps
07:12:16 + make status-go
The build of the package itself takes about 12 minutes, which is about the same as on other platforms.
So the issue is in checking out all the submodules, not in the build.
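As a rough check of the numbers above, stage durations can be derived from the timestamps in the log. This is a minimal sketch, assuming each timestamp marks the start of the named stage and that the log does not cross midnight; the `LOG` list and the `stage_durations` helper are illustrative names, not part of the CI code.

```python
from datetime import datetime

# Timestamps copied from the Jenkins log above: (time, stage starting at that time).
LOG = [
    ("07:04:53", "make update"),
    ("07:12:13", "make deps"),
    ("07:12:16", "make status-go"),
]


def stage_durations(entries):
    """Return {stage: seconds until the next stage starts}.

    The last stage has no known end time in this excerpt, so it is skipped.
    """
    times = [datetime.strptime(stamp, "%H:%M:%S") for stamp, _ in entries]
    return {
        name: int((times[i + 1] - times[i]).total_seconds())
        for i, (_, name) in enumerate(entries[:-1])
    }


durations = stage_durations(LOG)
for stage, seconds in durations.items():
    print(f"{stage}: {seconds // 60}m{seconds % 60:02d}s")
```

With the excerpt above, `make update` comes out at a bit over 7 minutes, while `make deps` takes only a few seconds.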
In general I see two possibilities of improving the performance of Windows builds:
1. Reinstall the hosts with a separate ReFS partition.
2. Upgrade Git from the `2.28.0` version, which was locked due to an obscure build issue from 2021, to `2.45.2`.

Both have a possibility of improving Git checkout performance, while the first option could improve build performance itself.
Actually, the locking broke down a while ago on some hosts:
> a ci-slave-windows -m win_shell -o -a 'git version'
windows-03.he-eu-hel1.ci.devel | CHANGED | rc=0 | (stdout) git version 2.41.0.windows.1
windows-01.he-eu-hel1.ci.devel | CHANGED | rc=0 | (stdout) git version 2.28.0.windows.1
windows-02.he-eu-hel1.ci.devel | CHANGED | rc=0 | (stdout) git version 2.37.1.windows.1
windows-01.he-eu-hel1.ci.release | CHANGED | rc=0 | (stdout) git version 2.37.3.windows.1
I will attempt to upgrade on one host to see if it breaks anything as a first step.
So we might as well just drop that fix and upgrade all hosts to the recent `2.45.2` version.
I have opened an issue with Hetzner to inquire about reinstallation with separate ReFS partition:
Hello,
I'd like to know if it would be possible to reinstall our Windows Server 2019 hosts with a different partition layout?
We have found that when we use them for CI builds the Git performance is very bad, and we were thinking employing ReFS would help with this:
- https://github.com/MicrosoftDocs/windowsserverdocs/blob/main/WindowsServerDocs/storage/refs/refs-overview.md
- https://geekflare.com/implement-refs-file-system-in-windows-server/
- https://learn.microsoft.com/en-us/windows/dev-drive/
Would it be possible to do a layout where ~50 GB are allocated for C:/ system partition on NTFS and the rest for a D:/ data partition on ReFS?
Would this be possible while maintaining the RAID1 layout?
Cheers
Ticket: #2024060603027071
I have performed a filesystem benchmark on `windows-03` to have a baseline for future comparison with the ReFS filesystem:
This is not a great comparison since this is the `C:` drive, which is used by the system.
Hetzner support gave me actually useful information after some probing:
Windows as activated via robot will always use the whole disk space of DISK0 for the system partition (aka C:). In case your server has 2 hard disks of the same size a mirror will be set up on DISK1.
You can request to perform an individual and manually configured partition layout. In order to do this please don't activate an installation via robot and perform a support request instead asking for a custom Windows installation to be activated for you - our support needs to get confirmation for the deletion of all data on the respective server, so please also add those details to your request.
As a reply you'll receive a booted Windows installation and KVM access with which you can set up the partitioning - the remaining installation beside the partitioning phase will still be unattended and does not require any further interaction from your side. Our installations auto activate during deployment so you don't need to consider any licensing related topics there.
ISOs in general can be found at our mirror (only accessible from within the hetzner network):
http://mirror.hetzner.com/bootimages/windows/
Note that those ISOs are vanilla and do not include a license. Licensing a vanilla installation afterwards is not supported - so if you want to use a Hetzner-licensed Windows the process needs to be as described above.
Last but not least:
Since you're mentioning ReFS you may be already aware, that it can't be used on a system partition and needs to be set up afterwards. Therefore another option would be to either reinstall automatically with the default hetzner partition setup or to not reinstall at all but just to disable the mirror and repartition the second harddisk afterwards, using ReFS by creating a respective storage pool. You may also online resize the C:\ partition and use the freed space in the same way.
So we have a few options, including resizing the `C:\` partition and using the reclaimed space for a ReFS partition.

And indeed, if we go to `Computer Management` and then `Disk Management` we can see that the NTFS partition is mirrored:
And the context menu does have some options to disable mirroring:
Not sure how those two are different though.
Interestingly the mirror is not enabled on `windows-03`, most probably due to some kind of misconfiguration during installation:
Which makes it a perfect candidate to test ReFS.
After removing the confusing `D:`, `E:`, and `F:` drives I can create a ReFS volume:
Side-by-side comparison of clone performance of `status-desktop` on the same device with NTFS and ReFS filesystems using:
time git clone --recurse-submodules https://github.com/status-im/status-desktop.git
| Filesystem | Recursive Clone Time |
|---|---|
| NTFS - System volume | 2m11.168s |
| NTFS - Dedicated volume | 1m58.713s |
| ReFS - 4K unit size | 1m56.189s |
| ReFS - 64K unit size | 1m57.062s |
It appears that, at least in this test, there's no big difference between NTFS and ReFS, but there is a noticeable difference (roughly 12 to 15 seconds, or about 10%) between using the system partition and a dedicated partition. A real test would involve Jenkins running the whole build on ReFS.
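To put the table into relative terms, the `time` output can be converted to seconds and compared against the system-volume run. This is a small sketch using only the four measurements above; the `to_seconds` helper and the `CLONE_TIMES` mapping are illustrative names, and a single run per filesystem is of course not a rigorous benchmark.

```python
import re

# Measurements copied from the table above (output format of bash `time`).
CLONE_TIMES = {
    "NTFS - System volume": "2m11.168s",
    "NTFS - Dedicated volume": "1m58.713s",
    "ReFS - 4K unit size": "1m56.189s",
    "ReFS - 64K unit size": "1m57.062s",
}


def to_seconds(value):
    """Convert a string such as '2m11.168s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unexpected time format: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


baseline = to_seconds(CLONE_TIMES["NTFS - System volume"])

# Percentage of clone time saved relative to the system volume.
savings = {
    name: round(100 * (baseline - to_seconds(value)) / baseline, 1)
    for name, value in CLONE_TIMES.items()
}

for name, percent in savings.items():
    seconds = to_seconds(CLONE_TIMES[name])
    print(f"{name}: {seconds:.3f}s ({percent}% less than the system volume)")
```

All three dedicated-volume runs land within about two percentage points of each other, which matches the observation that the filesystem type matters less than avoiding the system volume.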
Apparently:
What is the difference between the ‘Break Mirror’ vs the ‘Remove Mirror’ option?
The “Break Mirror” operation, will stop the mirroring on the selected volume, without affecting the data on any disk. (Data will remain untouched on both disks).
The “Remove Mirror” operation, will stop the mirroring on the selected volume and destroys all the data on the mirror disk. (Data will remain only on one disk).
I have run a `status-desktop` build from `master` on `windows-03` after updating the node configuration:
https://ci.status.im/job/status-desktop/job/systems/job/windows/job/x86_64/job/package/662/
And the results don't appear to show any difference for dedicated ReFS:
I will try with dedicated NTFS volume.
NTFS makes little to no difference:
I think we should upgrade Windows slaves to match Linux ones, as currently they are smaller, AX41 instead of AX61:
That should speed up the builds at least a bit, in addition to splitting the mirror setup.
Actually, even just switching from AX41 to AX42 should make a big difference in terms of CPU power:
https://www.cpubenchmark.net/compare/3481vs6001/AMD-Ryzen-5-3600-vs-AMD-Ryzen-7-PRO-8700GE
We should get one AX42, bootstrap it, disable mirroring of the volumes, get ReFS on the volume, and test build performance.
In order to bootstrap a fresh Windows Server 2019 Standard host we use the following script: https://github.com/status-im/infra-tf-google-cloud/blob/master/setup.ps1
Originally this script was used to bootstrap Google Cloud Windows hosts like this:
/* Run PowerShell script for initial setup of a Window machine */
sysprep-specialize-script-ps1 = (var.win_password == null ? null :
  templatefile("${path.module}/setup.ps1", {
    hostname = each.key
    domain   = var.domain
    password = var.win_password
    ssh_key  = var.ssh_keys[0]
  })
)
But for Hetzner the script will have to be modified to provide these four variables.
Once the script finishes you can finally connect over SSH using the `administrator` user, and you can run the full bootstrap:
https://github.com/status-im/infra-role-bootstrap-windows
https://github.com/status-im/infra-ci/blob/9fb0f567b8c484e3f08341ba553d99d82dcb7546/ansible/bootstrap.yml#L53-L60
Once that is complete we can run roles specific to the CI slave configuration: https://github.com/status-im/infra-ci/blob/9fb0f567b8c484e3f08341ba553d99d82dcb7546/ansible/slaves.yml#L38-L45
In order to make the separate `J:` volume work for the Jenkins user you'll also need to modify the `jenkins_home` variable:
jenkins_home: '{{ jenkins_os_homes[ansible_system|lower] }}{{ jenkins_user }}'
This probably should be initially only done in `host_vars` since we have a heterogeneous setup.
Let's call the new host `windows-04` for now; we can always rename it.
I've ordered and received the new AX42 host, and connected to it with Remmina RDP. One important thing to notice is that your connection's `Security protocol negotiation` needs to be set to `TLS protocol security` in the `Advanced` tab of your RDP connection settings:
After connecting to the host I ran the script, but first I hit an issue: the script wasn't signed, so I had to resolve it by changing the execution policy:
Now I'll proceed with bootstrapping the host.
Managed to fully bootstrap the `windows-04` node and Jenkins agent. I had some issues with the bootstrap Windows playbook and with the infra-ci bootstrap playbook that configures Windows slaves: I had to set `become_method: runas`, `become_user: admin` and `ansible_shell_type: powershell` here in order for this to work.
After successfully running these playbooks, I ran the build, but it failed because cmake was not installed.
Okay, so actually there were issues with the `status-desktop-setup` role because the script failed silently:
WARN 'vcredist2022' (14.40.33810.0) is already installed.
Use 'scoop update vcredist2022 --global' to install a new version.
WARN Purging previous failed installation of inno-setup.
ERROR 'inno-setup' isn't installed correctly.
Removing older version (6.3.3).
'inno-setup' was uninstalled.
Installing 'inno-setup' (6.3.3) [64bit] from 'extras' bucket
Loading innosetup-6.3.3.exe from cache
Checking hash of innosetup-6.3.3.exe ... ok.
Extracting innosetup-6.3.3.exe ... ERROR Exit code was 1!
Failed to extract files from C:\ProgramData\scoop\apps\inno-setup\6.3.3\innosetup-6.3.3.exe.
Log file:
C:\ProgramData\scoop\apps\inno-setup\6.3.3\innounp.log
Please try again or create a new issue by using the following link and paste your console output: https://github.com/ScoopInstaller/Extras/issues/new?title=inno-setup%406.3.3%3a+decompress+error
And apparently there is an [issue and workaround](https://github.com/ScoopInstaller/Extras/issues/13911) for this. After trying it out the script installed packages smoothly. I will add these steps to the role.
Another thing is [this block](https://github.com/status-im/infra-ci/blob/9fb0f567b8c484e3f08341ba553d99d82dcb7546/ansible/roles/status-desktop-setup/tasks/win32nt.yml#L13C1-L17C30) which also fails silently:
changed: [windows-04.he-eu-hel1.ci.devel] => { "changed": true, "cmd": "scoop hold git", "delta": "0:00:00.544819", "end": "2024-08-29 08:59:30.253400", "rc": 0, "start": "2024-08-29 08:59:29.708581" }
STDOUT:
ERROR 'git' is not installed.
But if we do `scoop list` on the host, we will see that git is actually installed:
$ scoop list
Installed apps:
Name Version Source Updated Info
innounp-unicode 1.72 versions 2024-08-29 10:54:07
ntop 0.3.4 main 2024-08-29 10:09:17
7zip 24.08 main 2024-08-28 13:49:13 Global install
ag 2.2.5 C:\Users\Administrator\scoop\buckets\main\bucket\ag.json 2024-08-28 17:16:44 Global install
cacert 2024-07-02 main 2024-08-29 09:02:59 Global install
cmake 3.30.2 main 2024-08-29 10:05:11 Global install
cmder 1.3.25 main 2024-08-28 17:16:10 Global install
dark 3.14 main 2024-08-28 13:48:53 Global install
dd 0.6beta3 main 2024-08-28 17:16:11 Global install
diffutils 3.6 C:\Users\Administrator\scoop\buckets\main\bucket\diffutils.json 2024-08-28 17:16:45 Global install
dos2unix 7.5.0 C:\Users\Administrator\scoop\workspace\dos2unix.json 2024-08-28 17:16:45 Global install
findutils 4.4.2 C:\Users\Administrator\scoop\buckets\main\bucket\findutils.json 2024-08-28 17:16:45 Global install
firefox 129.0.2 extras 2024-08-28 17:16:18 Global install
gcc 13.2.0 main 2024-08-29 10:58:40 Global install
git 2.46.0 main 2024-08-28 13:49:24 Global install
...
and we should actually be using `scoop hold --global git`.
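Both failures above share a pattern: scoop prints `ERROR` but the task still reports `rc=0`. One way to catch this is to inspect the output as well as the exit code. The sketch below only illustrates that check; `scoop_failed` is a hypothetical helper, not something that exists in the role, and matching on the word `ERROR` is an assumption based on the log excerpts in this thread.

```python
import re


def scoop_failed(rc, stdout):
    """Treat a scoop call as failed on a non-zero exit code
    or when its output contains the word ERROR anywhere."""
    if rc != 0:
        return True
    return re.search(r"\bERROR\b", stdout) is not None


# Output of the `scoop hold git` task quoted above: rc=0, yet it did not work.
print(scoop_failed(0, "ERROR 'git' is not installed."))

# A warning alone should not count as a failure.
print(scoop_failed(0, "WARN 'vcredist2022' (14.40.33810.0) is already installed."))
```

In the role itself the same condition could probably be expressed with Ansible's `failed_when` on the registered task output, so that the play stops instead of continuing with a half-installed toolchain.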
After finally being able to run the build, the results are not as expected. The build time is approximately the same: average build is 24min on AX41 and I got 22min on AX42.
I was also inspecting the load on the host, but nothing much was happening. The resources were underutilized the whole time. This screenshot is from the `Package` stage of the build.
I don't think vertical scaling of the host will bring that much of a speedup.
Also, the earlier build, which failed with this error:
11:33:32 Warning: Cannot find Visual Studio redist directory, C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\VC\redist.
...
11:33:32 mv: cannot stat 'tmp/windows/dist/Status/bin/vc_redist.x64.exe': No such file or directory
11:33:32 make: *** [Makefile:765: pkg/StatusIm-Desktop-240829-091632-69cdd3-nightly-x86_64.exe] Error 1
happened because it was looking at the wrong directory for the `redist`. This problem occurred because we set the env variable `VCINSTALLDIR` here, and this is executed after we run the Jenkins agent executable, so the agent doesn't have the required environment variable.
If we look at the order of the roles that are executed for configuring Windows slave agents, we can see that `windows-jenkins-agent` is executed before `status-desktop-setup` here.
After restarting Jenkins agent on the host, the issue is gone.
I've tested the status-desktop build by modifying the parameters here and adding the `--jobs` flag with a value equal to the number of vCPUs of the machine. The running time improves by approximately 1 minute:
Maybe it would be worth adding the same flag to the nimbus build system
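For reference, this is roughly what a parallel submodule checkout looks like when built programmatically. It is a sketch under assumptions: `submodule_update_cmd` is a hypothetical helper, the job count of 8 is only an example value, and the actual change in the CI lives in the Jenkins parameters linked above, not in Python. The `--jobs` option of `git submodule update` itself is a standard Git flag.

```python
import os


def submodule_update_cmd(jobs=None):
    """Build a `git submodule update` command that fetches submodules in parallel.

    If `jobs` is not given, fall back to the logical CPU count of the machine.
    """
    if jobs is None:
        jobs = os.cpu_count() or 1  # os.cpu_count() may return None
    return [
        "git", "submodule", "update",
        "--init", "--recursive",
        "--jobs", str(jobs),
    ]


# Example with an explicit, hypothetical job count.
print(" ".join(submodule_update_cmd(jobs=8)))
```

The list form can be passed directly to `subprocess.run(...)` from inside a checked-out repository.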
P.S. I've discovered that the function from the Jenkins lib, `utils.getProcCount()`, returns the wrong number of CPUs. We are using it to set the `MAKEFLAGS` env variable, where the value is used for the `-j` flag, so we should fix this.
Yes, we should fix `utils.getProcCount()` to maybe call something like `nproc`.
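To illustrate the idea (not the actual Jenkins library code, which is Groovy), here is a sketch of a CPU count lookup that also works on Windows, where `nproc` is not normally available outside of MSYS-like shells. `detect_proc_count` and `make_flags` are hypothetical names; `NUMBER_OF_PROCESSORS` is the standard Windows environment variable and is used here only as a fallback.

```python
import os


def detect_proc_count(env=None):
    """Return the number of logical CPUs, never less than 1."""
    env = os.environ if env is None else env
    count = os.cpu_count()
    if count is None:
        # Fallback for the rare case where the OS API gives no answer.
        count = int(env.get("NUMBER_OF_PROCESSORS", "1"))
    return max(1, count)


def make_flags(proc_count):
    """Format the value that would go into the MAKEFLAGS env variable."""
    return f"-j{proc_count}"


print(make_flags(detect_proc_count()))
```

Whatever the real fix ends up being, comparing its result against the value reported by the OS on each agent would be a quick sanity check.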
As for the `--jobs` flag for Git, definitely a good idea. But I would also like to see if we can improve Windows performance with Nim compiler flags like `--threads`.
Here are the results with different parameter values for the `--jobs` submodules flag, the `-j` make flag and the Nim compiler `--threads` flag:
Now the `deps` and `package` steps seem to be longer than yesterday. I was looking at some more flags for optimizations but couldn't find anything that hasn't already been set within the Makefile.
One thing that I don't understand is the usage of `nimcache`. It is stored within the repo that is cloned at the start of the build, and when the build finishes it is wiped out along with the whole repo. Is it supposed to be used as a cache between different build stages? Would it be an option to define the cache somewhere else, where it would persist and could be reused across builds?
We have had bad behavior during builds as a result of Nim cache, hence it is not preserved between builds. In general it is safest to not use cache in between builds to make sure that every build is separate and is not affected by things that are not in the repo.
Now, if use of Nim cache can make a BIG difference on Windows we could reconsider: it might be big enough to be worth taking on the risks associated with using even a per-PR Nim cache.
The suggestion from the Nimbus team to disable all filters on the drive that the Jenkins agent uses (`J:`) is not available as suggested in the article, but there was an explanation of how these commands translate into registry entries, so I added these entries:
1. devdrv enable -> FsEnableDevDrive=1 in CCS\Control\FileSystem
2. disallowAv -> FltmgrDevDriveAllowAntivirusFilter=0 in CCS\Control\FilterManager
I've run the CI build for Windows, but there was no speed-up in the build time.
Description
A clean build on Windows takes more than 30 minutes (vs. 5 on macOS). Something could be wrong in the build configuration on Windows.
The purpose of this task is to analyse and optimise the Windows build process to reduce the build time.
Reproduced on an AMD Ryzen Z1 Extreme with an M.2 SSD and 16 GB LPDDR5X
Motivation
This optimisation would have a big impact on the Jenkins jobs running on every PR and would also help Windows devs.