Fix fallout of Let's Encrypt root and intermediate certificate expiration
Let's Encrypt's DST Root CA X3 certificate expired on 30 September 2021.
Normally, TLS libraries should have switched automatically to the existing valid replacement.
However, the version of OpenSSL (1.1.0l) in Debian 9 (Stretch) had a bug that prevented switching to the new certificate.
This caused Composer to fail to connect.
The fix is to migrate our PHP containers' base image from Debian 9 (Stretch) to Debian 10 (Buster).
The latter ships OpenSSL 1.1.1d, which doesn't exhibit that buggy behaviour.
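For illustration, the change has the following shape in a Dockerfile; the actual image names and tags in our Dockerfiles may differ:

```dockerfile
# Before (Debian 9 based image, affected OpenSSL 1.1.0l):
# FROM php:7.3-fpm-stretch

# After (Debian 10 based image, OpenSSL 1.1.1d):
FROM php:7.3-fpm-buster
```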
However, moving to a new major version of the base image is not trivial: a few things broke and needed to be fixed.
All breakages were fixed except for css_check, which I've commented out for now until we figure out why it doesn't work.
Client libraries for PostgreSQL 9.6 are no longer available in Debian 10, so I pulled in version 11 of the PostgreSQL client libraries.
There doesn't appear to be any problem connecting to a PostgreSQL 9.6 server with that version.
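A minimal sketch of the corresponding Dockerfile step, assuming the default Buster repositories (the exact base image and package list in our Dockerfiles may differ):

```dockerfile
FROM php:7.3-fpm-buster

# Debian 10 (Buster) ships PostgreSQL 11 tooling; the 9.6 client packages are gone
RUN apt-get update \
    && apt-get install -y --no-install-recommends postgresql-client-11 \
    && rm -rf /var/lib/apt/lists/*
```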
Another problem affected Ansible: an error was thrown when running a playbook that uses the yum module.
The temporary fix would have been to disable certificate validation as an option to the Ansible yum module call, but
that would reduce the security of our systems.
The proper fix is to upgrade CentOS, used as the OS on the EC2 instances, from CentOS 7 to CentOS 8.4.
That required the following changes:
In the docker-install role, we need to remove the "enable Docker-CE repo" step, otherwise an error occurs
In the postgres-preinstall role, we need to install the CentOS 8 PostgreSQL repo
In the ansible-role-postgresql role configuration in playbook.yml, we need to use PostgreSQL 11 and enable sudo with become (see the sketch below)
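A minimal sketch of the playbook.yml role configuration after those changes; the host group and variable names are illustrative and may not match our playbook exactly:

```yaml
# playbook.yml (sketch): run the PostgreSQL role with sudo and PostgreSQL 11
- hosts: gigadb_servers
  become: true                  # sudo is needed on the CentOS 8.4 hosts
  roles:
    - role: ansible-role-postgresql
      vars:
        postgresql_version: 11  # version provided by the CentOS 8 PostgreSQL repo
```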
Improve caching in GitLab jobs and Docker build to speed up GitLab jobs
We have caching already for our custom images. Several other techniques are now applied (or restored):
Local cache of base images from Docker Hub
The base images used in our containers are pulled from Docker Hub, but because each GitLab job is isolated and creates its own instance of docker-dind,
the base images are never available locally when a docker build command is triggered, requiring them to be pulled for each job.
What we do is create a new preliminary GitLab stage (.pre), with a job that pulls, once, all the base images we use in the project. We then save them as a TAR archive and use the GitLab artifacts functionality to make those files available to all subsequent stages.
In the jobs that need to build container images, we prepend a few lines to the jobs' steps to load the TAR archive as local Docker images, so the build process doesn't need to pull them remotely.
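Roughly, the approach looks like this in .gitlab-ci.yml; the job names and the image list are illustrative:

```yaml
pull_base_images:
  stage: .pre
  script:
    # pull each base image once, then archive them all into a single TAR file
    - docker pull php:7.3-fpm-buster
    - docker pull nginx:1.21.3
    - docker save -o base-images.tar php:7.3-fpm-buster nginx:1.21.3
  artifacts:
    paths:
      - base-images.tar

build_images:
  stage: build
  script:
    # restore the archived base images into this job's docker-dind daemon
    - docker load -i base-images.tar
    # docker build now finds the base images locally instead of pulling them
    - docker-compose -f docker-compose-build.yml build
```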
Authenticated login to Docker Hub
Until now, all pulls to Docker Hub were anonymous, but Docker Hub enforces rate limits on the number of pulls per period of time, and these limits differ for anonymous users, logged-in free users and paying users.
By logging in with our Docker Hub account (whose credentials have to be set in GitLab variables) we increase our pull capacity.
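For example, in the relevant jobs (DOCKERHUB_USER and DOCKERHUB_TOKEN are assumed names for the GitLab CI/CD variables holding the credentials):

```yaml
before_script:
  # authenticate before any pull so the higher, authenticated rate limit applies
  - echo "$DOCKERHUB_TOKEN" | docker login --username "$DOCKERHUB_USER" --password-stdin
```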
Caching of composer libraries
Since we have locked versions with composer.lock, the vendor libraries can be cached between jobs.
To do so, we use the GitLab cache functionality, which makes a list of paths (in our case Composer files and directories) available across jobs, stages and pipelines of the same project.
We had that configuration before, but it disappeared, so we restore it.
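The restored configuration is along these lines (cache key and paths are illustrative):

```yaml
cache:
  key:
    files:
      - composer.lock   # a new lock file produces a new cache
  paths:
    - vendor/           # Composer-installed libraries and binaries
```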
Pin base images to precise versions
Until now, we tended to use latest or x.y (major version) as the image tag when specifying base images in our Dockerfiles.
The problem is that those container images can be updated whenever a minor version is released, invalidating our cached images and triggering a pull and rebuild of our custom images.
Additionally, we can't be certain which version these loosely tagged base images are at, as the upgrade is not audited on our side.
Instead, we use a precise tag (x.y.z) for each base image, removing any chance of the base image changing under us.
We will manage upgrades of our infrastructure ourselves, with our own auditing.
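As an illustration (the pinned version shown here is just an example of the x.y.z form):

```dockerfile
# Loose tag: can silently move to a newer minor release and bust our cache
# FROM postgres:11

# Pinned tag: the base image only changes when we change this line
FROM postgres:11.13
```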
Comment out docker pull and build instructions related to FUW
The container services associated with the File Upload Wizard are not deployed on the production environment, so there's no need
to pull and build their images.
Fix tag for custom images
The production images built in the build job are tagged with the environment they are for.
Unfortunately, when they were pulled before the build, the wrong tag (latest instead of staging or live) was used, causing docker build to think there was no cached image, so it built entirely new production images again.
An associated bug was the absence of the environment-specific tag in docker-compose-build.yml, where we define which image to use as cache.
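A sketch of the corrected docker-compose-build.yml entry; the service name, registry path and environment variable are illustrative:

```yaml
version: "3.7"
services:
  web:
    build:
      context: .
      dockerfile: Production-Web-Dockerfile
      cache_from:
        # must carry the environment-specific tag (staging or live), not latest
        - "${CI_REGISTRY_IMAGE}/web:${GIGADB_ENV}"
    image: "${CI_REGISTRY_IMAGE}/web:${GIGADB_ENV}"
```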
Move transient Docker instructions to the end of the Dockerfile
In the Production-Web-Dockerfile, the RUN apk ... block was triggered every time, causing extra compile time, because its preceding layer (the block creating the site config, which never persists) is by definition constantly invalidated.
By moving the RUN apk ... block before the site config block, we enable the compilation stage to be cached.
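A sketch of the re-ordering; the base image, packages and site config step are illustrative:

```dockerfile
FROM nginx:1.21.3-alpine

# Expensive, cacheable step first: its layer is reused as long as this line is unchanged
RUN apk add --no-cache bash curl openssl

# Site config block last: it is regenerated on every build, so any instruction
# placed after it would be rebuilt every time
COPY site-config/ /etc/nginx/conf.d/
```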
Tests in CI
For the last few weeks we've made a lot of changes to the infrastructure without running the test suites in CI. When I reactivated the test job, it failed, mostly because of those changes, so the test job had to be adapted. The main change is that we now use the up.sh project setup script in the CI test job as well.
I've also added an exclusion block for functional tests in protected/tests/phpunit.xml to leave out flaky tests and those related to the File Upload Wizard.
The other change is that Composer-installed binaries like phpunit, behat and phpcov need to be fully referenced, because on CI we don't have the bin/ symlink in the project directory.
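For example, assuming the Composer default vendor/bin location:

```sh
# on CI the bin/ symlink is absent, so reference the binaries by their full path
vendor/bin/phpunit --configuration protected/tests/phpunit.xml
vendor/bin/behat
```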
Finally, the main gigadb test suite CI job is now reinstated.
Bug fix
Fix the pg_dump command in convert_production_db_to_latest_ver.sh to use the version used by @pli888
TODO:
[x] Looking at build logs, I see that a lot of time is spent building the PHP base image; with these changes this will happen only once per job, as the container is then available locally when building all our container images
[x] By logging in with Docker Hub account credentials, the process is no longer anonymous, and it's highly possible that Docker shapes the bandwidth to their servers differently depending on whether we are anonymous or logged in (not least because they have set up rate limits for docker pulls)
[x] Our pipelines already use artifacts to pass data between jobs; we will pass Docker base images the same way
[x] The credentials for the Docker Hub account (username and access token) need to be stored in GitLab variables, set for the "All (default)" environment option