lsc-sde / iac-terraform-azure

MIT License

Container Registry GitOps #30

Closed qcaas-nhs-sjt closed 7 months ago

qcaas-nhs-sjt commented 9 months ago

https://github.com/lsc-sde/lsc-sde/issues/6

qcaas-nhs-sjt commented 8 months ago

@vvcb yes, I'm doing a bit of extra work today because of the time missed on Monday. I had a lot of fun with this one yesterday and tried numerous different approaches to getting it to work:

I'm looking at kaniko, which, although created by Google, does appear to work with all the major cloud providers. It is supposedly daemonless, meaning it can run without privileged access on the host. If this works it could actually be a more secure way of building the container images, although we will need to do a small amount of work to get it working properly. For the moment I'm going to spend some of today on a PoC to see whether it actually works with the images that we use. If it does, I'm thinking we can create our own controller, borrowing some of the information collected by Flux's GitRepository CRD to tell us when the branch has been updated; when our controller sees a change, it would create a Kubernetes Job which runs kaniko to build the image. If the PoC doesn't work, this approach may not be viable.
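As a rough sketch, the Job that the controller creates could look something like this (the repo URL, destination registry, and secret name below are illustrative placeholders, not our actual configuration):

```yaml
# Illustrative sketch only: a Kubernetes Job running the kaniko executor to
# build and push an image without privileged access on the host.
apiVersion: batch/v1
kind: Job
metadata:
  name: kaniko-build-example
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: kaniko
          image: gcr.io/kaniko-project/executor:latest
          args:
            # Build context taken directly from a git branch (placeholder repo)
            - --context=git://github.com/example-org/example-repo.git#refs/heads/main
            - --dockerfile=Dockerfile
            # Placeholder destination registry/repository
            - --destination=example.azurecr.io/example/image:latest
            - --cache=true
          volumeMounts:
            - name: docker-config
              mountPath: /kaniko/.docker
      volumes:
        - name: docker-config
          secret:
            secretName: registry-credentials   # hypothetical push-credentials secret
            items:
              - key: .dockerconfigjson
                path: config.json
```

The controller's only job would then be to template out something like the above whenever the GitRepository's observed revision changes.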

Failing this, I would suggest we talk about using either DockerHub or another public repository (we would need a policy exception for this). The project is open source, so personally I don't see a problem with this: it wouldn't really be much different from us using the other container images that are already on DockerHub.

vvcb commented 8 months ago

@qcaas-nhs-sjt - a policy exception for a public repository feels appropriate here to avoid an overly complicated workflow. Are you happy to speak to Phoenix regarding this?

For future use cases that require a private ACR for microservices that we build and deploy internally, we can revisit this - but this is not a significant priority at the moment.

qcaas-nhs-sjt commented 8 months ago

@vvcb further to the above, you'll see that I've contacted Phoenix about this. They've come back with a suggestion, but I cannot currently get it working. I will continue to work with them on this.

For the record, I was able to get kaniko working. The biggest issue is that the memory requirement when building the Jupyter datascience notebook images gets extremely high. I suspect this isn't just kaniko; the commands run during the build look quite hungry to me. Memory use got to just shy of 9GB before completing; if it had grown any further the node would have killed the container, but luckily it didn't. This is after a number of changes to decrease the memory footprint using caches. If we do eventually go down this route, we may want to support it with larger-memory nodes and autoscaling rules so that they are only provisioned when needed. ACR remains the preferred solution, so I will continue with that tomorrow.
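If we did go down the kaniko route, the memory pressure could be contained by putting explicit resource requests/limits on the build pod and pinning it to a dedicated autoscaled node pool. A sketch of the relevant pod-spec fragment (the figures and pool name are illustrative, sized loosely against the ~9GB peak observed):

```yaml
# Illustrative fragment of the kaniko build pod spec. Values are assumptions
# based on the ~9GB peak seen when building the datascience notebook image.
containers:
  - name: kaniko
    resources:
      requests:
        memory: "10Gi"   # ensure scheduling onto a node with headroom
        cpu: "2"
      limits:
        memory: "12Gi"   # hard cap; OOM-kills the build rather than the node
# Pin builds to a high-memory node pool (hypothetical pool name) so that the
# cluster autoscaler only provisions large nodes when a build is queued.
nodeSelector:
  agentpool: buildpool
tolerations:
  - key: "workload"
    operator: "Equal"
    value: "build"
    effect: "NoSchedule"
```

Setting the request above the observed peak means the scheduler won't co-locate the build with other memory-hungry workloads, and a taint on the build pool keeps normal workloads off those expensive nodes.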

qcaas-nhs-sjt commented 8 months ago

Have asked again for a contact at MS with whom we can discuss whether this solution works.

vvcb commented 8 months ago

@qcaas-nhs-sjt , following our discussion today, let's go ahead with the public repository on DockerHub. Several of the images we will be building over the coming months will be useful for the wider open-source TRE community anyway.

And as you said, a private repo should not replace good discipline around how we build these images.

We can revisit the need for a private container registry when the need arises.

Do we need additional risk mitigation if we host our images on DockerHub? I don't think we do for our use cases, but it's worth documenting here if there are any.
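One lightweight mitigation worth noting (a suggestion on my part, not something already agreed): reference the public images by digest rather than by mutable tag, so a re-pushed or compromised tag on DockerHub cannot silently change what we deploy. The repository name and digest below are placeholder examples:

```yaml
# Illustrative: pin a public DockerHub image by digest instead of a mutable
# tag. Both the repository and the digest here are placeholders.
containers:
  - name: notebook
    image: docker.io/example-org/datascience-notebook@sha256:0000000000000000000000000000000000000000000000000000000000000000
```

A digest reference is immutable, so the deployed image can only change via an explicit manifest update in git, which keeps the GitOps audit trail intact.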