sillsdev / languageforge-lexbox

Language Forge / Language Depot unification project
MIT License

Use k8s node-affinity to try to get hg pds and volumes on the same node #732

Open myieye opened 5 months ago

myieye commented 5 months ago

We can see that performance took a hit after our release yesterday. Apparently @hahn-kev talked to TechOps, and it turns out the hg pods and volumes were on different nodes.

So, that's the working theory for the performance regression. hgweb presumably uses a ton of file-system reads when it updates its "repo index". So, being on a different node almost certainly makes a noticeable difference.


It's not NEARLY as bad as it was before, so we're not panicking, but there's room for improvement. And after seeing how good it can be, I find it hard to be satisfied with the current situation.

There are several options here:

nodeAffinity sounds like the simplest decent idea. It's a bit dissatisfying, because we don't really care which node things land on; we just want them to be on the same node. In that case, Local Persistent Volumes may be more suitable.
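As a rough sketch of the nodeAffinity option (pod name, image, paths, and the hostname value are all placeholders, not our actual manifests), pinning a pod to a specific node looks something like:

```yaml
# Hypothetical sketch: pin the hgweb pod to a specific node so it
# lands next to its volume. All names here are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: hgweb
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - worker-node-1
  containers:
    - name: hgweb
      image: hgweb:latest
      volumeMounts:
        - name: repos
          mountPath: /var/hg/repos
  volumes:
    - name: repos
      persistentVolumeClaim:
        claimName: hg-repos
```

With a Local Persistent Volume instead, the PV itself carries a `nodeAffinity` stanza, so the scheduler places any pod that claims it onto the right node automatically, which matches the "we don't care which node, just the same one" requirement more directly.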

hahn-kev commented 5 months ago

I don't know if we have access to local volumes; we'd need to talk to LT Ops about that, as it may not be available.

That said, a RWO volume would be a similar solution that should perform better (I believe it's actually similar to local volumes). It will also require node affinity so that multiple pods can access the volume; right now both lexbox and hgweb need access to the file system.
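For reference (a generic sketch, not our actual config; the claim name and size are placeholders), RWO is just an access mode on the claim:

```yaml
# Hypothetical sketch: a ReadWriteOnce claim. An RWO volume can be
# mounted read-write by only one node at a time, so every pod that
# needs the files must be scheduled onto that same node.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hg-repos
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
```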

tim-eves commented 5 months ago

Hi Tim, Kevin,

I'd be surprised if non-locality of the storage made that much of a difference. RWX is just NFS, and RWO is iSCSI, so not always local after a migration either, just exclusive. It's plausible hg doesn't perform well over non-local NFS, I guess, but I'd expect that to be a known issue by now, so it should show up in bug trackers etc. Kevin is right about the need for node affinity to keep the container on the same host as the PV. But the networking between nodes is 10Gbps, so I don't see how non-locality could slow everything down. For reference, SATA is 6Gbps, and the connection to the internet client is much slower, capped at 1Gbps for AWS, or 200Mbps for Dallas.

If it's currently happening, or next time it does, could you run fio from an hgweb container against the PV mount point, and also against a dir not on any PV mount point (non-PV-backed filesystems always use node-local storage)? That would rule locality of storage in or out.
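A concrete pair of invocations along these lines (a sketch only; the deployment name, paths, and fio job parameters are assumptions to be tuned) might be:

```shell
# Hypothetical sketch: compare random-read IOPS on the PV-backed
# path vs. node-local storage, from inside the hgweb pod.
kubectl exec -it deploy/hgweb -- \
  fio --name=pv-test --directory=/var/hg/repos \
      --rw=randread --bs=4k --size=256M \
      --iodepth=16 --ioengine=libaio --direct=1 \
      --runtime=30 --time_based --group_reporting

kubectl exec -it deploy/hgweb -- \
  fio --name=local-test --directory=/tmp \
      --rw=randread --bs=4k --size=256M \
      --iodepth=16 --ioengine=libaio --direct=1 \
      --runtime=30 --time_based --group_reporting
```

Small random reads with a modest queue depth are the workload where NFS latency would hurt most, so a large gap between the two `iops=` figures would point at storage locality; similar figures would point elsewhere.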

God bless, Tim


hahn-kev commented 5 months ago

I think the problem has less to do with bandwidth than with IOPS, so Gbps doesn't really matter if the latency is high.

hahn-kev commented 5 months ago

Before we attempt this, I think we need to measure the difference, mostly in IOPS, less in bandwidth.

rmunn commented 4 months ago

I've pretty much proved that some of our issues (such as https://github.com/sillsdev/languageforge-lexbox/issues/765 and https://github.com/sillsdev/languageforge-lexbox/issues/728) are caused by NFS: the LexBox API pod changes the filesystem (creating a new project, or resetting an existing project's repo to have a different root commit), but the HgWeb pod doesn't see the change for a while (typically 30-60 seconds in my experience).

All my attempts to solve the problem so far have failed. For example, in https://github.com/sillsdev/languageforge-lexbox/pull/789 I ran sync on the LexBox API pod, hoping that this would force NFS to flush its client cache to the server and therefore let the HgWeb pod see the change sooner. But even after running sync, it takes roughly 30 seconds before the HgWeb pod has the same view of the Mercurial repo that the LexBox pod does. This has caused us much frustration: our integration tests produce false failures when the HgWeb pod has an outdated view of the filesystem, or else time out while we wait for HgWeb to see the "correct" filesystem state.
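The staleness described above can be demonstrated by hand with something like the following (a sketch; the deployment names, repo path, and `$PROJECT` are placeholders):

```shell
# Hypothetical sketch: commit on the LexBox API pod, force a sync,
# then compare tip revisions across the two pods. With NFS client
# caching, the two outputs can disagree for tens of seconds.
kubectl exec deploy/lexbox-api -- sync
kubectl exec deploy/lexbox-api -- \
  hg -R /hg-repos/$PROJECT tip --template '{node}\n'
kubectl exec deploy/hgweb -- \
  hg -R /hg-repos/$PROJECT tip --template '{node}\n'
```

The point is that `sync` only flushes dirty pages from the writer's client; it does nothing about the *reader's* attribute and directory caches on the HgWeb side, which is consistent with the ~30-second delay observed.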

Are there any ReadWriteMany volume types we could use that aren't backed by NFS? Something that would allow us to make a change in one pod, and have the other pod reliably see the same change (even if we have to manually force a sync) would solve a lot of our issues.

rmunn commented 4 months ago

A drawback of ReadWriteOnce is that deployments can't use the "spin up the second pod before spinning down the first" strategy, so you end up with service interruptions. The first pod has to spin down first, then the second pod can spin up, and if the spin-up time is long you can end up with a service outage of several minutes. Plus, if the spin-up of the new pod fails for some reason, your service is down until you can bring the original pod back up (which is sometimes tricky if the volume has now been "assigned" to the failing pod).

ReadWriteMany allows a much safer deployment process... but if it comes at the cost of consistent integration test failures, I'm not sure it's worth it anymore.

tim-eves commented 4 months ago

You can use rolling updates with RWO, but you need to set node affinity to ensure the new pod starts on the same node as the old one. RWO volumes can be mounted on only a single node at a time, but by any number of pods on that node.
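Put together, a rolling update over a RWO volume might be sketched like this (labels, names, and the node value are placeholders, not our manifests):

```yaml
# Hypothetical sketch: surge the new pod onto the same node before
# the old one stops, so both can mount the RWO volume during the
# handover (RWO is node-exclusive, not pod-exclusive).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hgweb
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: hgweb
  template:
    metadata:
      labels:
        app: hgweb
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - worker-node-1
      containers:
        - name: hgweb
          image: hgweb:latest
          volumeMounts:
            - name: repos
              mountPath: /var/hg/repos
      volumes:
        - name: repos
          persistentVolumeClaim:
            claimName: hg-repos
```

The `maxSurge: 1` / `maxUnavailable: 0` pair is what keeps the old pod serving until the new one is ready; the required node affinity is what makes that legal with a RWO claim.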


rmunn commented 4 months ago

Note that in addition to node affinity, there's also pod affinity: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity. That says "I don't care about the node labels, but I want the hgweb pod on the same node as the lexbox pod".
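A sketch of that (the `app: lexbox` label is an assumption about what the lexbox deployment actually uses) would be an affinity stanza on the hgweb pod spec like:

```yaml
# Hypothetical sketch: schedule hgweb onto whichever node is
# already running a pod labeled app=lexbox, without naming any
# particular node. Label values are placeholders.
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - lexbox
          topologyKey: kubernetes.io/hostname
```

Unlike plain nodeAffinity, this needs no hard-coded node name, which fits the "we just want them together" requirement stated earlier in the thread.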