intuitem / ciso-assistant-community

CISO Assistant is a one-stop shop for GRC, covering Risk, AppSec and Audit Management and supporting 60+ frameworks worldwide with auto-mapping: NIST CSF, ISO 27001, SOC2, CIS, PCI DSS, NIS2, CMMC, PSPF, GDPR, HIPAA, Essential Eight, NYDFS-500, DORA, NIST AI RMF, 800-53, 800-171, CyFun, CJIS, AirCyber, NCSC, ECC, SCF and many more.
https://intuitem.com

Intermittent cases of getaddrinfo failing between frontend and backend #482

Open LennertVA opened 5 months ago

LennertVA commented 5 months ago

Describe the bug: While playing around in the Community Edition, using the provided Docker images, error 500 happens very regularly - roughly one in every three to four actions triggers one. According to the debug output it is due to getaddrinfo sometimes failing. In every case, simply refreshing the interface once or twice makes it go away.

To Reproduce: No specific steps are needed. It happens all over the interface, for any action that involves calling the backend, in roughly 25% of cases.

Expected behavior: No error 500s.

Screenshots: Screenshots don't show much except "Error 500 - Internal Error", but whenever it happens, this is the cause in the container logs:

frontend    | TypeError: fetch failed
frontend    |     at node:internal/deps/undici/undici:12500:13
frontend    |     at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
frontend    |   [cause]: Error: getaddrinfo ENOTFOUND backend
frontend    |       at GetAddrInfoReqWrap.onlookupall [as oncomplete] (node:dns:118:26) {
frontend    |     errno: -3008,
frontend    |     code: 'ENOTFOUND',
frontend    |     syscall: 'getaddrinfo',
frontend    |     hostname: 'backend'
frontend    |   }
frontend    | }

Environment (please complete the following information):

Additional context: The server OS runs SELinux in full enforcing mode. It took quite some relabeling of files and loading of custom policies to get it running, but now that it does, SELinux appears not to be involved in this (no audit logs of anything being blocked). Still worth mentioning, perhaps.

It is particularly odd that it only fails sometimes, and that when it does, a refresh usually does the trick. That means it is not simply a case of something being broken or blocked, since it does work most of the time. Is there a very short timeout configured somewhere for the call? The host server does run under a noticeable load.

ab-smith commented 5 months ago

Hello @LennertVA

Thank you for the feedback. This reminds me of a discussion on Discord about a user on shared (mutualised) virtualisation with low resources. The trick there was to rebuild the Docker image with a higher timeout on gunicorn, but we will try to set up an equivalent environment to yours in the meantime.
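For reference, the kind of change involved is roughly the following (a sketch only - the WSGI module path and startup command below are illustrative, not copied from the published image):

```sh
# Sketch: raise gunicorn's worker timeout in the backend startup command.
# Module path and bind address are illustrative; gunicorn's default
# --timeout is 30 seconds, which a busy or I/O-starved host can exceed.
gunicorn ciso_assistant.wsgi:application \
  --bind 0.0.0.0:8000 \
  --timeout 120
```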

I've read that you've set up the labels, but have you ruled SELinux out by switching temporarily to permissive mode?

Warm regards,

LennertVA commented 5 months ago

> Hello @LennertVA
>
> Thank you for the feedback. This reminds me of a discussion on Discord about a user on shared (mutualised) virtualisation with low resources.

Well, I wouldn't call it low resources, but the physical system is certainly well used. The VM running the Docker images less so - it has a load under 0.5 and more than 50% free memory - but I can't rule out that it sometimes has to wait for CPU or I/O a bit longer than usual, because heavily loaded, higher-priority VMs are being greedy.

> The trick there was to rebuild the Docker image with a higher timeout on gunicorn, but we will try to set up an equivalent environment to yours in the meantime.

Interesting - definitely curious what you'll find. What mostly threw me off is that the error is in getaddrinfo; I wasn't expecting something as mundane as name resolution to be the thing that falls over.

> I've read that you've set up the labels, but have you ruled SELinux out by switching temporarily to permissive mode?

I have not. The host runs some other containers too, and I don't like the idea of disabling it globally to test a single case. SELinux seems happy enough - I had to grant some permissions and set some labels, but if it were blocking anything, that would be clearly logged, and nothing is being logged anymore. It could be worth testing though, you're right.
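For completeness, checking for denials without flipping the whole host to permissive would look roughly like this (a sketch, assuming the standard audit tooling is installed):

```sh
# Look for recent SELinux AVC denials
ausearch -m AVC,USER_AVC -ts recent

# Or make only the container domain permissive instead of a global
# setenforce 0 (semanage comes from policycoreutils-python-utils on RHEL)
semanage permissive -a container_t
```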

LennertVA commented 5 months ago

> I've read that you've set up the labels, but have you ruled SELinux out by switching temporarily to permissive mode?

Just did a setenforce 0, refreshed the CISO Assistant UI, clicked on a few random items (risk assessment et al) and faced half a dozen error 500s within the first 30 seconds. So it's safe to say that SELinux is not causing this.

The VM running the containers also has a load of ~0.25 (for two CPUs) and ~40% of memory in use, so that doesn't sound like a very likely cause either.

ab-smith commented 5 months ago

Thanks @LennertVA for the feedback. Given that we are unable to reproduce this, we'll try to build an equivalent setup and get back to you. Regards

LennertVA commented 5 months ago

Thanks! I've been looking around for this particular error (errno -3008), and a surprising number of hits deal with multicast-related services and mDNS. Is CISO Assistant also using mDNS in any way internally?

For good measure I also turned off the local host firewall for a minute - I honestly can't imagine how that would be the cause, but okay - and still no dice.

ab-smith commented 5 months ago

So after some digging, it seems that other software reports this strange behaviour between Node, undici and Docker during DNS resolution. Some people suggest tricks like this one, but right now I'm not a big fan of specifying a fixed chain of DNS resolvers: https://forum.weaviate.io/t/fetch-error-between-ts-client-and-weaviate-when-deployed-with-docker-compose-on-windows/2146/9
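For context, the workarounds suggested in that vein usually boil down to pinning the resolvers the frontend container uses, or nudging Node's resolver behaviour - something along these lines, purely as an illustration of the idea rather than a recommendation:

```sh
# Pin explicit DNS servers on the container (docker compose has an
# equivalent `dns:` key); the image name is a placeholder.
docker run --dns 8.8.8.8 --dns 1.1.1.1 ciso-assistant-frontend

# Another commonly suggested knob: make Node's resolver return IPv4
# results first inside the frontend container.
export NODE_OPTIONS="--dns-result-order=ipv4first"
```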

Regardless of these errno -3008 and DNS warnings, how does it translate to CISO Assistant? Do you get any errors - are you still seeing random 500 errors after the update?

I've managed to build a home lab on a NUC with Proxmox and RHEL 9 to emulate your setup.

Thank you

LennertVA commented 5 months ago

> So after some digging, it seems that other software reports this strange behaviour between Node, undici and Docker during DNS resolution. Some people suggest tricks like this one, but right now I'm not a big fan of specifying a fixed chain of DNS resolvers: https://forum.weaviate.io/t/fetch-error-between-ts-client-and-weaviate-when-deployed-with-docker-compose-on-windows/2146/9

Indeed, same thoughts here. It feels like a band-aid on a wooden leg - this shouldn't be happening in the first place.

> Regardless of these errno -3008 and DNS warnings, how does it translate to CISO Assistant? Do you get any errors - are you still seeing random 500 errors after the update?

Yes, every third or fourth click still results in an error 500.

ab-smith commented 4 months ago

OK, I've managed to reproduce similar issues by artificially introducing latency between the frontend and the backend with toxiproxy. Can you confirm that you haven't split the frontend and backend across different VMs/hosts? And are you using compose with the prebuilt images or with locally built ones?
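Roughly speaking, injecting the latency with toxiproxy looks like this (service name and ports are illustrative, not the exact lab config):

```sh
# Put toxiproxy between the frontend and the backend, then add a latency toxic.
toxiproxy-cli create -l 0.0.0.0:18000 -u backend:8000 backend_proxy
toxiproxy-cli toxic add -t latency -a latency=2000 backend_proxy
# The frontend's backend URL is then pointed at port 18000 so its calls
# go through the proxy instead of hitting the backend directly.
```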

LennertVA commented 4 months ago

> OK, I've managed to reproduce similar issues by artificially introducing latency between the frontend and the backend with toxiproxy. Can you confirm that you haven't split the frontend and backend across different VMs/hosts? And are you using compose with the prebuilt images or with locally built ones?

Yes, front and back are on the same host and in the same "pod". There is nothing in between. Using the prebuilt images.