Open LennertVA opened 5 months ago
Hello @LennertVA
Thank you for the feedback. This reminds me of a discussion on Discord about a user with shared virtualisation and low resources. The trick there was to re-bake the docker image with a higher timeout on gunicorn, but I will try to set up an equivalent environment to yours in the meantime.
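For reference, a rough sketch of that kind of override, assuming the backend image launches gunicorn with a `gunicorn.conf.py` (the file name and the values below are assumptions for illustration, not the image's actual configuration):

```python
# gunicorn.conf.py -- hypothetical override baked into the backend image.
# Gunicorn loads this file when started with `gunicorn -c gunicorn.conf.py ...`.

# Give slow requests more headroom before the worker is killed and the
# frontend sees a 500 (gunicorn's default timeout is 30 seconds).
timeout = 120

# Allow in-flight requests to finish during a graceful restart.
graceful_timeout = 90

# Keep idle connections from the frontend open a bit longer.
keepalive = 5
```

If the entrypoint starts gunicorn directly, the `GUNICORN_CMD_ARGS` environment variable (e.g. `GUNICORN_CMD_ARGS="--timeout 120"`) can achieve the same effect without a rebuild; whether that applies to the prebuilt image is an assumption.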
I've read that you've set up the labels, but have you ruled out SELinux by temporarily switching to permissive mode?
Warm regards,
> Hello @LennertVA
> Thank you for the feedback. This reminds me of a discussion on Discord about a user with shared virtualisation and low resources.
Well, I wouldn't call it low resources, but the physical system is well used, at least. The VM running the docker containers less so: it has a load under 0.5 and 50+% free memory. But I can't rule out that it sometimes has to wait for CPU or I/O a bit longer than usual because heavily loaded, higher-priority VMs are being greedy.
> The trick there was to re-bake the docker image with a higher timeout on gunicorn, but I will try to set up an equivalent environment to yours in the meantime.
Interesting. Definitely curious what you'll find. What mostly threw me off is that the error is in getaddrinfo; I wasn't expecting something as silly as name resolution to be the thing that would start shitting itself.
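For what it's worth, here is the kind of quick probe one could run from another container on the same Docker network to see whether resolution itself flakes out intermittently; the service name `backend` and the port are assumptions and would need to match whatever the compose file uses:

```python
# dns_probe.py -- hypothetical check for intermittent resolution failures.
# Run it from a container on the same Docker network (or anywhere using the
# same resolver) and watch for sporadic errors.
import socket
import time

HOSTNAME = "backend"  # assumption: the compose service name of the API
PORT = 8000           # assumption: the backend port
ATTEMPTS = 200

failures = 0
for i in range(ATTEMPTS):
    try:
        socket.getaddrinfo(HOSTNAME, PORT, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        failures += 1
        print(f"attempt {i}: getaddrinfo failed: {exc}")
    time.sleep(0.1)

print(f"{failures}/{ATTEMPTS} lookups failed")
```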
> I've read that you've set up the labels, but have you ruled out SELinux by temporarily switching to permissive mode?
I have not. The host runs some other containers too, and I don't like the idea of disabling it globally to test a single case. SELinux seems happy enough: I had to grant some permissions and set some labels, but if it were blocking anything it would be clearly logged in the audit log, and that is no longer happening. It could be worth testing though, you're right.
> I've read that you've set up the labels, but have you ruled out SELinux by temporarily switching to permissive mode?
Just did a `setenforce 0`, refreshed the CISO Assistant UI, clicked on a few random items (risk assessments and the like) and hit half a dozen error 500s within the first 30 seconds. So it's safe to say that SELinux is not causing this.
The VM running the containers also has a load of ~0.25 (for two CPUs) and memory ~40% in use. So that also doesn't sound like a very likely cause.
Thanks @LennertVA for the feedback. Given that we are unable to reproduce this, we'll try to build an equivalent setup and get back to you. Regards
Thanks! I've been looking around for this particular error 3008, and a surprising number of hits deal with multicast-related services and mDNS. Does CISO Assistant also use mDNS in any way internally?
For good measure I turned off the local host firewall for a minute too - I honestly can't imagine how that would be the cause, but okay - and still no dice.
So after some digging, it seems that other software projects are reporting this strange behaviour between node, undici and docker during DNS resolution. Other people are suggesting tricks like this one, but I'm not a big fan of pinning a specific DNS resolution chain right now: https://forum.weaviate.io/t/fetch-error-between-ts-client-and-weaviate-when-deployed-with-docker-compose-on-windows/2146/9
Regardless of these 3008 and DNS warnings, how does this translate to CISO Assistant? Do you get any errors? Are you still seeing random 500 errors after the update?
I've managed to build a home lab on a NUC with Proxmox and RHEL 9 to emulate your setup.
Thank you
> So after some digging, it seems that other software projects are reporting this strange behaviour between node, undici and docker during DNS resolution. Other people are suggesting tricks like this one, but I'm not a big fan of pinning a specific DNS resolution chain right now: https://forum.weaviate.io/t/fetch-error-between-ts-client-and-weaviate-when-deployed-with-docker-compose-on-windows/2146/9
Indeed, same thoughts here. Feels like a band-aid rather than a real fix. This shouldn't be happening in the first place.
> Regardless of these 3008 and DNS warnings, how does this translate to CISO Assistant? Do you get any errors? Are you still seeing random 500 errors after the update?
Yes, roughly every third or fourth click still results in an error 500.
OK, I've managed to reproduce similar issues by artificially introducing latency between the frontend and the backend with toxiproxy. Can you confirm that you haven't split the frontend from the backend across different VMs/hosts? Were you using compose with the prebuilt images or with locally built ones?
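For transparency, this is roughly what that injection looks like against toxiproxy's HTTP admin API (a sketch only: the proxy name, ports, upstream address and latency values are arbitrary assumptions, and the admin API is assumed to be on its default port 8474):

```python
# toxiproxy_latency.py -- rough sketch of injecting latency between the
# frontend and backend via toxiproxy's HTTP admin API (default port 8474).
# Names, ports and latency values are illustrative only.
import requests

TOXIPROXY = "http://localhost:8474"

# Create a proxy in front of the backend: the frontend is then pointed at
# :18000 instead of talking to the backend on :8000 directly.
requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "backend",
    "listen": "0.0.0.0:18000",
    "upstream": "backend:8000",
}).raise_for_status()

# Add a latency toxic on the downstream direction: ~1s +/- 500ms per request.
requests.post(f"{TOXIPROXY}/proxies/backend/toxics", json={
    "type": "latency",
    "stream": "downstream",
    "attributes": {"latency": 1000, "jitter": 500},
}).raise_for_status()
```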
> OK, I've managed to reproduce similar issues by artificially introducing latency between the frontend and the backend with toxiproxy. Can you confirm that you haven't split the frontend from the backend across different VMs/hosts? Were you using compose with the prebuilt images or with locally built ones?
Yes, front and back are on the same host and in the same "pod". There is nothing in between. Using the prebuilt images.
**Describe the bug**
While playing around in the Community Edition, using the provided docker images, error 500s happen very regularly: roughly one in 3 to 4 actions triggers one. According to the debug output it is due to getaddrinfo sometimes failing. In every case, simply refreshing the interface once or twice makes it go away.
**To Reproduce**
There are no specific steps needed to reproduce. It happens all over the interface, for any action that involves calling the backend, in roughly 25% of cases.
**Expected behavior**
No error 500s.
**Screenshots**
Screenshots don't say much except "Error 500 - Internal Error", but whenever it happens, this is the cause shown in the container logs:
**Environment (please complete the following information):**
**Additional context**
The server OS runs SELinux in full enforcing mode. It took quite some relabeling of files and loading of custom policies to get it running, but now that it runs it appears not to be involved in this (no audit logs of anything being blocked). Still worth mentioning, perhaps.
It is particularly odd that it only fails sometimes, and when it does fail, a refresh usually does the trick. That means it is not simply a case of something being broken or blocked, since it does work "usually". Is there a very short timeout configured somewhere for the call? The host server does run under a noticeable load.
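To quantify the failure rate outside the browser, a probe along these lines could be used (a sketch; the URL is an assumption, and pages that hit the backend would likely need an authenticated session):

```python
# error_rate_probe.py -- hypothetical loop to measure how often the frontend
# returns a 500. The URL is an assumption; any page that triggers a backend
# call would do, and a session cookie may be required.
import requests

URL = "https://ciso.example.org/"  # assumption: the frontend URL
ATTEMPTS = 100

session = requests.Session()
errors = 0
for i in range(ATTEMPTS):
    resp = session.get(URL, timeout=10)
    if resp.status_code >= 500:
        errors += 1
        print(f"attempt {i}: HTTP {resp.status_code}")

print(f"{errors}/{ATTEMPTS} requests returned a server error")
```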