jhuckaby / performa

A multi-server monitoring system with a web based UI.
430 stars · 22 forks

WebUI unresponsive #32

Open ChristophorusReyhan opened 2 days ago

ChristophorusReyhan commented 2 days ago

Performa's WebUI occasionally becomes unresponsive and shows the error "ERR_EMPTY_RESPONSE". When I check the log, the last entry is:

[1730344245.132][2024-10-31 10:10:45][performa-monitor.local][31539][WebServer][error][ECONNRESET][Client error: undefined: socket hang up][{"id":"n/a","port":5512,"pending":38,"active":512,"sockets":29}]

Does this mean the server is overloaded? What can I do to mitigate this? My current workaround is a script that checks the site and restarts the service whenever it hangs up (indicating the empty-response error). This is not optimal.
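For reference, such a watchdog could be sketched like this (hypothetical: the port is taken from the log line above, and the restart command is an assumption; adjust both for your setup):

```shell
#!/bin/sh
# Hypothetical watchdog sketch: poll the WebUI and restart the service on failure.
# The URL and restart command below are assumptions, not Performa defaults.
check_and_restart() {
  url="$1"; restart_cmd="$2"
  if curl -sf --max-time 10 "$url" >/dev/null 2>&1; then
    echo "ok"
  else
    # WebUI did not respond in time; restart the service.
    $restart_cmd
    echo "restarted"
  fi
}

# Example invocation (port 5512 as seen in the log line above):
# check_and_restart "http://localhost:5512/" "systemctl restart performa"
```

Run from cron every minute or so; it only touches the service when the health check fails.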

jhuckaby commented 2 days ago

How many servers do you have Performa monitoring?

That error message shows that the web server is totally overloaded. It has 512 active (concurrent) connections, with 38 stuck in a wait queue. That is a horrible situation.

I assume you either have hundreds or thousands of servers all reporting in, and/or you have an extremely underpowered Performa master server.

If the former (hundreds or thousands of servers), I HIGHLY recommend adding a max_sleep_ms property in your Performa Satellite configuration files on all your servers. This defaults to 5000 ms. I recommend 30000 ms for setups with hundreds or thousands of servers. This will randomly "spread out" the connections the servers make to the master server to report their metrics every minute.
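As a sketch, the Satellite config change might look like this (only the max_sleep_ms key from the comment above is shown; merge it into your existing Satellite config file, whose other keys are omitted here):

```json
{
  "max_sleep_ms": 30000
}
```

Each Satellite then sleeps a random amount up to 30 seconds before reporting, spreading the per-minute connection burst across the window instead of hitting the master all at once.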

See Scalability in the Satellite docs for more details on this.

I'd also recommend upgrading your master server. It is currently very, very unhappy.

jhuckaby commented 2 days ago

I should have also mentioned, it could be trouble at the storage layer. Are you using local disk, S3, or other? If local disk, you'll need a fast SSD. If S3, make sure your server has a fast connection to AWS.

You can see how long storage transactions are taking by examining the Storage.log:

cat /opt/performa/logs/Storage.log | grep -E '\[transaction\]\[put\]'

These should be really fast, like only a few milliseconds. Example (S3):

[1730347621.944][2024-10-30 21:07:01][mendo.org][2456][Storage][transaction][put][hosts/mendo.org/data][{"elapsed_ms":25.576}]

Local disk should be even faster, like 3ms.

If you are seeing higher times, then something is wrong at the storage level (disk, S3 or other).
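If you'd rather compute an average than eyeball individual lines, you could pull the elapsed_ms values out of the trailing JSON column and average them with awk. A sketch, assuming the log line format shown above (the LOG variable is just for convenience):

```shell
# Average put-transaction latency from Storage.log.
# Assumes the line format shown above, with elapsed_ms in the trailing JSON column.
LOG="${LOG:-/opt/performa/logs/Storage.log}"
grep -E '\[transaction\]\[put\]' "$LOG" \
  | grep -oE '"elapsed_ms":[0-9.]+' \
  | cut -d: -f2 \
  | awk '{ sum += $1; n++ } END { if (n) printf "avg %.3f ms over %d puts\n", sum / n, n }'
```

If the average is well above a few milliseconds, that points at the storage layer rather than the web server.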

ChristophorusReyhan commented 2 days ago

> How many servers do you have Performa monitoring?

Currently hundreds.

> If the former (hundreds or thousands of servers), I HIGHLY recommend adding a max_sleep_ms property in your Performa Satellite configuration files on all your servers. This defaults to 5000 ms. I recommend 30000 ms for setups with hundreds or thousands of servers. This will randomly "spread out" the connections the servers make to the master server to report their metrics every minute.

Okay, I will try this first and see if it fixes it.

> I should have also mentioned, it could be trouble at the storage layer. Are you using local disk, S3, or other? If local disk, you'll need a fast SSD. If S3, make sure your server has a fast connection to AWS.
>
> You can see how long storage transactions are taking by examining the Storage.log:
>
> cat /opt/performa/logs/Storage.log | grep -E '\[transaction\]\[put\]'
>
> These should be really fast, like only a few milliseconds. Example (S3):
>
> [1730347621.944][2024-10-30 21:07:01][mendo.org][2456][Storage][transaction][put][hosts/mendo.org/data][{"elapsed_ms":25.576}]
>
> Local disk should be even faster, like 3ms.
>
> If you are seeing higher times, then something is wrong at the storage level (disk, S3 or other).

On average it's 1 ms, since it's on local disk. I don't think it's a storage problem (even though it's only an HDD).

I think this is mainly because Node.js is single-threaded and can't handle this many clients. The CPU on the node process is always busy, and it occasionally hits 100%.

Overall, I think you've made great software. Thank you very much!

ChristophorusReyhan commented 2 days ago

Hmm, I think creating multiple Performa servers and separating the workload is a viable option too.