river-build / river


pg instance failure due to OOM #1091

Open jterzis opened 1 week ago

jterzis commented 1 week ago

Describe the bug An operator's cloud pg instance went down, apparently due to an OOM error. During the incident, a spike in node-to-pg connections was observed. It is unclear whether the stream nodes opened these connections because of organic client traffic or in response to the pg service interruption caused by the OOM error. We need to confirm that nodes do not DoS pg with new connections when pg fails.

To Reproduce Steps to reproduce the behavior:

Expected behavior Confirm that stream nodes do not aggressively open new pg connections when the pg service is interrupted, and that no other code path in the stream node creates inorganic pg connections (connections uncorrelated with actual client requests).
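To make the expected behavior concrete, here is a minimal sketch of a reconnect loop that backs off instead of hammering pg while it is down. It is illustrative only and is not taken from the stream node code: `dial`, `backoff_delay`, `connect_with_backoff`, and all default values are hypothetical. The technique shown is capped exponential backoff with full jitter.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Upper bound in seconds for the wait before retry `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def connect_with_backoff(dial, max_attempts=6, base=0.5, cap=30.0,
                         sleep=time.sleep, rng=random.random):
    """Call `dial()` until it succeeds, waiting longer after each failure.

    `dial` is a hypothetical zero-argument callable that returns a connection
    or raises ConnectionError; it stands in for the node's real connect call.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return dial()
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # Full jitter: sleep a random fraction of the capped
                # exponential delay so nodes do not retry in lockstep.
                sleep(rng() * backoff_delay(attempt, base, cap))
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

With a loop like this, a pg outage produces a bounded, slowing trickle of connection attempts per node rather than a spike. Whether the node's actual pool already behaves this way is exactly what this issue asks to confirm.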

Screenshots telegram-cloud-photo-size-1-5138982082981244551-y

Screenshot 2024-09-17 at 2 52 57 PM

Logs

Additional context The operator was running a single pg 14 instance with 30 GB of memory against 4 mainnet nodes. After the OOM error, the instance was upgraded to 100 GB of memory.
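Since each pg connection is served by its own backend process, a connection spike adds memory pressure on the instance. One way to bound this is to cap each node's pool so that all nodes together stay under the server's connection limit. The sketch below is a rough budget calculation under assumptions: the operator's actual `max_connections` is not stated in this issue, so the example uses the PostgreSQL defaults (100 connections, 3 reserved for superusers), and `per_node_pool_cap` is a hypothetical helper.

```python
def per_node_pool_cap(max_connections, node_count, reserved=3):
    """Largest per-node pool size that keeps all nodes under the server limit.

    `reserved` models connections held back for superusers/admin access
    (PostgreSQL's superuser_reserved_connections defaults to 3).
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    usable = max_connections - reserved
    if usable < node_count:
        raise ValueError("not enough connections for one per node")
    return usable // node_count


# Assumed defaults with the 4 mainnet nodes from this report.
print(per_node_pool_cap(max_connections=100, node_count=4))  # prints 24
```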

sergekh2 commented 6 days ago

Let's wait for the upgrade to pg 16 and then proceed if the problem persists.