prometheus / pushgateway

Push acceptor for ephemeral and batch jobs.
Apache License 2.0

Lots of CLOSE_WAIT connections consume massive memory. #340

Closed alpaca2333 closed 4 years ago

alpaca2333 commented 4 years ago

Inside a pushgateway pod, check with netstat:

/pushgateway $ netstat -an | grep CLOSE_WAIT | wc -l
994
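
To see the full breakdown by TCP state rather than only the CLOSE_WAIT count, the netstat output can be tallied with a short script. This is a minimal Python sketch, not something from the thread: `count_tcp_states` is a hypothetical helper, the sample addresses are made up, and it assumes the Linux-style `netstat -an` layout in which the state is the sixth column.

```python
from collections import Counter

# Made-up sample in the usual Linux `netstat -an` layout:
# proto, recv-q, send-q, local address, remote address, state.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address     Foreign Address    State
tcp        0      0 0.0.0.0:9091      0.0.0.0:*          LISTEN
tcp        1      0 10.0.0.5:9091     10.0.1.7:41234     CLOSE_WAIT
tcp        1      0 10.0.0.5:9091     10.0.1.8:52110     CLOSE_WAIT
tcp        0      0 10.0.0.5:9091     10.0.1.9:40022     ESTABLISHED
"""


def count_tcp_states(netstat_output):
    """Count TCP connections per state in `netstat -an` text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Only TCP rows carry a state; skip headers, UDP and UNIX sockets.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts


if __name__ == "__main__":
    for state, n in count_tcp_states(SAMPLE).most_common():
        print(state, n)
```

A count that is dominated by CLOSE_WAIT, as in the output above, means the peers have already closed while the local process still holds the sockets open.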

pprof through kubectl port-forward: image

I am pushing a large volume of metrics to Pushgateway from over 300 machines. The pushes are distributed across 30 Pushgateway instances by consistent hashing. However, memory consumption differs hugely among the instances. According to wget -O - localhost:9091/metrics | wc -l run inside each pod, the number of metrics per Pushgateway does not vary much. The instances that consume a lot of memory mostly have many unclosed connections, as shown above.

By the way, the client uses a context with a 1-second timeout, so the client must have closed the connection.
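
To illustrate what a client that gives up leaves behind on the server, here is a minimal Python sketch using plain sockets on localhost. It is not the actual push client or Pushgateway code, and `demo_half_closed` is a hypothetical name. The client sends some bytes and closes; the server then sees end-of-stream, and its socket stays in CLOSE_WAIT until the server itself calls `close()`.

```python
import socket


def demo_half_closed():
    """Show the server-side view after the client closes first."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port), timeout=1.0)
    conn, _ = server.accept()
    conn.settimeout(5.0)

    client.sendall(b"push")
    client.close()  # the client is done (for example, its timeout fired)

    # Read until recv() returns b"", which signals that the peer closed.
    chunks = []
    saw_eof = False
    while True:
        chunk = conn.recv(1024)
        if not chunk:
            saw_eof = True
            break
        chunks.append(chunk)

    # From here until conn.close(), the OS reports this socket as CLOSE_WAIT.
    # A server that is busy with other work and delays this call piles them up.
    conn.close()
    server.close()
    return b"".join(chunks), saw_eof


if __name__ == "__main__":
    print(demo_half_closed())
```

So a client-side timeout does close the client's half, but the server's half only goes away once the server process gets around to closing it.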

$ kubectl top po -npg
NAME                                        CPU(cores)   MEMORY(bytes)   
pg-prometheus-pushgateway-657f7d88b-46wtg   1154m        6267Mi          
pg-prometheus-pushgateway-657f7d88b-4cwwp   206m         53Mi            
pg-prometheus-pushgateway-657f7d88b-4zdbd   146m         42Mi            
pg-prometheus-pushgateway-657f7d88b-69xms   423m         80Mi            
pg-prometheus-pushgateway-657f7d88b-76h69   1057m        114Mi           
pg-prometheus-pushgateway-657f7d88b-9ht9t   379m         76Mi            
pg-prometheus-pushgateway-657f7d88b-bdb4g   401m         83Mi            
pg-prometheus-pushgateway-657f7d88b-cv2mg   222m         56Mi            
pg-prometheus-pushgateway-657f7d88b-d9zw6   366m         72Mi            
pg-prometheus-pushgateway-657f7d88b-fs2z5   784m         93Mi            
pg-prometheus-pushgateway-657f7d88b-fvczh   617m         85Mi            
pg-prometheus-pushgateway-657f7d88b-g9hbh   467m         79Mi            
pg-prometheus-pushgateway-657f7d88b-hq9ch   1181m        3410Mi          
pg-prometheus-pushgateway-657f7d88b-j2bhk   176m         82Mi            
pg-prometheus-pushgateway-657f7d88b-jxqdq   818m         95Mi            
pg-prometheus-pushgateway-657f7d88b-k9ztr   466m         71Mi            
pg-prometheus-pushgateway-657f7d88b-l2scz   1174m        1264Mi          
pg-prometheus-pushgateway-657f7d88b-l5gcs   899m         85Mi            
pg-prometheus-pushgateway-657f7d88b-llrx6   746m         95Mi            
pg-prometheus-pushgateway-657f7d88b-m6tdm   595m         66Mi            
pg-prometheus-pushgateway-657f7d88b-mc9mx   1227m        282Mi           
pg-prometheus-pushgateway-657f7d88b-mhxck   295m         58Mi            
pg-prometheus-pushgateway-657f7d88b-pxqgs   290m         77Mi            
pg-prometheus-pushgateway-657f7d88b-qfm2c   225m         50Mi            
pg-prometheus-pushgateway-657f7d88b-rmrm5   1138m        7040Mi          
pg-prometheus-pushgateway-657f7d88b-srkzn   398m         89Mi            
pg-prometheus-pushgateway-657f7d88b-sv8ch   746m         80Mi            
pg-prometheus-pushgateway-657f7d88b-wbxmp   840m         93Mi            
pg-prometheus-pushgateway-657f7d88b-x52qs   1165m        1531Mi          
pg-prometheus-pushgateway-657f7d88b-xrsgk   688m         77Mi   

Please help, thanks.

alpaca2333 commented 4 years ago

pprof results: top: image

flame: image

alpaca2333 commented 4 years ago

I disabled the consistency check.

Drewster727 commented 4 years ago

@alpaca2333 I am seeing the same or very similar issue. I was seeing my pushgateway climb in memory and eventually go OOM. I do see CLOSE_WAITS piling up little by little. What's the deal here?

Drewster727 commented 4 years ago

@alpaca2333 Yeah, I see what you mean about the consistency check. I shut it off as well via --push.disable-consistency-check and my problem went away. Still seems like a bug, no?

alpaca2333 commented 4 years ago

@Drewster727 No, I don't think it is a bug. CLOSE_WAIT results from the server not closing its side of the connection in time. According to the profiles above, Pushgateway cannot process requests as fast as they come in. As mentioned in the comments, the consistency check is very heavy: each time you push metrics to Pushgateway, the check calls gather.Gather(), which in turn calls sort.Sort() on all the metrics you have pushed. Simply disable it.
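
The cost argument above can be sketched with a toy model. This is an illustration in Python, not Pushgateway's actual implementation: `ToyGateway` and its fields are hypothetical, and the only thing it models is the claim that each push with the check enabled gathers and sorts everything stored so far, so the work per push grows with the total number of stored metrics rather than with the size of the push.

```python
class ToyGateway:
    """Toy push store that counts how many samples each check sorts."""

    def __init__(self, consistency_check=True):
        self.consistency_check = consistency_check
        self.groups = {}         # group name -> {metric name: value}
        self.samples_sorted = 0  # running total of samples passed to sorted()

    def push(self, group, metrics):
        # Build the state as it would look after accepting this push.
        candidate = dict(self.groups)
        candidate[group] = dict(metrics)

        if self.consistency_check:
            # Model of the check: gather every stored sample plus the new
            # ones and sort them, regardless of how small this push is.
            gathered = sorted(
                (name, g) for g, ms in candidate.items() for name in ms
            )
            self.samples_sorted += len(gathered)

        self.groups = candidate


if __name__ == "__main__":
    for enabled in (True, False):
        gw = ToyGateway(consistency_check=enabled)
        for i in range(300):  # 300 pushers with 100 metrics each
            gw.push(f"machine-{i}", {f"metric_{j}": j for j in range(100)})
        print(enabled, gw.samples_sorted)
```

In this model the total work with the check enabled grows quadratically with the number of pushing groups, which is consistent with a heavily loaded instance falling behind while clients time out.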

beorn7 commented 4 years ago

Also, please consider re-architecting your setup. If you put sufficient load onto your Pushgateway to make the performance overhead of the consistency check matter, you are almost certainly using the Pushgateway for something it was not designed for. It might look like things are just fine, but your setup is fundamentally brittle.