Closed aprilrieger closed 2 weeks ago
After reviewing the ingress-nginx-controller logs on the besties cluster I was able to see the errors that the tenant covenant.hykuup.com
may have been experiencing.
In the logs there are errors relating to crowdsec:
2024/06/07 22:18:05 [error] 1949#1949: *3939872 connect() failed (111: Connection refused), client: 10.0.6.229, server: ~^(?<subdomain>[\w-]+)\.hykuup\.com$, request: "GET /assets/application-ac706023ff85121bba95713d72e8b2c64f75d1436a760b1263c6c0a87871aa7a.css HTTP/2.0", host: "covenant.hykuup.com", referrer: "https://covenant.hykuup.com/"
2024/06/07 22:18:05 [error] 1949#1949: *3939872 [lua] crowdsec.lua:600: Allow(): [Crowdsec] bouncer error: request failed: connection refused, client: 10.0.6.229, server: ~^(?<subdomain>[\w-]+)\.hykuup\.com$, request: "GET /assets/application-ac706023ff85121bba95713d72e8b2c64f75d1436a760b1263c6c0a87871aa7a.css HTTP/2.0", host: "covenant.hykuup.com", referrer: "https://covenant.hykuup.com/"
2024/06/07 22:18:05 [error] 1949#1949: *3939872 connect() failed (111: Connection refused), client: 10.0.6.229, server: ~^(?<subdomain>[\w-]+)\.hykuup\.com$, request: "GET /assets/application-c6e417f508888410f54c4560593ec9b171b82156711b0bcce66db15ea43a1ced.js HTTP/2.0", host: "covenant.hykuup.com", referrer: "https://covenant.hykuup.com/"
2024/06/07 22:18:05 [error] 1949#1949: *3939872 [lua] crowdsec.lua:600: Allow(): [Crowdsec] bouncer error: request failed: connection refused, client: 10.0.6.229, server: ~^(?<subdomain>[\w-]+)\.hykuup\.com$, request: "GET /assets/application-c6e417f508888410f54c4560593ec9b171b82156711b0bcce66db15ea43a1ced.js HTTP/2.0", host: "covenant.hykuup.com", referrer: "https://covenant.hykuup.com/"
2024/06/07 22:18:05 [error] 1949#1949: *3939872 connect() failed (111: Connection refused), client: 10.0.6.229, server: ~^(?<subdomain>[\w-]+)\.hykuup\.com$, request: "GET /system/logo_images/1/original/Covenant_Theological_Seminary_-_white_on_grey.png HTTP/2.0", host: "covenant.hykuup.com", referrer: "https://covenant.hykuup.com/"
2024/06/07 22:18:05 [error] 1949#1949: *3939872 [lua] crowdsec.lua:600: Allow(): [Crowdsec] bouncer error: request failed: connection refused, client: 10.0.6.229, server: ~^(?<subdomain>[\w-]+)\.hykuup\.com$, request: "GET /system/logo_images/1/original/Covenant_Theological_Seminary_-_white_on_grey.png HTTP/2.0", host: "covenant.hykuup.com", referrer: "https://covenant.hykuup.com/"
2024/06/07 22:18:05 [error] 1949#1949: *3939872 connect() failed (111: Connection refused), client: 10.0.6.229, server: ~^(?<subdomain>[\w-]+)\.hykuup\.com$, request: "GET /downloads/9a097ae0-564c-41c1-a1f5-6d6e4652f8e7?file=thumbnail HTTP/2.0", host: "covenant.hykuup.com", referrer: "https://covenant.hykuup.com/"
2024/06/07 22:18:05 [error] 1949#1949: *3939872 [lua] crowdsec.lua:600: Allow(): [Crowdsec] bouncer error: request failed: connection refused, client: 10.0.6.229, server: ~^(?<subdomain>[\w-]+)\.hykuup\.com$, request: "GET /downloads/9a097ae0-564c-41c1-a1f5-6d6e4652f8e7?file=thumbnail HTTP/2.0", host: "covenant.hykuup.com", referrer: "https://covenant.hykuup.com/"
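To get a sense of how many requests were affected per tenant, the bouncer errors above can be tallied by host. This is a rough sketch, not something that was run as part of this investigation: the regex is an assumption based only on the log lines pasted here, and `count_bouncer_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed shape, taken from the error lines pasted above:
#   ... [Crowdsec] bouncer error: <reason>, client: ..., host: "<host>", ...
BOUNCER_ERROR = re.compile(
    r'\[Crowdsec\] bouncer error: (?P<reason>[^,]+),.*host: "(?P<host>[^"]+)"'
)


def count_bouncer_errors(lines):
    """Count crowdsec bouncer errors per (host, reason) pair."""
    counts = Counter()
    for line in lines:
        match = BOUNCER_ERROR.search(line)
        if match:
            key = (match.group("host"), match.group("reason").strip())
            counts[key] += 1
    return counts
```

Feeding it the lines of a saved controller log (for example `count_bouncer_errors(open(path))`, where `path` is wherever the log was saved) returns a `Counter` whose `most_common()` shows which tenants were hit hardest. The plain `connect() failed` lines are deliberately not counted, since each one is paired with a bouncer error for the same request.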
When I go to the crowdsec namespace and look into the crowdsec pod, I see that the pod has been restarting almost hourly with OOMKilled errors:
Last state: Terminated with 137: OOMKilled, started: Fri, Jun 7 2024 2:12:04 pm, finished: Fri, Jun 7 2024 3:18:04 pm
Last state: Terminated with 137: OOMKilled, started: Fri, Jun 7 2024 3:18:05 pm, finished: Fri, Jun 7 2024 4:06:30 pm
crowdsec-lapi-6d788bcb57-z5mw6_crowdsec-lapi.log ingress-nginx-controller-6c9ff5f569-fx6xz_controller (3).log ingress-nginx-controller-6c9ff5f569-ks4kh_controller (2).log
The repo.samvera tenant had a slow-to-load issue (https://assaydepot.slack.com/archives/C03CA8XRP3L/p1717805368959019 https://assaydepot.slack.com/archives/C03CA8XRP3L/p1717811140396729) at the same time that I saw the OOM killer hit the crowdsec pod and the pod restart.
I added a website monitor so we can track this specific tenant over the weekend: https://www.site24x7.com/app/client#/home/monitors/195989000072363003/Summary
I also see that the resource request is 100MiB and the resource limit is also set to 100MiB. I upped it to 200MiB for the weekend to see whether that helps reduce the number of OOM kills.
The crowdsec issue has been resolved, but I am still seeing issues across multiple hykuup tenants.
Looked at the cluster and scaled up another ingress-nginx so each node had one, but still seeing the sites on hykuup flap.
slack s3-engineering call for additional help: https://assaydepot.slack.com/archives/C0313NK5NMA/p1718232353329779
Seeing several bots hitting the svc hykuup-knapsack-production-hykuup-knapsack-production-hyrax-80,
where they are getting HTTP status code 200. Listed below are the agents I have observed in the ingress-nginx logs on besties (added logs for review).
Agents seen:
(+http://www.facebook.com/externalhit_uatext.php)
(KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
(KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
(+https://www.semanticscholar.org/crawler)
log entries:
(+https://www.semanticscholar.org/crawler)
10.0.4.82 - - [12/Jun/2024:22:12:23 +0000] "GET /catalog?f%5Bsubject_sim%5D%5B%5D=Samvera+Community&locale=en HTTP/1.1" 200 11127 "-" "Mozilla/5.0 (compatible) SemanticScholarBot (+https://www.semanticscholar.org/crawler)" 327 23.377 [hykuup-knapsack-production-hykuup-knapsack-production-hyrax-80] [] 10.0.6.215:3000 11116 23.315 200 b99d05443a48aebfa84e322c8d9127a7
(KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
10.0.5.234 - - [12/Jun/2024:22:11:13 +0000] "GET /catalog?f%5Bcreator_sim%5D%5B%5D=Murdock%2C+Michael&f%5Bkeyword_sim%5D%5B%5D=T-shirt+design&f%5Bresource_type_sim%5D%5B%5D=Image&locale=es&view=gallery HTTP/2.0" 200 8856 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)" 706 9.670 [hykuup-knapsack-production-hykuup-knapsack-production-hyrax-80] [] 10.0.6.215:3000 8879 9.645 200 c26813659658bd47f8ebcaafdad70427
(+http://www.facebook.com/externalhit_uatext.php)
10.0.4.82 - - [12/Jun/2024:22:09:57 +0000] "GET /catalog?f%5Bcontributor_sim%5D%5B%5D=University+of+Oregon&f%5Bcreator_sim%5D%5B%5D=Mellinger%2C+Margaret&f%5Bcreator_sim%5D%5B%5D=Sato%2C+Linda&f%5Bcreator_sim%5D%5B%5D=Barth%2C+Duncan&locale=en&per_page=20 HTTP/2.0" 200 39074 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)" 240 6.192 [hykuup-knapsack-production-hykuup-knapsack-production-hyrax-80] [] 10.0.6.215:3000 39087 6.169 200 2e0fddc50c9cda7732d2b2cd0d3da7c1
(KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
10.0.5.234 - - [12/Jun/2024:21:51:27 +0000] "GET /catalog/facet/keyword_sim?f%5Bcontributor_sim%5D%5B%5D=Northwestern+University&locale=es&per_page=20&view=list HTTP/2.0" 200 4172 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36" 334 15.018 [hykuup-knapsack-production-hykuup-knapsack-production-hyrax-80] [] 10.0.5.54:3000 4195 14.990 200 1fd6f050ec8561d2da802564aa01dead
Added logs for review: hykuup-knapsack-production-hyrax-bfd9cd68f-qljqc_hyrax.log hykuup-knapsack-production-hyrax-bfd9cd68f-zlm8l_hyrax.log ingress-nginx-controller-844cb8786f-pgm5l_controller (2).log ingress-nginx-controller-844cb8786f-dgsnp_controller (2).log ingress-nginx-controller-844cb8786f-9njfj_controller (2).log ingress-nginx-controller-844cb8786f-4lb7h_controller (2).log
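For anyone reviewing the attached controller logs, a quick tally of requests and average request time per bot may help show which crawler is the heaviest. The sketch below is only an illustration: it assumes the lines follow the ingress-nginx access log layout seen in the entries above (quoted request, status, bytes sent, quoted referer, quoted user agent, request length, request time, upstream name in brackets), and `bot_name` and `summarize` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Assumed field order, based on the access log entries pasted above.
ACCESS_LINE = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" '
    r'(?P<req_len>\d+) (?P<req_time>[\d.]+) \[(?P<upstream>[^\]]*)\]'
)

# Only the agents observed in this issue; extend as needed.
BOT_TOKENS = ("facebookexternalhit", "ClaudeBot", "bingbot", "SemanticScholarBot")


def bot_name(agent):
    """Return the first known bot token found in the user agent, else 'other'."""
    lowered = agent.lower()
    for token in BOT_TOKENS:
        if token.lower() in lowered:
            return token
    return "other"


def summarize(lines):
    """Map bot name -> (request count, mean request time in seconds)."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = ACCESS_LINE.search(line)
        if not match:
            continue  # skip error-log lines and anything in another format
        entry = totals[bot_name(match.group("agent"))]
        entry[0] += 1
        entry[1] += float(match.group("req_time"))
    return {name: (count, total / count) for name, (count, total) in totals.items()}
```

Lines that do not match the assumed layout (such as the `[error]` lines) are skipped rather than raising, so the same function can be pointed at a mixed controller log.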
The hykuup knapsack tenant covenant.hykuup.com has communicated that the load times for their site are long. Please investigate.