Reduce retention times for archives and monitoring data

redhat-developer / osd-monitor-poc


Reduce retention times for archives and monitoring data #59

Closed miteshvp closed 4 years ago

miteshvp commented 4 years ago

Trying this PR because I am getting lots of console messages like this:

[Wed Nov 6 09:45:13] pmmgr(1/1): /etc/pcp/pmmgr-pod: log directory /var/log/pcp/pmmgr 91% full, adjusting to 67% retention times

fche commented 4 years ago

By the way, if the actual file system fullness numbers don't get up to the 99%ish mark during any operation, then you don't really have to manually tune the numbers. Those "adjusting..." messages are warnings that the system is noticing low space and itself automatically reducing retention figures.

miteshvp commented 4 years ago

@fche - that's a relief, good to know. The reason I am trying all these things is that, all of a sudden, Prometheus data is no longer being sent to the app-sre-grafana dashboard. Here is what I found:

pcp-prometheus-in was timing out while getting the data from the actual Prometheus metrics.
/etc/pcp/pmmgr-pod was getting killed time and again, along with those warnings about storage.
The liveness check is failing very often for the pcp-central-webapi pod. I just created #60 to see if it stabilizes.

My biggest concern is that all this trial and error is not helping. Do you have any clue where to look?

fche commented 4 years ago

Let me log on and take a peek at the logs, if I still have even that much access. Is there somewhere we can chat more live?

fche commented 4 years ago

I no longer have 'edit' (terminal logon) privileges on console.dsaas.openshift.com, so my insight is limited. However, the logs of the pcp-prometheus-in pod indicate it's rejecting rhche_host.url and jaeger.url coming in from the filesystem-mounted OpenShift configmaps. I think the "/..2019_0626*/" pathname component is what's messing things up. If that component were not there, pcp pmdaprometheus would be fine with those urls.

Maybe something changed in the way OpenShift has started plopping those .url files into the pod filesystem? We may be able to hack on the pmdaprometheus Python code to ignore the whole thing, by replacing line 937:

name = file_split[0].replace(self.config_dir + "/", "").replace("/",".")

with:

name = os.path.basename(file_split[0])
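
To make the effect of that change concrete, here is a minimal standalone sketch comparing the two ways of deriving the name. It is not the actual pmdaprometheus code: the function names, the config directory, and the `..HYPOTHETICAL_SNAPSHOT` directory are made up for illustration, and it assumes `file_split` comes from `os.path.splitext`, so the basename is taken from its first element.

```python
import os


def name_from_relative_path(path, config_dir):
    # Mirrors the expression quoted from line 937: strip the config
    # directory prefix, then turn remaining '/' separators into '.'.
    file_split = os.path.splitext(path)  # assumption: how file_split is built
    return file_split[0].replace(config_dir + "/", "").replace("/", ".")


def name_from_basename(path):
    # Suggested alternative: keep only the final path component, so any
    # intermediate directory (such as a dotted snapshot dir) is ignored.
    file_split = os.path.splitext(path)
    return os.path.basename(file_split[0])


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    config_dir = "/etc/pcp/prometheus"
    path = config_dir + "/..HYPOTHETICAL_SNAPSHOT/jaeger.url"

    print(name_from_relative_path(path, config_dir))
    print(name_from_basename(path))
```

With the first function the extra directory leaks into the derived name (including its leading dots), which is presumably what gets rejected. With the second, only the file's own name survives. One trade-off to keep in mind: taking the basename means two .url files with the same name in different subdirectories would map to the same name.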