sosreport / sos

A unified tool for collecting system logs and other debug information
http://sos.rtfd.org
GNU General Public License v2.0

Too many tailed files collected #3783

Open pmoravec opened 1 month ago

pmoravec commented 1 month ago

We noticed that some specific files are tailed very often across different sosreports. Below is a list of the most often tailed files, with my suggestion for each. Any comments or suggestions are welcome. Possible options are "leave as is", "increase sizelimit", or "drop that file or some data to truncate it".
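
For readers new to the term: "tailing" in this thread means keeping only the end of a file that exceeds the size limit. The following is a minimal sketch of that idea, not sos's actual implementation; `tail_file` and `limit_bytes` are hypothetical names used only for illustration.

```python
import os

def tail_file(path, limit_bytes):
    """Return (data, was_tailed): at most the last limit_bytes bytes of path.

    Hypothetical helper for illustration; not part of sos.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if size > limit_bytes:
            # Skip the head of the file and keep only the newest bytes.
            f.seek(size - limit_bytes)
            return f.read(), True
        return f.read(), False
```

The head of the file is what gets lost, which is why the oldest entries of a log go missing first.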

jcastill commented 1 month ago
* `sar/sa*.xml` : we collect the files due to legacy reasons only (imho). I would vote for dropping them (until somebody needs them). If that isn't welcome, let's increase sizelimit - having an incomplete/broken xml file is a bit useless.

I'm not sure these files are needed at all, but instead of dropping we could add an option to collect them if needed, in case anyone relies on them for any scripts. "Interpreted/decoded" ones in plain text are more useful.
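
To illustrate why a tailed XML file is of little use: once the head is cut off, the remainder no longer parses. This is a small sketch using Python's standard `xml.etree.ElementTree`; the element names are invented for the example and are not the real sysstat XML schema.

```python
import xml.etree.ElementTree as ET

# Made-up document; the element names are assumptions, not sysstat's schema.
doc = (
    "<sysstat><host><statistics>"
    + "<timestamp/>" * 1000
    + "</statistics></host></sysstat>"
)

# Simulate tailing: keep only the last 200 characters.
tailed = doc[-200:]

def is_well_formed(text):
    """Return True if text parses as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

Here `is_well_formed(doc)` holds while `is_well_formed(tailed)` does not, because the tail starts mid-document and its closing tags have no matching opening tags.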

* various `var/log/*` files, namely `messages*` or `audit.log` or `secure` - probably leave them as is; maybe `audit.log` or `secure` should be collected for the past X days instead of up to a given file size?

Agreed, maybe two/three days should be enough by default, or even just one day.
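
A rough sketch of what "collect for the past X days" could look like at file granularity, selecting rotated logs by modification time. This is only an illustration under that assumption; `files_newer_than` is a hypothetical helper, not an existing sos function, and filtering individual lines inside a file would additionally need timestamp parsing.

```python
import os
import time

def files_newer_than(paths, max_age_days, now=None):
    """Return the paths last modified within the past max_age_days days.

    Hypothetical helper for illustration; not part of sos.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    return [p for p in paths if os.path.getmtime(p) >= cutoff]
```

With a one-day default, a rotated file last written five days ago would be skipped while the current log would still be collected.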

* `logs/journalctl_--no-pager` - that is expected and reasonable, no action

Agreed

pmoravec commented 1 month ago
* `postgresql/var.lib.pgsql.data.log.postgresql-*.log` : this is most probably from Satellite / foreman systems with bigger postgres queries logged. Probably worth increasing the sizelimit, I will raise a PR for it

This happens for Satellite / foreman, where we already increased sizelimit to 100MB via a preset, and I confirm it is applied to these files. Raising it higher is possible, but not really worth it. Usually, the tailed files are from previous days only, which is sufficient.

haircommander commented 1 month ago

From my perspective as a node team member, crio and kubelet logs are the most important pieces for us to debug issues. We don't need them if they're caught in the overall journal though. Is bumping the size limit an option for those? Or we could bump the size limit for the overall journal and drop the crio/kubelet specific journals. What do folks think?

TurboTurtle commented 1 month ago

I'd prefer increasing the size limit of unit-specific journals and/or log files over increasing the system journal collection. It gives us granularity without enforcing potentially very large system journal collections across the board. Granted, I get the point of "well it's going to be the majority of the system journal anyway...", but I think this is the least-bad option overall.

As far as the sar/sa files go, I'd defer to support teams on how often they're used. I know there's been a general shift away from sar but there's a lot of knowledge built around the use of these, at least the plaintext translations. I'd be open to dropping the binary collections since you need to use the same version to translate those as which generated them (hence why we do that during collection at all), but I'd be wary of dropping them entirely.

jcastill commented 1 month ago

The plain text ones are used a lot, even though they are not the most accurate output you could get... but as a first step when looking into performance issues, they are good enough. I've searched internally and I haven't found any reference to the xmls or any tool that may use them, but "absence of proof...". I don't remember using them for any support case. I think there's an old tool, kSar, abandoned now, that used to read the xmls, but other than that nothing.

nrwahl2 commented 1 month ago

Pacemaker: It's been a couple of years since I've worked in support, so I would defer to any support engineers. Whether the limit is sufficient will always depend on how promptly the user opens a support ticket after an issue occurs, and on whether additional verbosity has been configured (it usually hasn't been).

We could increase the size limit to some arbitrary higher number. I don't know what fraction of sosreports have truncated Pacemaker log files currently and whether this would be worth doing.

Support engineers should not hesitate to request the full pacemaker.log file if the relevant timestamps are not present. Ideally, that should introduce only a small delay in investigation, though that depends on both the support team and the user.

pafernanr commented 1 month ago

Hello all,

+1 to removing sa*.xml files. They are redundant: the binary saXX files are also included and they contain the full day's dump. Those are sometimes truncated too, but that is unusual; it can happen if the interval is too short.

I'd also like to suggest increasing the size limit for the foreman plugin. These CSV files are sometimes truncated, which leads to missing important dynflow steps. Note that the plugin already limits the output to the last 14 days, which should be enough for any support case. That said, although I fully agree a limit is mandatory, in this specific plugin the file size limit is somewhat "redundant". IMO increasing it to 150/200M could be a good choice, so that the 14-day limit is what bounds the output in as many cases as possible.

pmoravec commented 1 month ago

SAR data: I would drop the xml as rarely-if-at-all used (I am asking internally, either way), while I would keep the binary data (the "source of truth" that we can copy to another system with the same sysstat version and get whatever we want) and also the text sarXX files (a concise enough text interpretation of the binary data).

Increasing the 100M limit of foreman's dynflow* tables: no strong opinion. Can you @pafernanr evaluate the impact? I.e. generate enough foreman tasks to have 200M of data in each such table, and compare execution time and tarball size for sizelimits of 100MB, 150MB and 200MB? On one side, we would get some more history of tasks. On the other side, the data are already ordered by time, so the most recent data is always present, and I am torn on whether it is worth paying the extra cost in longer time and tarball size to get that info. This sizelimit has affected my own investigation of foreman/Satellite support cases only rarely, hence my reluctant attitude. But if others hit it more often, no objections.

pmoravec commented 1 month ago

SAR: Feedback from two groups of support engineers in Red Hat: "we don't use the XML format, but we heavily use the binary saXX and text sarXX formats". So I would vote for dropping the xml format (and a reference in the release notes - so maybe worth waiting for the 4.8.2 tag to mention it in the "more major" 4.9 RN?)