sosreport / sos

A unified tool for collecting system logs and other debug information
http://sos.rtfd.org
GNU General Public License v2.0

[tests] FullCleanTest.test_private_map_was_generated times out when running in a container #3283

Open pmoravec opened 1 year ago

pmoravec commented 1 year ago

When running avocado tests in a container(*), this test easily times out despite its 10-minute timeout (https://github.com/sosreport/sos/blob/main/tests/cleaner_tests/full_report/full_report_run.py#L25).

The main cause is that sos report alone takes 8 minutes (and the subsequent clean is expected to run a few times longer still, so even a 20-minute timeout might not be sufficient). We can increase the timeout as a defensive fix, but to what value? And does it make sense to optimise the run itself (see the sketch below)? The lengthiest plugins are:

[stdlog] 2023-06-21 10:36:36,750 avocado.utils.process DEBUG| [stdout] [plugin:process] collected plugin 'process' in 79.25696086883545
[stdlog] 2023-06-21 10:39:20,913 avocado.utils.process DEBUG| [stdout] [plugin:system] collected plugin 'system' in 97.09700441360474
[stdlog] 2023-06-21 10:37:35,111 avocado.utils.process DEBUG| [stdout] [plugin:processor] collected plugin 'processor' in 123.39306426048279
[stdlog] 2023-06-21 10:39:20,920 avocado.utils.process DEBUG| [stdout] [plugin:selinux] collected plugin 'selinux' in 148.71357417106628
[stdlog] 2023-06-21 10:38:41,883 avocado.utils.process DEBUG| [stdout] [plugin:cgroups] collected plugin 'cgroups' in 341.3810544013977

(*) I think the fact that sos runs in a container contributes heavily to the duration of all those plugins (esp. cgroups).

Does it make sense to call sos here with an option such as --plugin-timeout 60 (or maybe 90)? For the purposes of cleaner testing we are not particularly interested in files like /sys/fs/cgroup/cpuacct/system.slice/sys-kernel-config.mount/tasks (collecting that file alone took over 2 seconds).
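A rough sketch of both knobs applied to the failing test. The import path, base class, and attribute names (sos_timeout, sos_cmd) are assumptions about the avocado-based test framework in this repo, not verified here; --plugin-timeout itself is the real sos option mentioned above:

```python
# Sketch only: raise the overall test timeout defensively and cap each
# plugin's runtime so one slow plugin cannot dominate the whole run.
# Import path, base class and attribute names are assumptions.
from sos_tests import StageTwoReportTest


class FullCleanTest(StageTwoReportTest):
    """Run `sos report --clean` and check that a private map was generated."""

    # Defensive fix: raise the per-test timeout from 10 to e.g. 30 minutes.
    sos_timeout = 1800

    # Targeted fix: limit every plugin to 60 seconds so cgroups, selinux,
    # processor, etc. cannot eat most of the test's budget.
    sos_cmd = '-v --clean --plugin-timeout 60'
```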

pmoravec commented 1 year ago

Alternatively, we might have an environment variable (defaulting to the current value) to customize the sos_timeout per avocado run? (Though that does not address my "too lengthy plugins" point.)
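A minimal sketch of that idea, wherever the timeout is currently defined; the variable name SOS_TEST_TIMEOUT is made up for illustration:

```python
# Hypothetical environment-variable override for the test timeout.
# Falls back to the current 10-minute value when the variable is not set.
import os

sos_timeout = int(os.environ.get('SOS_TEST_TIMEOUT', 600))
```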

arif-ali commented 1 year ago

That's interesting; most of my testing happens in a container on my laptop, and I haven't seen any timeout issues like this so far. Albeit it's an LXD container and not podman/buildah/docker.

pmoravec commented 1 year ago

The "container blame" is just a theory as I dont exactly know the full environment where we noticed such timeouts. The lengthy plugins usually run much faster (esp. cgroups) and their execution time "scale up" with number of containers on the system, afaik.

TurboTurtle commented 1 year ago

How are these potentially problematic containers launched, exactly? Containers are in most respects the same as running on bare metal, so this kind of performance drop is surprising.

That being said, cgroups taking longer makes sense if there are dozens or even hundreds of containers running, as each container creates a lot of new collections in the cgroups plugin - same for openshift, crio, etc., if container logs are requested. But ones like system, selinux, and process are surprising to see.
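For context, this is roughly the pattern that makes such collections scale with the container count; a simplified sketch using the generic Plugin collection API, not the actual cgroups or container-runtime plugin code:

```python
# Illustration only: each running container contributes its own cgroup
# subtree and, for runtime plugins, extra per-container commands, so
# collection time grows roughly linearly with the number of containers.
from sos.report.plugins import Plugin, IndependentPlugin


class ExampleContainers(Plugin, IndependentPlugin):
    """Hypothetical plugin with per-container collections."""

    short_desc = 'example of per-container collections'
    plugin_name = 'example_containers'

    def setup(self):
        # One copy spec, but it expands to thousands of tiny files when
        # every container adds its own directory under /sys/fs/cgroup.
        self.add_copy_spec('/sys/fs/cgroup')

        # Per-container commands multiply the same way; list_containers()
        # is a stand-in for however the runtime would be enumerated.
        for name in self.list_containers():
            self.add_cmd_output(f'podman inspect {name}')

    def list_containers(self):
        # Placeholder enumeration for the sketch.
        return []
```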

pmoravec commented 1 year ago

We are still investigating this, but we can make tests/report_tests/options_tests/options_tests.py:OptionsFromConfigTest much faster in general by skipping many plugins (or enabling only those for which we have a particular test case).

https://github.com/sosreport/sos/pull/3288 has been raised for it.
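As an illustration of the "enable only the plugins the test actually exercises" idea (not necessarily what the PR does), the options test could restrict the run with -o; class and attribute names below follow the same assumed test-framework conventions as earlier:

```python
# Sketch: run only a couple of fast plugins instead of a full report,
# which is enough to verify that options from the config file are honoured.
from sos_tests import StageOneReportTest


class OptionsFromConfigTest(StageOneReportTest):
    """Check that options set in sos.conf take effect."""

    # `-o host,kernel` limits collection to two quick plugins.
    sos_cmd = '-v -o host,kernel'
```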