sosreport / sos

A unified tool for collecting system logs and other debug information
http://sos.rtfd.org
GNU General Public License v2.0
514 stars 541 forks source link

[tests] 24.10 Stage Two tests failing for logs #3853

Open arif-ali opened 1 week ago

arif-ali commented 1 week ago

There's been a few occurances now, and have found that the test below failing, and on restart of the job, it works as normal

066-tests_report_tests_plugin_tests_logs.py_JournalSizeLimitTest.test_journal_tailed_and_linked

debug.log from avocodo

[stdlog] 2024-11-26 08:43:06,719 avocado.test test             L0313 INFO | INIT 1-tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_journal_tailed_and_linked
[stdlog] 2024-11-26 08:43:06,721 avocado.test parameters       L0141 DEBUG| PARAMS (key=timeout, path=*, default=None) => None
[stdlog] 2024-11-26 08:43:06,721 avocado.test parameters       L0141 DEBUG| PARAMS (key=timeout_factor, path=*, default=1.0) => 1.0
[stdlog] 2024-11-26 08:43:06,721 avocado.test test             L0345 DEBUG| Test metadata:
[stdlog] 2024-11-26 08:43:06,721 avocado.test test             L0347 DEBUG|   filename: /tmp/cirrus-ci-build/tests/report_tests/plugin_tests/logs.py
[stdlog] 2024-11-26 08:43:06,721 avocado.test test             L0353 DEBUG|   teststmpdir: /var/tmp/avocado_30pafaq7
[stdlog] 2024-11-26 08:43:06,721 avocado.test test             L0354 DEBUG|   original timeout: None
[stdlog] 2024-11-26 08:43:06,721 avocado.test test             L0355 DEBUG|   timeout factor: 1.0
[stdlog] 2024-11-26 08:43:06,721 avocado.test test             L0356 DEBUG|   actual timeout: None
[stdlog] 2024-11-26 08:43:06,722 avocado.test test             L0548 INFO | START 1-tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_journal_tailed_and_linked
[stdlog] 2024-11-26 08:43:06,724 avocado.test stacktrace       L0040 ERROR| 
[stdlog] 2024-11-26 08:43:06,725 avocado.test stacktrace       L0042 ERROR| Reproduced traceback from: /usr/local/lib/python3.12/dist-packages/avocado/core/test.py:638
[stdlog] 2024-11-26 08:43:06,726 avocado.test stacktrace       L0049 ERROR| Traceback (most recent call last):
[stdlog] 2024-11-26 08:43:06,726 avocado.test stacktrace       L0049 ERROR|   File "/tmp/cirrus-ci-build/tests/report_tests/plugin_tests/logs.py", line 85, in test_journal_tailed_and_linked
[stdlog] 2024-11-26 08:43:06,726 avocado.test stacktrace       L0049 ERROR|     self.assertFileExists(tailed)
[stdlog] 2024-11-26 08:43:06,726 avocado.test stacktrace       L0049 ERROR|   File "/tmp/cirrus-ci-build/tests/sos_tests.py", line 355, in assertFileExists
[stdlog] 2024-11-26 08:43:06,726 avocado.test stacktrace       L0049 ERROR|     assert os.path.exists(fname), f"{fname} does not exist"
[stdlog] 2024-11-26 08:43:06,726 avocado.test stacktrace       L0049 ERROR|            ^^^^^^^^^^^^^^^^^^^^^
[stdlog] 2024-11-26 08:43:06,727 avocado.test stacktrace       L0049 ERROR| AssertionError: /var/tmp/avocado_30pafaq7sos_tests.py.JournalSizeLimitTest/sosreport-JournalSizeLimitTest/sosrepor
t-cirrus-task-5545356358844416-2024-11-26-tsuiwhi/sos_strings/logs/journalctl_--no-pager.tailed does not exist
[stdlog] 2024-11-26 08:43:06,727 avocado.test stacktrace       L0050 ERROR| 
[stdlog] 2024-11-26 08:43:06,727 avocado.test test             L0642 DEBUG| Local variables:
[stdlog] 2024-11-26 08:43:06,746 avocado.test test             L0645 DEBUG|  -> self <class 'logs.JournalSizeLimitTest'>: 1-tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_
journal_tailed_and_linked
[stdlog] 2024-11-26 08:43:06,746 avocado.test test             L0645 DEBUG|  -> tailed <class 'str'>: /var/tmp/avocado_30pafaq7sos_tests.py.JournalSizeLimitTest/sosreport-JournalSizeLimitTes
t/sosreport-cirrus-task-5545356358844416-2024-11-26-tsuiwhi/sos_strings/logs/journalctl_--no-pager.tailed
[stdlog] 2024-11-26 08:43:06,746 avocado.test test             L0740 ERROR| FAIL 1-tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_journal_tailed_and_linked -> AssertionErr
or: /var/tmp/avocado_30pafaq7sos_tests.py.JournalSizeLimitTest/sosreport-JournalSizeLimitTest/sosreport-cirrus-task-5545356358844416-2024-11-26-tsuiwhi/sos_strings/logs/journalctl_--no-pager
.tailed does not exist
[stdlog] 2024-11-26 08:43:06,746 avocado.test test             L0733 INFO | 

The contents of the logs folder

total 1336
drwx------ 2 arif arif   4096 Nov 26 08:43 .
drwxr-xr-x 3 arif arif   4096 Nov 26 08:43 ..
-rw-r--r-- 1 arif arif     61 Nov 26 08:43 journalctl_--disk-usage
-rw-r--r-- 1 arif arif 668618 Nov 26 08:43 journalctl_--no-pager
-rw-r--r-- 1 arif arif 602485 Nov 26 08:43 journalctl_--no-pager_--boot
-rw-r--r-- 1 arif arif  66089 Nov 26 08:43 journalctl_--no-pager_--boot_-1
-rw-r--r-- 1 arif arif   4280 Nov 26 08:43 ls_-alZR_.var.log

We can see that the file never got created, and hence the test failed. Needs further investigation on what is going on here

pmoravec commented 1 week ago

I can reproduce it on upstream sos:

# PYTHONPATH=tests/ avocado run -p TESTLOCAL=true --test-runner=runner -t stagetwo tests/report_tests/plugin_tests/logs.py
JOB ID     : a6e312d650f35bed6cc6bb6110c893c60c92b956
JOB LOG    : /root/avocado/job-results/job-2024-11-26T11.19-a6e312d/job.log
 (1/3) tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_journal_size_limit: PASS (5.63 s)
 (2/3) tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_journal_tailed_and_linked: FAIL: /var/tmp/avocado_af3k3wvgsos_tests.py.JournalSizeLimitTest/sosreport-JournalSizeLimitTest/sosreport-pmoravec-rhel9-2024-11-26-ckscbzp/sos_strings/logs/journalctl_--no-pager.tailed does not exist (0.02 s)
 (3/3) tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_archive_created: PASS (0.01 s)
RESULTS    : PASS 2 | ERROR 0 | FAIL 1 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB TIME   : 6.24 s
#

I will try to dig into it..

jcastill commented 1 week ago

I think the problem may be that the log is not always big enough to be tailed. I ran the tests in a RHEL9 machine and it succeeded, with a tailed file of 20M:

$ ll sos_strings/logs/ 
total 40960
-rw-r--r--. 1 root root 20969983 Nov 26 11:14 journalctl_--no-pager_--boot.tailed
-rw-r--r--. 1 root root 20969983 Nov 26 11:14 journalctl_--no-pager.tailed

But in the logs that @arif-ali pasted, the files seem to be smaller than that. @pmoravec how big were the files on your side?

arif-ali commented 1 week ago

Maybe the runs are too quick on GCE that we're not catching it any more. I've run the same stagetwo tests locally 6 times on a fresh VM every single time, and I can't reproduce the problem either.

So, maybe we need some kind of wait for this particular test for the test to be successful?

In my test, I am doing some automation, and creating a fresh new VM every time. So, 6 fresh VMs

pmoravec commented 1 week ago

Indeed, in my case the journal is small:

# journalctl --no-pager | wc
  34466  339490 4699809
# journalctl --disk-usage
Archived and active journals take up 24.1M in the file system.
# 

(it's RHEL9)

Also the different size of journal indirectly explains the randomness in job success/failure.

jcastill commented 1 week ago

I cannot reproduce with 24.10 or Fedora 41 either. @pmoravec what OS did you reproduce this on? And what do you get with 'journalctl --disk-usage'? And the problem here is that if your journal was small already, it should have attempted to make it bigger via:

https://github.com/sosreport/sos/blob/main/tests/report_tests/plugin_tests/logs.py#L62

        sosfd = journal.stream('sos-testing')
        rsize = 10 * 1048576
        for _ in range(2):
            # generate 10MB, write it, then write it in reverse.
            # Spend less time generating new strings
            rand = ''.join(
                random.choice(ascii_uppercase + digits) for _ in range(rsize))
            sosfd.write(rand + '\n')
            # sleep to avoid burst rate-limiting
            sleep(5)
            sosfd.write(rand[::-1] + '\n')
pmoravec commented 1 week ago

I see where the problem is but dont have an idea how to reliably reproduce. Since after rebooting my system, the test succeeds - after it generated journal logs (which didnt happen before the reboot!). The problem is/was here:

# journalctl --no-pager | wc
  34466  339490 4699809
# journalctl --disk-usage
Archived and active journals take up 24.1M in the file system.
# 

The later is tested by sos:

    from systemd import journal  # pylint: disable=import-error
    _reader = journal.Reader()
    _size = _reader.get_usage() / 1024 / 1024
    if _size > 20:
        return

As size was >20MB, we did not generate the extra logs. While collected journalctl --no-pager was shorter.

But how that is/was possible..?

pmoravec commented 1 week ago

I dont know why or how, but after playing with journalctl --vacuum-size=10M and similar just to purge some logs and rebooting the RHEL9 system (no persistent journal storage), I do see now:

# journalctl --disk-usage ; journalctl | wc
Archived and active journals take up 8.0M in the file system.
   1966   23768  288005
# 

The journal logs contain a regular boot logs and almost nothing else. I think something similar happens in the CI tests..

pmoravec commented 1 week ago

While I am unable to have a reliable reproducer, it seems there are ways of having journal.Reader().get_usage() over 20MB but the journal logs being smaller. Is it worth replacing the if _size > 20: test by calling journalctl --no-pager | wc -c-like test?

jcastill commented 1 week ago

I think that may work. The problem here may be that Reader() gets the size of all journal files (maybe even rotated ones?) and I imagine it contains some metadata, and perhaps pre-allocated space for the journal, while 'journalctl' I imagine provides just text, formatted for human consumption. But I'll need to investigate this more to see if that's the case or not.