Open arif-ali opened 1 week ago
I can reproduce it on upstream sos
:
# PYTHONPATH=tests/ avocado run -p TESTLOCAL=true --test-runner=runner -t stagetwo tests/report_tests/plugin_tests/logs.py
JOB ID : a6e312d650f35bed6cc6bb6110c893c60c92b956
JOB LOG : /root/avocado/job-results/job-2024-11-26T11.19-a6e312d/job.log
(1/3) tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_journal_size_limit: PASS (5.63 s)
(2/3) tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_journal_tailed_and_linked: FAIL: /var/tmp/avocado_af3k3wvgsos_tests.py.JournalSizeLimitTest/sosreport-JournalSizeLimitTest/sosreport-pmoravec-rhel9-2024-11-26-ckscbzp/sos_strings/logs/journalctl_--no-pager.tailed does not exist (0.02 s)
(3/3) tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_archive_created: PASS (0.01 s)
RESULTS : PASS 2 | ERROR 0 | FAIL 1 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB TIME : 6.24 s
#
I will try to dig into it..
I think the problem may be that the log is not always big enough to be tailed. I ran the tests in a RHEL9 machine and it succeeded, with a tailed file of 20M:
$ ll sos_strings/logs/
total 40960
-rw-r--r--. 1 root root 20969983 Nov 26 11:14 journalctl_--no-pager_--boot.tailed
-rw-r--r--. 1 root root 20969983 Nov 26 11:14 journalctl_--no-pager.tailed
But in the logs that @arif-ali pasted, the files seem to be smaller than that. @pmoravec how big were the files on your side?
Maybe the runs are too quick on GCE that we're not catching it any more. I've run the same stagetwo
tests locally 6 times on a fresh VM every single time, and I can't reproduce the problem either.
So, maybe we need some kind of wait for this particular test for the test to be successful?
In my test, I am doing some automation, and creating a fresh new VM every time. So, 6 fresh VMs
Indeed, in my case the journal is small:
# journalctl --no-pager | wc
34466 339490 4699809
# journalctl --disk-usage
Archived and active journals take up 24.1M in the file system.
#
(it's RHEL9)
Also the different size of journal indirectly explains the randomness in job success/failure.
I cannot reproduce with 24.10 or Fedora 41 either. @pmoravec what OS did you reproduce this on? And what do you get with 'journalctl --disk-usage'? And the problem here is that if your journal was small already, it should have attempted to make it bigger via:
https://github.com/sosreport/sos/blob/main/tests/report_tests/plugin_tests/logs.py#L62
sosfd = journal.stream('sos-testing')
rsize = 10 * 1048576
for _ in range(2):
# generate 10MB, write it, then write it in reverse.
# Spend less time generating new strings
rand = ''.join(
random.choice(ascii_uppercase + digits) for _ in range(rsize))
sosfd.write(rand + '\n')
# sleep to avoid burst rate-limiting
sleep(5)
sosfd.write(rand[::-1] + '\n')
I see where the problem is but dont have an idea how to reliably reproduce. Since after rebooting my system, the test succeeds - after it generated journal logs (which didnt happen before the reboot!). The problem is/was here:
# journalctl --no-pager | wc
34466 339490 4699809
# journalctl --disk-usage
Archived and active journals take up 24.1M in the file system.
#
The later is tested by sos
:
from systemd import journal # pylint: disable=import-error
_reader = journal.Reader()
_size = _reader.get_usage() / 1024 / 1024
if _size > 20:
return
As size was >20MB, we did not generate the extra logs. While collected journalctl --no-pager
was shorter.
But how that is/was possible..?
I dont know why or how, but after playing with journalctl --vacuum-size=10M
and similar just to purge some logs and rebooting the RHEL9 system (no persistent journal storage), I do see now:
# journalctl --disk-usage ; journalctl | wc
Archived and active journals take up 8.0M in the file system.
1966 23768 288005
#
The journal logs contain a regular boot logs and almost nothing else. I think something similar happens in the CI tests..
While I am unable to have a reliable reproducer, it seems there are ways of having journal.Reader().get_usage()
over 20MB but the journal logs being smaller. Is it worth replacing the if _size > 20:
test by calling journalctl --no-pager | wc -c
-like test?
I think that may work. The problem here may be that Reader() gets the size of all journal files (maybe even rotated ones?) and I imagine it contains some metadata, and perhaps pre-allocated space for the journal, while 'journalctl' I imagine provides just text, formatted for human consumption. But I'll need to investigate this more to see if that's the case or not.
There's been a few occurances now, and have found that the test below failing, and on restart of the job, it works as normal
debug.log from avocodo
The contents of the logs folder
We can see that the file never got created, and hence the test failed. Needs further investigation on what is going on here