linux-test-project / ltp

Linux Test Project (mailing list: https://lists.linux.it/listinfo/ltp)
https://linux-test-project.readthedocs.io/
GNU General Public License v2.0

memcontrol02 creates false positives #932

Open paulgortmaker opened 2 years ago

paulgortmaker commented 2 years ago

In reference to this recently added test:

commit 3d4ce5ad75bb74f73dc73031c27e9e0997718703
Author: Richard Palethorpe <rpalethorpe@suse.com>
Date:   Tue Dec 14 07:46:14 2021 +0000

cgroup: Add memcontrol02

I have the following results from a test run (in order: ext2, ext3, ext4, vfat, tmpfs):

memcontrol02.c:116: TPASS: Expect: (memory.current=65011712) ~= (memory.stat.file=54067200)
memcontrol02.c:116: TPASS: Expect: (memory.current=65011712) ~= (memory.stat.file=54067200)
memcontrol02.c:116: TPASS: Expect: (memory.current=62914560) ~= (memory.stat.file=51904512)
memcontrol02.c:116: TFAIL: Expect: (memory.current=69206016) ~= (memory.stat.file=51904512)
memcontrol02.c:116: TPASS: Expect: (memory.current=54525952) ~= (memory.stat.file=51904512)

I ended up looking at the test due to this failure and found that somewhat arbitrary/empirical values are used for file_to_all_error - 10% for some filesystems and 50% for others.
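For context, the closeness check the test prints appears to boil down to a percentage tolerance; a minimal sketch of that kind of comparison (illustrative only - the helper name and parameter are my shorthand, not necessarily the exact memcontrol02 code) would be:

```c
#include <stdlib.h>

/* Sketch of a percentage-tolerance comparison: treat two values as
 * "close" when their difference is within err_percent of their sum.
 * Names are illustrative, not copied from memcontrol02.c. */
static int values_close(long a, long b, int err_percent)
{
	return labs(a - b) <= (a + b) / 100 * err_percent;
}

/* e.g. values_close(memory_current, memory_stat_file, file_to_all_error) */
```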

I suspect the chosen limits probably aren't sufficient to cover the case where various debugging/trace options are enabled. There are no filesystem or cgroup changes in this instance - it is a generic x86-64 kernel. But we do run a lot of debug/test config settings in conjunction with our LTP testing.

I am not sure how the memory accounting can be made more accurate, but at least with a report here others can find it and add to it.

richiejp commented 2 years ago

Yes, perhaps you have some slub_debug options enabled or KASAN?

In any case, IMO, this should be changed to an inequality by default, i.e. memory.current > memory.stat.file and perhaps memory.current < total memory. We need solid bounds so that a test failure can be reported to the kernel devs without first having to trace the memory allocations (for reference, see the linked commit message) to confirm it is an accounting issue.
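A rough sketch of what I mean (hypothetical helper; total memory would come from e.g. MemTotal):

```c
/* Sketch of the proposed bounds check: only fail when the accounting is
 * clearly impossible, not when the per-FS overhead drifts past an
 * empirical tolerance. Hypothetical helper, not existing LTP code. */
static int accounting_within_bounds(long current, long stat_file, long total_mem)
{
	return current >= stat_file && current <= total_mem;
}
```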

paulgortmaker commented 2 years ago

Sorry for the delayed reply - I haven't logged into GitHub for a while. I'm pretty sure our test team isn't unique in enabling a bunch of CONFIG_DEBUG options to cover as much ground as possible without doing multiple build+boot+test sequences. So yes, you can assume this isn't a production .config with all debug options disabled.

Maybe if /proc/config.gz is available then various tests could scale values based on some relevant/related settings in it, vs. just picking a quasi-arbitrary value?
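Something along these lines, purely as a sketch - reading the config via zcat and checking for CONFIG_KASAN are just placeholders:

```c
#include <stdio.h>
#include <string.h>

/* Sketch only: widen the tolerance when a debug option such as
 * CONFIG_KASAN is enabled. Reads /proc/config.gz via zcat; this is
 * not an existing LTP helper. */
static int kernel_config_has(const char *opt)
{
	char line[256];
	int found = 0;
	FILE *f = popen("zcat /proc/config.gz 2>/dev/null", "r");

	if (!f)
		return 0;

	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, opt, strlen(opt))) {
			found = 1;
			break;
		}
	}

	pclose(f);
	return found;
}

/* e.g. int err_percent = kernel_config_has("CONFIG_KASAN=y") ? 50 : 10; */
```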

richiejp commented 2 years ago

> I'm pretty sure our test team isn't unique in enabling a bunch of CONFIG_DEBUG options

We also do it, but only on a subset of systems and tests. Some SUTs won't boot in a reasonable time if we enable debugging. Many tests also fail and there is a backlog for fixing them.

> Maybe if /proc/config.gz is available then various tests could scale values based on some relevant/related settings in it, vs. just picking a quasi-arbitrary value?

To be frank, it's too complex for LTP: there are too many variables (arch, FS, kernel version, etc.) and they interact in non-linear ways. When testing things like MM, which is not specified by the ABI and is probabilistic rather than deterministic, we always run into the same issues. We basically need to model the "ideal" behavior of the system and decide how close reality should be to the model. We don't have any infrastructure in LTP to handle this. It is probably better done in performance testing frameworks (e.g. https://github.com/gormanm/mmtests, which has found accounting issues).

Otherwise we will just end up playing whack-a-mole as we find new variables that break our assumptions. OTOH I think we can find some extreme lower and upper bounds and check against those. These tests were copied from the kernel selftests, which in the future I would modify to be far more relaxed.

paulgortmaker commented 2 years ago

Thanks again for the reply. As I said originally, I realize these kinds of tests are hard to lock down into a PASS/FAIL categorization, compared to something like a test for a locking failure. And I can see config.gz parsing being too complicated.

The only other thing I can think of for tests like this, with a grey area between pass and fail, is to introduce a "WARN" result - e.g. <10% = PASS; 10%-15% = WARN; >15% = FAIL. There would be a couple of advantages here. If kernel bloat starts to push something out of the PASS range, we'll hopefully get visibility into that months in advance rather than a failure out of the blue when the straw breaks the camel's back. The other advantage is that it would be more self-evident that the test was established on thresholds that were reasonable at the time for a "typical" config.
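As a rough sketch of the idea (the thresholds and helper are placeholders; TPASS/TWARN/TFAIL are the existing LTP result types):

```c
#include <stdlib.h>
#include "tst_test.h"

/* Placeholder sketch of a three-band result around the closeness check. */
static void report_closeness(long current, long file)
{
	/* relative error as a percentage of the average of the two values */
	long err = labs(current - file) * 100 / ((current + file) / 2);

	if (err < 10)
		tst_res(TPASS, "within 10%%: err=%ld%%", err);
	else if (err <= 15)
		tst_res(TWARN, "in the grey area: err=%ld%%", err);
	else
		tst_res(TFAIL, "more than 15%% off: err=%ld%%", err);
}
```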

I believe both our testers and I assumed this test was an intermittent hard failure and had no idea it was more advisory in nature until I went and read the test. I am by no means an LTP expert, so apologies if this exists already, but it would be good if there was a list of tests whose results are not clear-cut PASS/FAIL, as opposed to, say, a test checking that 2+2 returns 4.

richiejp commented 2 years ago

Unfortunately many tests are like this, and it's not clear whether a fail is serious without a thorough investigation. There is a WARN result code already, but it's still open to interpretation. I suppose we could add some test meta-data to indicate that a test is 'advisory' or empirical in nature, then print some info on failures, which would speed up or help prioritise investigations.
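For example, something along these lines, reusing the existing .tags mechanism (the "advisory" key is hypothetical, it doesn't exist today):

```c
#include "tst_test.h"

static void run(void)
{
	/* ... the actual checks would go here ... */
	tst_res(TPASS, "placeholder");
}

static struct tst_test test = {
	.test_all = run,
	.tags = (const struct tst_tag[]) {
		/* hypothetical tag marking the result as advisory/empirical */
		{"advisory", "thresholds are empirical and config dependent"},
		{}
	},
};
```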

On a side note: I think the proper solution here is to parameterize the tests, so that the tester can check for regressions against past data and/or a model. Essentially the same as performance testing. The problem is that if we simply add some parameters to the test, then no one will know how to use them. We try to avoid parameters like the plague because they make everything require an expert. The test needs to provide some meta-data that can be consumed by a testing framework and combined with past data and a model to create meaningful expectations. (@metan-ucw)