avtikhon closed this issue 3 years ago.
@avtikhon asked me a suggestion how to better track tarantool processes in test-run, so I created a simple sampling infrastructure for test-run as example: 6e73ac470322c458eed08f58f74fad1fdf8c9d99.
Not tested thoroughly, but at least I looked over logs for different types of tests and it seems that it tracks everything that I would expect.
Sadly, the register-unregister logs are not written to test/var/log/
due to #247, but we can print them to the terminal for debugging purposes this way:
```diff
diff --git a/lib/sampler.py b/lib/sampler.py
index 8bcc6ca..95dcec0 100644
--- a/lib/sampler.py
+++ b/lib/sampler.py
@@ -2,7 +2,7 @@ import os
 import sys
 import time
-from lib.colorer import color_log
+from lib.colorer import color_stdout as color_log
 from lib.colorer import qa_notice
 from lib.utils import format_process
```
Using the 'WorkerCurrentTask' queue, it saves the initial RSS value of the worker in use when the task starts to run.
It is the result queue message, not the queue name.
Right, corrected all the similar places in the commit messages.
I'll describe my complaints briefly to unblock you. Sorry for the non-detailed comments.
First of all, the commit message does not correspond to the implementation of ResourcesWatcher: at least it does not look at WorkerCurrentTask.
Next, I don't like how the code is organized. Sorry, but it is either not logical or can be organized better:
If we eliminate the ResourceWatcher, the collected resource consumption statistics will be accessible anyway: we can hold them in the sampler. The test duration statistics may be held in the StatisticsWatcher.
Some sampling code and fields are in SamplerWatcher. It should not be so. The responsibility area of SamplerWatcher is to integrate Sampler into our event loop (which spins around select()): all those process_result() / process_timeout() calls and tracking of the last sampler wakeup time. Everything else should be performed by the sampler.
There is the _sample_process() method, which is the appropriate place to read /proc/<pid>/status. It is called for each process from _sample(), which is the appropriate place to accumulate, transform and save metrics. The result of a _sample() call is an update of a sampler field with accumulated statistics in a form useful for reporting (say, self._rss_summary with per-test maximums). There is no need to track an alive test list with corresponding processes: all the information is already available from self.process.
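A minimal sketch of this layout (attribute names beyond those mentioned above are assumptions, not test-run's actual code):

```python
class Sampler:
    """Sketch of the suggested sampler layout. Only _sample_process(),
    _sample() and _rss_summary come from the review; the rest of the
    names are assumptions for illustration."""

    def __init__(self):
        # {test_name: maximum RSS in KiB seen across the test's processes}
        self._rss_summary = {}
        # {pid: test_name}, filled by the process register/unregister hooks
        self.processes = {}

    @property
    def rss_summary(self):
        return self._rss_summary

    def _sample_process(self, pid):
        # Read VmRSS (reported in kB) from /proc/<pid>/status; the process
        # may exit between sampling ticks, hence the guard.
        try:
            with open('/proc/{}/status'.format(pid)) as f:
                for line in f:
                    if line.startswith('VmRSS:'):
                        return int(line.split()[1])
        except (OSError, IOError):
            pass
        return 0

    def _sample(self):
        # Sum RSS over each test's alive processes and keep the
        # per-test maximum in a form useful for reporting.
        per_test = {}
        for pid, test in self.processes.items():
            per_test[test] = per_test.get(test, 0) + self._sample_process(pid)
        for test, rss in per_test.items():
            if rss > self._rss_summary.get(test, 0):
                self._rss_summary[test] = rss
```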
It is undesirable to depend on test-run internals in sampler.py, so I guess some external procedure (StatisticsWatcher.print_statistics()?) should call color_stdout() and write to the appropriate log file. Sampler should just provide all the information in a convenient form (say, Sampler.rss_summary()).
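For instance, the reporting side could be a small helper that only consumes the data the sampler exposes; the function name and layout here are hypothetical:

```python
def format_rss_statistics(rss_summary, top_n=10):
    # rss_summary is {test_name: max RSS in KiB}, as provided by the
    # sampler. Formatting lives outside sampler.py, so the sampler stays
    # free of test-run output internals; the caller decides where to
    # print or log the result.
    lines = ['Up to {} most RSS used tasks in Mb:'.format(top_n)]
    top = sorted(rss_summary.items(), key=lambda kv: kv[1], reverse=True)
    for test, rss_kb in top[:top_n]:
        lines.append('* {:6.1f} {}'.format(rss_kb / 1024.0, test))
    return '\n'.join(lines)
```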
self.started_time = 0.0
Not sure it is a good idea. If this value can ever be referenced before it is set, the computed duration will be ~1.6 * 10^9 seconds (time since the epoch).
How does it work with reruns? Please clarify.
I would use the term 'duration' rather than 'timing'.
timing = round(time.time() - self.started_time, 2)
I would round it at printing, not when it is collected.
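Both remarks (the 0.0 sentinel and rounding at collection time) could be addressed along these lines; the class and attribute names are illustrative only:

```python
import time


class Task:
    def __init__(self):
        # None makes an unset start time an explicit error instead of
        # silently yielding a ~1.6e9-second duration (time since the epoch).
        self.started_time = None
        self.duration = None

    def start(self):
        self.started_time = time.time()

    def stop(self):
        assert self.started_time is not None, 'stop() before start()'
        # Store the raw duration; round only when printing.
        self.duration = time.time() - self.started_time


task = Task()
task.start()
task.stop()
print('* {:.2f} some/test.lua'.format(task.duration))
```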
AFAIR, we agreed to mark long tests (long_run in suite.ini) in the statistics output.
> Some sampling code and fields are in SamplerWatcher. It should not be so. The responsibility area of SamplerWatcher is to integrate Sampler into our event loop (which spins around select()): all those process_result() / process_timeout(), tracking of a last sampler wakeup time. Everything else should be performed by sampler.
My bad, I misread the diff and thought that collect_rss() is in SamplerWatcher. Sorry.
However, there is an (unused?) new field self.rss_procs_results in SamplerWatcher. And Sampler.collect_rss() should be integrated into _sample_process() / _sample() (see also the 3rd point above).
2/3 Add RSS statistics collecting
...
3/3 Add tests timings collecting
...
Common
...
Alexander, thanks a lot for the deep explanation and offline help. I've applied all of your suggestions. Please check the updated version of the patchset. Currently the results output looks like the following example:
...
```
======================================================================================
WORKR TEST                                            PARAMS          RESULT
---------------------------------------------------------------------------------
[001] box/huge_field_map.test.lua                                     [ pass ]
[003] box/huge_field_map_long.test.lua                                [ pass ]
---------------------------------------------------------------------------------
Up to 10 most RSS used tasks in Mb:
*  77.4 box/huge_field_map_long.test.lua (long)
*  66.9 box/huge_field_map.test.lua
-----------------------------------------
Up to 10 most long tasks in seconds:
*  1.08 box/huge_field_map_long.test.lua (long)
*  0.03 box/huge_field_map.test.lua
-----------------------------------------
Statistics:
* pass: 2
```
The patchset is in good shape, but I would like to iterate a bit more on code readability. It is better to keep it good from scratch rather than fix it afterwards. I left my comments above.
Pushed several fixups (mostly around naming and wording). Please apply them if you don't object. The fixup commits are arranged to ease squashing.
All patches LGTM; squashed them, thank you.
Updated the test-run submodule in tarantool in the following commits: 2.9.0-35-g213f480e7, 2.8.1-19-g67ff7cc5f, 2.7.2-15-gc35655113, 1.10.10-8-gfb804dcca.
Patch set consists of 3 commits:
Track tarantool and unit test executables that are run using test-run with metainformation: worker, test, test configuration and server name.
Add a function that will be called every 0.1 second for each tracked process.
The implementation tracks non-default servers and re-registers default servers that execute several tests (the 'core = tarantool' case).
Part of #277
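The tracking described above could be sketched roughly as follows; the method names and the metainformation layout here are assumptions for illustration, not necessarily test-run's actual API:

```python
class Sampler:
    """Sketch of process tracking with metainformation (hypothetical)."""

    def __init__(self):
        # pid -> metainformation about the tracked executable
        self.processes = {}

    def register_process(self, pid, worker, test, conf=None, server=None):
        # Re-registering an existing pid updates its metainformation,
        # which covers default servers that execute several tests.
        self.processes[pid] = {
            'worker': worker,
            'test': test,
            'conf': conf,
            'server': server,
        }

    def unregister_process(self, pid):
        # Tolerate double unregistration: the process may already be gone.
        self.processes.pop(pid, None)
```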
Found that some tests may fail due to lack of memory. Mostly it happens in CI on remote hosts. To be able to collect memory usage statistics, decided to add an RSS memory status collecting routine, get_proc_stat_rss(), which parses the /proc/<pid>/status files
for the RSS value 'VmRSS', which is the size of the resident memory portions. It consists of the three following parts (VmRSS = RssAnon + RssFile + RssShmem).
Decided that the best way for CI is not to run this RSS collecting routine on each command sent from test tasks, but to run it with a 0.1 second delay after the test task has started, to collect the maximum RSS value reached during the task run. This delay is used to run routines in the 'SamplerWatcher' listener. Also found that a delay of 0.1 sec is completely enough to catch the RSS usage increase, due to a tested check
which checked that 100 Mb of data is allocated within seconds:
The main idea is to check all processes the test depends on at some point in time and record the maximum RSS value they reach. For this, the '_sample_process()' routine gets the RSS of each currently alive process, and the '_sample()' routine sums the RSS of each task's alive processes and checks whether this value is bigger than the one previously saved for the current task. Both routines are in the 'Sampler' class, which is driven by the 'process_timeout()' routine from the 'SamplerWatcher' listener.
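The process_timeout()-driven scheduling can be modeled with a simplified sketch (not the actual SamplerWatcher code; the event loop is assumed to call process_timeout() on every wakeup):

```python
import time


class SamplerWatcher:
    """Simplified model: only trigger a sample when at least
    SAMPLE_INTERVAL seconds have passed since the last one."""

    SAMPLE_INTERVAL = 0.1

    def __init__(self, sampler):
        self._sampler = sampler
        self._last_sample = 0.0  # wall-clock time of the last sample

    def process_timeout(self):
        now = time.time()
        if now - self._last_sample >= self.SAMPLE_INTERVAL:
            self._sampler._sample()
            self._last_sample = now
```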
Also added the print_statistics() routine to the 'StatisticsWatcher' listener, which prints statistics to stdout after testing. It is used to print the RSS usage for failed tasks and for up to 10 tasks that used the most RSS. Created a new subdirectory 'statistics' in the 'vardir' path to save statistics files. The current patch uses it to save the 'rss.log' file there, with RSS values per tested task in the format:
Closes #277
Decided to collect test run durations in a standalone file and print them to stdout after testing has finished. Printed to stdout are the durations for failed tasks and for up to 10 longest tasks.
For duration collecting, the 'StatisticsWatcher' listener is used, which has the following routines:
Closes #286