I have run some profiling on the Spf C++ server example (see attachment above) and the first results show that a lot of calls are made and a lot of time is spent iterating over attributes. The main culprit seems to be the function Tango::MultiAttribute::get_attr_ind_by_name(char const*). There are also many calls to tolower() (the 3rd most called function!). Most of the time is spent in libzmq, which I do not understand since this example does not trigger events as far as I can see and atkpanel is reading via polling. Tests were run on a laptop with Ubuntu 17.10 and 1 client (atkpanel) connected to the device. Polling was not configured.
Output from callgrind_annotate attached. Spf_callgrind.log
After playing around a bit with valgrind's callgrind I think I found proof that the CPU bottleneck is in get_attr_by_name(), as seen in this output from kcachegrind:
To get this output you need to run the device server as follows:
valgrind --tool=callgrind --callgrind-out-file=callgrind-cpu.out bin/Spf test
and look at the output using kcachegrind:
kcachegrind callgrind-cpu.out
I tried Google's perftools as well, but they were not so useful because of the multithreading of the device server.
Thank you Andy for the time spent on this issue. What you are writing matches what I was writing on the kernel mailing list. I think we should rewrite this get_attr_by_name() method and use a hashmap instead of a vector. This would have the advantage of drastically reducing the number of calls to the tolower() method. Of course this will have some consequences in several locations of the code, but it should greatly improve performance in the case where there are many attributes.
The main problem will be to maintain compatibility, since many MultiAttribute public methods rely on an attribute list which is supposed to be a vector<Attribute *>:
Aha. Are these methods public? I presume so. Maybe we can deprecate them? At least internally we need a way of finding an attribute which does not depend on the number of attributes. Another function which is overly used is tolower() - it is called 320 times more often than get_attr_by_name(), which means it is typically called millions of times. I wonder if it is not being called on the same strings repeatedly?
MultiAttribute::get_attr_by_name uses std::find_if and WantedAttr to find the attribute. And WantedAttr calls std::tolower on the attribute name to be searched not just once, but once for each attribute the server has. That can obviously be optimized.
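To illustrate the point, here is a minimal sketch (simplified, not the actual cppTango code) of the kind of case-insensitive search described above, with the searched name lower-cased once outside the per-attribute comparison. Attribute, to_lower_copy and get_attr_by_name here are illustrative stand-ins for the real classes:

```cpp
// Minimal sketch, NOT the actual cppTango implementation.
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

struct Attribute { std::string name; };

static std::string to_lower_copy(const std::string &s)
{
    std::string out(s);
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return out;
}

Attribute *get_attr_by_name(std::vector<Attribute> &attr_list, const std::string &wanted)
{
    // Lower-case the searched name once, instead of once per attribute.
    const std::string wanted_lower = to_lower_copy(wanted);

    auto it = std::find_if(attr_list.begin(), attr_list.end(),
                           [&](const Attribute &a)
                           { return to_lower_copy(a.name) == wanted_lower; });
    return it != attr_list.end() ? &(*it) : nullptr;
}
```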
Yes, you are right @andygotz and @t-b. We already identified that yesterday with @taurel. Calling tolower() only once before calling WantedAttr already improves the situation and drastically reduces the number of calls to tolower().

I have done some tests with a device server having 200 attributes named similarly to the Spf device server given by SKA (which is actually one of the worst cases, because the attribute names share the same prefix and almost all have the same length). The attributes were polled every 2 seconds. I used callgrind over 1 minute. With the current version, tolower() was called over 6 million times; in a modified version where tolower() is called once in get_attr_by_name() and no longer in WantedAttr, tolower() was called about 224 000 times. In the first case, the inclusive cost for get_attr_by_name() was 81.19%, whereas it was 71.53% in the improved version.

I would personally be in favour of using an unordered_map (C++11) when trying to get the reference of an attribute by name (this is done in many locations of the code, in the push_xxx_event() methods for instance, leading to the issue reported by @gscalamera on #244). We could still keep the vector for the moment, since many users are probably using the get_attribute_list(), get_attr_ind_by_name(), ... public methods (I personally have already used these methods in some device servers). Of course this will increase memory usage, but these are collections storing pointers to Attribute objects, so it is not "too big", and it should significantly improve performance when there are many attributes.
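As an illustration of what such a change could look like, here is a hypothetical sketch keeping the vector<Attribute *> for the existing public methods while adding an unordered_map keyed by the lower-cased name for fast lookups. It reuses the Attribute struct and to_lower_copy helper from the sketch above; MultiAttributeSketch and its members are illustrative names, not the actual cppTango implementation:

```cpp
// Hypothetical sketch: vector kept for backward compatibility, map added for O(1) lookup.
#include <string>
#include <unordered_map>
#include <vector>

class MultiAttributeSketch
{
public:
    void add_attribute(Attribute *att)
    {
        attr_list.push_back(att);
        attr_map[to_lower_copy(att->name)] = att;   // index by lower-cased name
    }

    // O(1) on average, independent of the number of attributes
    Attribute *get_attr_by_name(const std::string &name)
    {
        auto it = attr_map.find(to_lower_copy(name));
        return it != attr_map.end() ? it->second : nullptr;
    }

    // Existing-style accessor kept for backward compatibility
    std::vector<Attribute *> &get_attribute_list() { return attr_list; }

private:
    std::vector<Attribute *> attr_list;                       // original storage
    std::unordered_map<std::string, Attribute *> attr_map;    // fast name lookup
};
```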
I've done some tests using an unordered_map and, for the same use case as in my previous comment (200 attributes polled at 2s during one minute), the inclusive cost of get_attr_by_name() goes from 81.19% to 8.10%, so I think it is definitely worth using maps. To be continued next week... ;-)
Excellent! I look forward to testing the new version. It raises the question of whether strings are the only way to identify attributes. Maybe we should offer an attribute ID which can be recovered once and used by the server to replace get_attr_by_name() in most places. Just a thought.
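For what it's worth, something along these lines is already possible from device server code with the existing MultiAttribute API, if I remember the signatures correctly (get_attr_ind_by_name() / get_attr_by_ind()). This is only a fragment: MyDevice is a hypothetical Tango::DeviceImpl subclass, and temperature_idx (long) and temperature_value (Tango::DevDouble) are assumed class members:

```cpp
// Fragment, assumed to live inside a hypothetical Tango::DeviceImpl subclass.
#include <tango.h>

void MyDevice::init_device()
{
    // ... usual initialisation ...

    // Resolve the attribute index once, at initialisation time.
    temperature_idx = get_device_attr()->get_attr_ind_by_name("Temperature");
}

void MyDevice::update_temperature()
{
    // Later accesses reuse the cached index: no per-call string search.
    Tango::Attribute &att = get_device_attr()->get_attr_by_ind(temperature_idx);
    att.set_value(&temperature_value);
    att.fire_change_event();
}
```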
With an ordered map (C++ standard map), I get an inclusive cost of 11.66% for get_attr_by_name() in the same conditions as described in my previous comments.
@bourtemb Are you already requiring C++11 compiler support to compile tango9?
@t-b Excellent question! For Tango 10, I would reply definitely yes... For Tango 9, I think we need to discuss that a bit further with the community.

Before the move to github and cmake, there were already some parts of C++11 code in Tango 9, but they were always surrounded by ifdefs and compiled only when using a recent version of gcc supporting the C++11 feature used in that code section. There was always an alternative section in the code which did not use the C++11 features, in order to support older gcc versions. We could continue that way if the community requires it (we may need to fix some parts of the code introduced after the move to github/cmake, mainly in the tests). Of course, this is more maintenance work for the kernel code, but it could also give users more time to migrate their software and recompile it with more recent versions of gcc. I think we need input from the community on this point, and we need to define clearly which compiler versions should be supported by Tango 9 LTS.

I just did a test with an older compiler (gcc 4.4.5, coming with Debian Squeeze (6)), and it appears that, with the way the CMake files are currently implemented in cppTango, the -std=c++11 option is always passed to gcc, which currently makes it impossible to compile with older compilers.
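For readers unfamiliar with the guard pattern mentioned above, it looked roughly like this (the feature macro name here is purely illustrative, not the one actually used in Tango 9):

```cpp
// Purely illustrative sketch of an ifdef-guarded C++11 section with a pre-C++11 fallback.
#include <string>
// (plus the usual Tango headers for Tango::Attribute)

#ifdef HAS_CXX11_UNORDERED_MAP
    #include <unordered_map>
    typedef std::unordered_map<std::string, Tango::Attribute *> AttrMap;
#else
    #include <map>
    typedef std::map<std::string, Tango::Attribute *> AttrMap;
#endif
```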
I'd like to add that it is currently possible to compile cppTango on Debian Wheezy (7), which was released almost 5 years ago (May 2013). This requires manually installing a more recent version of cmake, as we are doing in our Travis CI tests.
You can find a modified version using a map or unordered_map on my fork:
Feel free to use these 2 versions in your tests and give us some feedback. The version using unordered_map should give us better performance. The version using map could bring some advantages in the future if we want to iterate through attribute names in alphabetical order.
Excellent - thank you very much for this quick fix! I hope to try it asap.
Once I was on the patched branch I really noticed the difference - well done @bourtemb! No more calls to tolower(), and get_attr_by_name() is now 0.19% of the inclusive execution time (in debug mode). The main function, taking 10% of the CPU, is still get_attr_ind_by_name(), because it loops through all the attributes (see screenshot). It is called mainly from read_attributes_from_cache() and read_attributes_no_except(). How can this be optimised? This patch should go a long way towards helping SKA run Tango on the Spf controller (see description at start of thread).
I just had an idea of how to optimise get_attr_ind_by_name() to be independent of the number of attributes: store the index as part of the unordered_map and use the map to find the index. This would avoid looping over the attributes and comparing strings. Do you think it is worth trying?
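A rough variant of the earlier sketch showing the idea (again with illustrative names, reusing the Attribute struct and to_lower_copy helper from above): the map value becomes the index into the vector, so both get_attr_ind_by_name() and get_attr_by_name() avoid the linear scan.

```cpp
// Illustrative sketch only: map from lower-cased name to vector index.
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

class MultiAttributeIndexedSketch
{
public:
    void add_attribute(Attribute *att)
    {
        attr_list.push_back(att);
        attr_map[to_lower_copy(att->name)] = attr_list.size() - 1;  // name -> index
    }

    long get_attr_ind_by_name(const std::string &name) const
    {
        auto it = attr_map.find(to_lower_copy(name));
        return it != attr_map.end() ? static_cast<long>(it->second) : -1;
    }

    Attribute *get_attr_by_name(const std::string &name)
    {
        long ind = get_attr_ind_by_name(name);
        return ind >= 0 ? attr_list[ind] : nullptr;
    }

private:
    std::vector<Attribute *> attr_list;
    std::unordered_map<std::string, std::size_t> attr_map;   // lower-cased name -> index
};
```

One caveat with this approach: the stored indices would have to be refreshed whenever attributes are removed, since the positions in the vector shift.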
Thanks everyone for looking into this… Do you think, @andygotz, @bourtemb, that having guidelines for objects with massive lists of attributes (i.e., don't start them with the same name, try to make them different lengths) would also help, or would it be irrelevant with this patch?
Also, it would probably be good to have some testing with a similarly pathological long list of same-length, very similar attribute names, and another one with simply a very long list of attributes, so that you can do regression testing of the issue.
Do you think, @andygotz, @bourtemb, that having guidelines for objects with massive lists of attributes (i.e., don't start them with the same name, try to make them different lengths) would also help, or would it be irrelevant with this patch?
I don't think we need such guidelines. I think the patch should solve this performance issue. It would be great if you could try it out in your case. You can take a look at the following branches on my fork:
Also, it would probably be good to have some testing with a similarly pathological long list of same-length, very similar attribute names, and another one with simply a very long list of attributes, so that you can do regression testing of the issue.
Sure. Ideally we should add a test for each new PR. Of course, this is also very time consuming. We already have a device server allowing us to reproduce the use case and to create as many (dynamic) attributes with similar names (or not) as we want: https://gitlab.esrf.fr/bourtemb/DynastarDS
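For reference, that kind of test device could look roughly like the sketch below. This is a hedged sketch, not the actual DynastarDS code: StressAttr, create_stress_attributes and the attribute naming scheme are made up for illustration; only the Tango calls (Attr, add_attribute, set_value) are from the standard API.

```cpp
// Hedged sketch: creating many dynamic attributes with deliberately similar names,
// useful for reproducing the pathological lookup case discussed in this thread.
#include <tango.h>
#include <iomanip>
#include <sstream>

class StressAttr : public Tango::Attr
{
public:
    StressAttr(const std::string &name)
        : Tango::Attr(name.c_str(), Tango::DEV_DOUBLE, Tango::READ) {}

    void read(Tango::DeviceImpl *dev, Tango::Attribute &att) override
    {
        static Tango::DevDouble value = 0.0;   // dummy value, same for all attributes
        att.set_value(&value);
    }
};

// Typically called from init_device() of the test device
void create_stress_attributes(Tango::DeviceImpl *dev, int count)
{
    for (int i = 0; i < count; ++i)
    {
        std::ostringstream name;
        // Same prefix and same length for every attribute: the worst case for
        // a linear, case-insensitive name search.
        name << "spfAttributeNumber" << std::setw(4) << std::setfill('0') << i;
        dev->add_attribute(new StressAttr(name.str()));
    }
}
```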
This problem should be solved in Tango 9 LTS with PR #430
It would be useful to release a new binary to bintray-debian, aka 9.2.7, wouldn't it?
Sure! I actually don't know yet how to do that, but I would be happy to learn. I think we should bump the library version number to 9.2.7 first; it looks like this was forgotten. It would also be good to add some release notes, as you did on the master branch, at least describing what has changed since 9.2.5a. There are still some issues I'd like to solve before having an official release (generating a static library, fixing the flags used during compilation).
I agree with @Ingvord that a binary release would be useful for testing by the community. I also agree with @bourtemb that we need to bump the version and maybe add an RC suffix to indicate it is a release candidate before it becomes labelled as stable.
Dear all, I have a device server with 80 attributes. When I set it up with 100 device instances and run it, I see a high CPU consumption at start-up (~80%) and then, after init, still 40-50%. I poll almost all attributes with an interval of 1-5 s. I notice that this seems to influence the clients... If I open an atkpanel for a random instance, the attributes will, every few seconds, just show gray boxes instead of their values. With a taurus-gui the same happens, plus it outputs:
desc = The polling thread is late and discard this object polling. Advice: Tune device server polling
origin = PollThread::err_out_of_sync
reason = API_PollThreadOutOfSync
This seems to be correlated with the CPU consumption. The fewer instances I set up, the lower the CPU consumption, and this weird behaviour from the clients does not appear.
I have already updated my Tango version to tango-9-lts hoping the issue would be resolved, but in fact I cannot see any difference from before with respect to the high CPU consumption (I hope I updated the installed version correctly; I followed the install instructions here https://github.com/tango-controls/cppTango/blob/tango-9-lts/INSTALL.md without docker and replaced libtango.so with the newly generated one). Have you ever seen such a behaviour? Have you tested such a setup with multiple device instances? Thanks a lot!
Hi, just to be sure I understand your use case well: you are running a device server instance which is exporting 100 devices, is this correct? If so, I think what you are observing is expected. When a device server instance exports several devices, a pool of threads is dedicated to the polling. By default, only one thread is dedicated to the polling of all the devices exported by your device server instance. You can change that and configure this pool of threads from jive or astor, using the polling threads manager to add more threads and to define which thread will be responsible for which device. For instance, in jive, you can start the polling threads manager:
Please refer to the following documentation to learn how to configure the polling thread pool: http://tango-controls.readthedocs.io/en/latest/development/device-api/device-polling.html?highlight=polling#configuring-the-polling-threads-pool In the case you were describing, if you didn't change it, only one thread was doing the polling for all your devices, so that thread didn't have enough CPU time to read all your polled attributes and to respect all the polling periods you defined. This is why Tango sends these errors: to warn you that your device polling configuration is not adequate for what you want to achieve.
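For completeness, here is a minimal sketch of doing the same thing programmatically rather than from jive, assuming the admin device properties described in the documentation linked above (polling_threads_pool_size, plus polling_threads_pool_conf for assigning devices to threads); please double-check the exact property names in the docs, and note that, as far as I know, the server has to be restarted for the new pool configuration to be taken into account. "MyServer/my_instance" is of course a placeholder.

```cpp
// Hedged sketch: setting the polling thread pool size on the admin device.
#include <tango.h>
#include <string>

int main()
{
    Tango::Database db;

    // Admin device of the server instance (adjust server/instance names)
    const std::string adm_dev = "dserver/MyServer/my_instance";

    Tango::DbDatum pool_size("polling_threads_pool_size");
    Tango::DevLong nb_threads = 4;          // e.g. 4 polling threads
    pool_size << nb_threads;

    Tango::DbData data;
    data.push_back(pool_size);
    db.put_device_property(adm_dev, data);

    // polling_threads_pool_conf (as I understand it, one string per thread,
    // listing the devices that thread polls) can be set the same way.
    return 0;
}
```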
Thanks a lot! Yes, that was exactly the problem. I added more threads to the pool and indeed the CPU usage goes down, and the error does not appear anymore. So thanks, and sorry for spamming this thread.
Solved with the combination of PR #430 and #458
An issue has been reported by the SKA consortium when running a Tango device server with hundreds of attributes on a small CPU - the CPU load goes up to 100% for their CPU (a Taskit Stamp9G20, SPFC) when one or more external Tango clients connect to this server. This issue could be related to the way attributes are accessed in the Tango library. This issue will be used to try to reproduce the CPU load problem and to fix it if it is in the library.
Attached are some plots (produced with nmon) provided by Isak Theron (SKA), together with the source code for the device servers used to demonstrate the issue: spfcPerformance_200.tar.gz pyPerformance_200.tar.gz