vsbogd opened 4 years ago
Not to hijack the conversation, but the flame-graph also points at `opencog::SourceSet`, which loses huge amounts of time in boost, which calls `operator==`, which calls pthread stuff. Over the last decade, I've repeatedly discovered that boost is a performance nightmare -- rip it out of your code, and everything will run 2x or 3x faster. I mean, it seems odd to me that pthread-anything should be more than 1% of the total performance; that indicates thundering herds in `operator==`.
Also: what's `__GI__exp` and `__ieee7`? These seem to be eating lots of time...
Also: as a general rule, you want to run any/all atomspace things with `HUGETLB_MORECORE=yes LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libhugetlbfs.so.0`, as this will in general give you a 50% performance boost.
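For instance, a minimal sketch (not from the original thread) of launching a guile shell that uses the atomspace this way; it assumes libhugetlbfs is installed at the path above and that hugepages have been reserved as described next:

```sh
# Back malloc with 2MB hugepages via libhugetlbfs, then start the process.
HUGETLB_MORECORE=yes \
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libhugetlbfs.so.0 \
guile
```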
Using 2MB-sized HugePages can also offer a large performance boost, both for Postgres, and for the atomspace, especially when working with a large atomspace. The procedure to set these up is a bit complicated. First, add a hugepages user-group, and add postgres to it:
```
sudo groupadd hugepages
sudo gpasswd -a postgres hugepages
```
Then you need to find out what the group id (gid) was:

```
id postgres
```

Suppose that this shows group id 1234 for `hugepages`. This needs to be added to `/etc/sysctl.conf` as well. So edit, and add:
```
vm.nr_hugepages = 16384   # 32GB of hugepages, 25% of RAM.
vm.hugetlb_shm_group=1234
```
Don't forget to run `sudo sysctl -p /etc/sysctl.conf` again.
Finally, grant the ability to use those pages. Add to `/etc/security/limits.conf`:

```
@hugepages soft memlock unlimited
@hugepages hard memlock unlimited
```
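A quick sanity check (my addition, not from the thread) after reloading sysctl:

```sh
# HugePages_Total should match the vm.nr_hugepages value set above;
# HugePages_Free shows how many are still available.
grep -i hugepages /proc/meminfo
```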
Actually, you may as well set hugepages to 80% of RAM, or whatever the typical working-set size for the atomspace is.
> the flame-graph also points at opencog::SourceSet which loses huge amounts of time in boost which calls operator== which calls pthread stuff
Linas, thank you for spotting this. I believe the pthread stuff is a mutex lock at: https://github.com/opencog/ure/blob/0a353dd99b26ae70ed6543b755055319298b1252/opencog/ure/forwardchainer/SourceSet.cc#L37-L41 so the root cause is not boost but intensive locking.
> Also: what's `__GI__exp` and `__ieee7`? These seem to be eating lots of time...
I believe it is the exponent calculation here: https://github.com/opencog/ure/blob/0a353dd99b26ae70ed6543b755055319298b1252/opencog/ure/forwardchainer/SourceSet.cc#L175-L188
Hmm. Given that `exp` is fast (should be less than 50-100 clock cycles), the fact that it's showing up a lot means it's being called a lot. Probably too much. Similar remarks for the mutex -- it should be no more than dozens of clocks, unless there's a huge amount of lock contention...
Thanks. Great findings!
Regarding the locks, these are just crude half-way steps towards a parallel URE. Once we move to C++17 I'll use more refined (unique/shared) mutexes. Maybe some thread pool, etc., not sure yet.
Regarding `exp`, it is used to calculate the probability distribution over sources; obviously this should be cached.
I profiled ForwardChainer to find out how performance can be improved and found that it spends a significant amount of time in the `Rule::unify_source` method. The amount of time depends on rule complexity: the more complex the rules, the more time is spent there. The following benchmark demonstrates this: https://github.com/opencog/benchmark/pull/27
The problem is that `unify_source` does almost nothing. The first part converts the rule's `BindLink` into another one with random variable naming (`RewriteLink::alpha_convert`), and the second part constructs a new `BindLink` with some variables grounded.

A proper fix should probably be raised in opencog/atomspace, but I am raising the issue here as the main concern is URE performance.

Flame graph of the code benchmarked (better opened in a web browser): perf.svg.gz
The following steps require `perf` and the FlameGraph tool (https://github.com/brendangregg/FlameGraph). An Intel CPU with support for Intel Processor Trace is required as well.

Steps to reproduce:

Run the benchmark in the background (as sketched below):
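The exact command is not preserved in this thread; a plausible shape, assuming a benchmark binary built from the pull request above (the name is hypothetical):

```sh
# Hypothetical binary name; substitute whatever opencog/benchmark PR #27 builds.
# Keep the PID around for the perf capture in the next step.
./ure-forward-chainer-benchmark &
BENCH_PID=$!
```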
Collect a profile using Intel Processor Trace. You only need 1 second of profile; it will be too large otherwise:
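Again, the original command is not shown here; a typical one-second Intel PT capture of a running process looks like this (assuming `$BENCH_PID` from the previous step):

```sh
# intel_pt// selects the Intel Processor Trace PMU; "sleep 1" bounds the capture.
sudo perf record -e intel_pt// -p "$BENCH_PID" -- sleep 1
```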
Build a flame graph of the `do_step` call (it can take a lot of time):
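The original pipeline is not shown; a sketch using the standard FlameGraph scripts, filtering the collapsed stacks to those containing `do_step` (the `--itrace` period is a guess):

```sh
# Decode the Intel PT trace into instruction samples with call chains,
# fold the stacks, keep only those through do_step, and render the SVG.
sudo perf script --itrace=i100usg | \
    ./FlameGraph/stackcollapse-perf.pl | \
    grep do_step | \
    ./FlameGraph/flamegraph.pl > perf.svg
```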