opencog / ure

Unified Rule Engine. Graph rewriting system for the AtomSpace. Used as reasoning engine for OpenCog.

Rule::unify_source method execution takes about 30% of ForwardChainer execution time #54

Open vsbogd opened 4 years ago

vsbogd commented 4 years ago

I profiled ForwardChainer to find out how performance can be improved and found that it spends a significant amount of time in the Rule::unify_source method. The share depends on rule complexity: the more complex the rules, the larger it becomes. The following benchmark demonstrates this: https://github.com/opencog/benchmark/pull/27

The problem is that unify_source does almost nothing. The first part converts the rule's BindLink into another one with randomly renamed variables (RewriteLink::alpha_convert), and the second part constructs a new BindLink with some variables grounded.
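For illustration only (this is not the actual AtomSpace API): a toy sketch of those two steps, assuming a made-up Term type. Both steps rebuild the whole term tree on every call, which is presumably where the time goes:

    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // Toy term tree: a named node that may be a variable, with an outgoing set.
    struct Term {
        std::string name;
        bool is_variable = false;
        std::vector<std::shared_ptr<Term>> out;
    };
    using TermPtr = std::shared_ptr<Term>;

    // Step 1 (cf. RewriteLink::alpha_convert): copy the tree, giving every
    // variable a fresh name.
    TermPtr alpha_convert(const TermPtr& term, std::map<std::string, std::string>& fresh)
    {
        auto copy = std::make_shared<Term>();
        copy->is_variable = term->is_variable;
        copy->name = term->name;
        if (term->is_variable) {
            auto it = fresh.find(term->name);
            if (it == fresh.end())
                it = fresh.emplace(term->name,
                                   term->name + "-" + std::to_string(fresh.size())).first;
            copy->name = it->second;
        }
        for (const auto& child : term->out)
            copy->out.push_back(alpha_convert(child, fresh));
        return copy;
    }

    // Step 2: copy the tree again, replacing some variables by their groundings.
    TermPtr ground(const TermPtr& term, const std::map<std::string, TermPtr>& groundings)
    {
        if (term->is_variable) {
            auto it = groundings.find(term->name);
            if (it != groundings.end())
                return it->second;
        }
        auto copy = std::make_shared<Term>();
        copy->is_variable = term->is_variable;
        copy->name = term->name;
        for (const auto& child : term->out)
            copy->out.push_back(ground(child, groundings));
        return copy;
    }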

A proper fix should probably be raised against opencog/atomspace, but I am raising the issue here since the main concern is URE performance.

Flame graph of the benchmarked code (better opened in a web browser): perf.svg.gz

The following steps require perf and the FlameGraph tool (https://github.com/brendangregg/FlameGraph). An Intel CPU with Intel Processor Trace support is required as well.

Steps to reproduce. Run the benchmark in the background:

./micro/benchmark --benchmark_filter=ForwardChainer_Basic --benchmark_min_time=600 &

Collect a profile using Intel Processor Trace. You need only one second of profile; it will be too large otherwise:

perf record -e intel_pt//u -p `pidof benchmark` sleep 1

Build a flame graph of the do_step call (this can take a long time):

perf script --itrace=i100nsg | ./FlameGraph/stackcollapse-perf.pl | fgrep 'benchmark;BM_ForwardChainer_Basic;opencog::ForwardChainer::do_chain;opencog::ForwardChainer::do_step_rec;opencog::ForwardChainer::do_step;' | ./FlameGraph/flamegraph.pl > perf.svg
linas commented 4 years ago

Not to hijack the conversation, but the flame graph also points at opencog::SourceSet, which loses huge amounts of time in boost, which calls operator==, which calls pthread stuff. Over the last decade I've repeatedly discovered that boost is a performance nightmare -- rip it out of your code and everything will run 2x or 3x faster. It seems odd to me that pthread-anything should be more than 1% of the total runtime; that suggests thundering herds in operator==.

Also: what's __GI__exp and __ieee7? These seem to be eating lots of time...

Also: as a general rule, you want to run any and all atomspace things with HUGETLB_MORECORE=yes LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libhugetlbfs.so.0, as this will generally give you a 50% performance boost.

linas commented 4 years ago

If you haven't configured hugepages, see below; it's cut-n-paste from atomspace-postgres, so you'll have to adjust these instructions for your app:

Tuning HugePages

Using 2MB-sized HugePages can also offer a large performance boost, both for Postgres and for the atomspace, especially when working with a large atomspace. The procedure to set these up is a bit complicated. First, add a hugepages user group, and add postgres to it:

   sudo groupadd hugepages
   sudo gpasswd -a postgres hugepages

Then you need to find out what the group id (gid) was:

   id postgres

Suppose this shows group id 1234 for hugepages. This needs to be added to /etc/sysctl.conf as well, so edit it and add:

   vm.nr_hugepages = 16384       # 32GB of hugepages, 25% of RAM.
   vm.hugetlb_shm_group=1234
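(For reference: each huge page here is 2 MB, so 16384 pages × 2 MB = 32 GB; in general, nr_hugepages = desired gigabytes × 512.)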

Don't forget to run sudo sysctl -p /etc/sysctl.conf again. Finally, grant the ability to use those pages by adding to /etc/security/limits.conf:

    @hugepages      soft    memlock         unlimited
    @hugepages      hard    memlock         unlimited
linas commented 4 years ago

Actually, you may as well set hugepages to 80% of RAM, or whatever the typical working-set size of the atomspace is.

vsbogd commented 4 years ago

the flame graph also points at opencog::SourceSet, which loses huge amounts of time in boost, which calls operator==, which calls pthread stuff

Linas, thank you for spotting this. I believe the pthread stuff is a mutex lock at: https://github.com/opencog/ure/blob/0a353dd99b26ae70ed6543b755055319298b1252/opencog/ure/forwardchainer/SourceSet.cc#L37-L41 so the root cause is not boost but intensive locking.

Also: what's __GI__exp and __ieee7? These seem to be eating lots of time...

I believe it is the exponent calculation here: https://github.com/opencog/ure/blob/0a353dd99b26ae70ed6543b755055319298b1252/opencog/ure/forwardchainer/SourceSet.cc#L175-L188

linas commented 4 years ago

Hmm. Given that exp is fast (it should take less than 50-100 clock cycles), the fact that it's showing up a lot means it's being called a lot. Probably too much. Similar remarks for the mutex -- it should be no more than dozens of clocks, unless there's a huge amount of lock contention...

ngeiswei commented 4 years ago

Thanks. Great findings!

Regarding the locks, these are just crude half-way steps toward a parallel URE. Once we move to C++17 I'll use more refined (unique/shared) mutexes. Maybe some thread pool, etc.; not sure yet.
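For illustration, a minimal sketch of the unique/shared locking mentioned, assuming C++17's std::shared_mutex (the class and member names are made up):

    #include <cstddef>
    #include <shared_mutex>
    #include <vector>

    // Hypothetical sketch: many readers can inspect the sources concurrently,
    // while updates take an exclusive lock.
    class SourceSetSketch {
    public:
        double total_weight() const {
            std::shared_lock<std::shared_mutex> lock(_mutex);  // shared: readers don't block each other
            double total = 0.0;
            for (double w : _weights) total += w;
            return total;
        }
        void set_weight(std::size_t i, double w) {
            std::unique_lock<std::shared_mutex> lock(_mutex);  // exclusive: writers block everyone
            _weights[i] = w;
        }
    private:
        mutable std::shared_mutex _mutex;
        std::vector<double> _weights;
    };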

Regarding exp, it is used to calculate the probability distribution over sources; obviously this should be cached.
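For illustration, a minimal sketch of that kind of caching, with hypothetical names (the real computation lives in SourceSet.cc): recompute the exp()-based weights only when the underlying scores change, instead of on every step.

    #include <cmath>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Hypothetical sketch: cache the exp()-based weights and mark them dirty
    // whenever the underlying scores are updated.
    class SourceWeightCache {
    public:
        void set_scores(std::vector<double> scores) {
            _scores = std::move(scores);
            _dirty = true;
        }
        const std::vector<double>& weights() {
            if (_dirty) {
                _weights.resize(_scores.size());
                for (std::size_t i = 0; i < _scores.size(); ++i)
                    _weights[i] = std::exp(_scores[i]);  // expensive part, now done once per update
                _dirty = false;
            }
            return _weights;
        }
    private:
        std::vector<double> _scores, _weights;
        bool _dirty = true;
    };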