Closed: Radivarig closed this issue 4 years ago.
Same for LargeZipfUTest.
Both tests log this before as well: [INFO] Bulk store to database with max UUID=1
OK, so first: these two tests do run significantly longer than any others (maybe I should scale them back). On my system, the relative timings are:
Start 49: BasicSaveUTest
49/135 Test #49: BasicSaveUTest ................... Passed 3.04 sec
Start 50: ValueSaveUTest
50/135 Test #50: ValueSaveUTest ................... Passed 4.59 sec
Start 51: PersistUTest
51/135 Test #51: PersistUTest ..................... Passed 1.48 sec
Start 52: DeleteUTest
52/135 Test #52: DeleteUTest ...................... Passed 3.40 sec
Start 53: FetchUTest
53/135 Test #53: FetchUTest ....................... Passed 2.49 sec
Start 54: MultiPersistUTest
54/135 Test #54: MultiPersistUTest ................ Passed 1.69 sec
Start 55: MultiUserUTest
55/135 Test #55: MultiUserUTest ................... Passed 4.79 sec
Start 56: LargeFlatUTest
56/135 Test #56: LargeFlatUTest ................... Passed 225.69 sec
Start 57: LargeZipfUTest
57/135 Test #57: LargeZipfUTest ................... Passed 285.84 sec
Note that they take 50x to 100x longer than the other tests.
While these are running, postgres devotes 1 to 6 threads, each running at 30% to 70% CPU; the unit test itself varies from 1% to 60% CPU, but is usually low. So this is very bottlenecked on postgres. (I added these tests not because I think postgres is in danger; rather, I was working on a distributed-hash-table version, which is flakey as hell and can't deal with this kind of data. It occurred to me to be safety-minded and backport the tests to the atomspace.)
Getting back to postgres: its actual performance is highly susceptible to tuning. There are a number of parameters one can play with, and opencog/persist/sql/README.md reviews the key ones. Also, SSDs can often make a huge difference (I don't know if/how they'd make a difference for these tests).
I did not look to see if my log filled with these messages, but they are not a surprise: the unit test must wait for postgres to finish. So perhaps this is "not a bug" .... maybe you just have to be patient ...
Also, you can manually reduce the size of these tests: edit them, grep for NCOPIES, and make it smaller. (It's NWORDS and NPAIRS in the other.) I don't want to change the defaults, at least not yet, unless everyone complains. CircleCI seems to deal with it. And my box is 5+ years old, so these tests are faster for most people ....
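For example, such an edit is a one-liner with sed. The file name and the value 100 below are hypothetical stand-ins, not the actual test source; the real constants live in the unit-test files themselves:

```shell
# Stand-in file; the real constant lives in the LargeFlatUTest source.
f=/tmp/ncopies_demo.h
echo '#define NCOPIES 100' > "$f"   # hypothetical current value

# Halve the workload by shrinking the constant:
sed -i 's/NCOPIES [0-9]*/NCOPIES 50/' "$f"
cat "$f"   # -> #define NCOPIES 50
```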
I should not have assumed it was indefinite, although my patience was 300 seconds.
These are the execution times:
...
Start 48: AtomSpaceUtilsUTest
48/127 Test #48: AtomSpaceUtilsUTest .............. Passed 0.03 sec
Start 49: BasicSaveUTest
49/127 Test #49: BasicSaveUTest ................... Passed 1.89 sec
Start 50: ValueSaveUTest
50/127 Test #50: ValueSaveUTest ................... Passed 2.42 sec
Start 51: PersistUTest
51/127 Test #51: PersistUTest ..................... Passed 0.78 sec
Start 52: DeleteUTest
52/127 Test #52: DeleteUTest ...................... Passed 1.81 sec
Start 53: FetchUTest
53/127 Test #53: FetchUTest ....................... Passed 1.04 sec
Start 54: MultiPersistUTest
54/127 Test #54: MultiPersistUTest ................ Passed 1.22 sec
Start 55: MultiUserUTest
55/127 Test #55: MultiUserUTest ................... Passed 3.33 sec
Start 56: LargeFlatUTest
56/127 Test #56: LargeFlatUTest ................... Passed 1127.55 sec
Start 57: LargeZipfUTest
57/127 Test #57: LargeZipfUTest ................... Passed 1385.08 sec
Start 58: BasicSCMUTest
58/127 Test #58: BasicSCMUTest .................... Passed 2.65 sec
...
I'm running on these specs:
λ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-8705G CPU @ 3.10GHz
Stepping: 9
CPU MHz: 898.058
CPU max MHz: 4100.0000
CPU min MHz: 800.0000
BogoMIPS: 6192.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
Parallel checking is disabled, as mentioned in issue #34.
Well, that is on the long side. Again, postgres is not usually CPU-bound; having a fast CPU does little -- it's I/O-bound, writing to disk (and hitting other I/O bottlenecks: cache-to-RAM, user-kernel, RAM-to-DMA (PCIe), PCIe-to-SATA). The CircleCI unit tests run in 164 seconds; they have a CPU comparable to, or slower than, yours.
Cut-n-paste from the tuning README:
The Postgres default configuration can be (and should be) tweaked for performance. The performance will be disastrously slow if the database is not tuned. The primary performance bottleneck is the default of synchronous commits during writing. On spinning disk drives, this can lead to delays of tens of milliseconds to write handfuls of atoms, as that is the seek time (latency) for spinning disk media. Thus, synchronous commits should be disabled.
Solid state disks are a lot faster; it's not clear if this would still be a bottleneck.
Disabling synchronous commits may cause the latest database updates to be lost, if power goes out, or the system unexpectedly reboots. This kind of loss is usually not a problem for most opencog apps, ... all of the current apps are gathering statistics from generated data, and so this kind of loss is almost surely inconsequential.
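For reference, here's a minimal sketch of the edit itself, done on a scratch copy of the file; the path and the commented-out default shown below are assumptions, and your postgresql.conf may differ:

```shell
# Practice on a scratch copy; the real file is wherever your distro
# keeps postgresql.conf (e.g. /etc/postgresql/9.5/main/).
conf=/tmp/postgresql.conf.demo
printf '#synchronous_commit = on\n' > "$conf"

# Uncomment the line if needed, and turn synchronous commits off:
sed -i 's/^#\{0,1\}synchronous_commit = .*/synchronous_commit = off/' "$conf"
grep '^synchronous_commit' "$conf"   # -> synchronous_commit = off
```

Remember that the real server only sees the change after a config reload or restart.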
Edit postgresql.conf (a typical location is /etc/postgresql/9.5/main/postgresql.conf) and make the changes below. The first two changes are recommended by http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
shared_buffers = default was 32MB, change to 25% of installed RAM
work_mem = default was 1MB change to 32MB
effective_cache_size = default was 128MB, change to 50%-75% of installed RAM
synchronous_commit = default on change to off
wal_buffers = default 64kB change to 2MB or even 32MB
checkpoint_segments = 64 (each one takes up 16MB disk space)
max_connections = 130 (each opencog instance needs 32)
max_worker_processes = 32 (one per CPU core)
ssl = off
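Several of these settings are given as a fraction of installed RAM; on Linux, a short shell sketch can turn that into concrete numbers for your machine (it reads MemTotal from /proc/meminfo, so it won't work on non-Linux systems):

```shell
# Total RAM in kilobytes, as reported by the kernel:
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)

# 25% of RAM for shared_buffers; 75% (the top of the suggested
# 50%-75% range) for effective_cache_size.
echo "shared_buffers = $((total_kb / 4 / 1024))MB"
echo "effective_cache_size = $((total_kb * 3 / 4 / 1024))MB"
```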
For SSD drives, the following should help:
seq_page_cost = 0.1
random_page_cost = 0.15
effective_io_concurrency = 20
For spinning media, synchronous_commit=off is 120x faster than synchronous_commit=local (about 400 atoms/sec vs. 3 atoms/sec).
For write-mostly databases, such as in the language-learning project, you will get better results with checkpoint_segments = 100.
If you have postgres 9.5 or newer, there is no checkpoint_segments setting. Instead, do this:
checkpoint_timeout = 1h
max_wal_size = 8GB
checkpoint_completion_target = 0.9
This will avoid the "checkpoints are occurring too frequently" warning message.
Restarting the server might lead to errors stating that max shared mem usage has been exceeded. This can be fixed by telling the kernel to use 6.4 gigabytes (for example):
vi /etc/sysctl.conf
kernel.shmmax = 6440100100
save file contents, then:
sysctl -p /etc/sysctl.conf
All new, all improved README file:
The Postgres default configuration can be (and should be) tweaked for performance. The performance will be disastrously slow if the database is not tuned. The primary performance bottleneck is the default of synchronous commits during writing. On spinning disk drives, this can lead to delays of tens of milliseconds to write handfuls of atoms, as that is the seek time (latency) for spinning disk media. Thus, synchronous commits should be disabled.
The difference between synchronous_commit=off and =on can be as much as a factor of 100x for spinning disks, and a factor of 5x for SSD drives, based on measurements on actual AtomSpace workloads.
Edit postgresql.conf (a typical location is /etc/postgresql/9.6/main/postgresql.conf) and make the changes below. The first two changes are recommended by the PostgreSQL wiki.
shared_buffers = 24GB # Change to 25% of installed RAM
work_mem = 32MB # Default was 1MB, change to 32MB
effective_cache_size = 60GB # Change to 50%-75% of installed RAM
synchronous_commit = off # Default was on, change to off
max_connections = 130 # Each AtomSpace instance needs 32
max_worker_processes = 32 # One per CPU core
ssl = off # There's no point to encryption locally
Avoid the "checkpoints are occurring too frequently" warning message by setting:
checkpoint_timeout = 1h
max_wal_size = 8GB
checkpoint_completion_target = 0.9
For SSD drives, the following can make a significant difference. There are some benchmark charts at a blog. The changes to seq_page_cost and random_page_cost represent the relative ratios of SSD speed to CPU/RAM speed (where 1.0 is for a spinning disk).
seq_page_cost = 0.1
random_page_cost = 0.1
effective_io_concurrency = 100
Restarting the server might lead to errors stating that max shared mem
usage has been exceeded. This can be fixed by telling the kernel to use
16.4 gigabytes (for example). Edit: sudo vi /etc/sysctl.conf
and add:
kernel.shmmax = 16440100100
save file contents, then:
sudo sysctl -p /etc/sysctl.conf
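To confirm the kernel actually picked up the new limit, read it back (this path is standard on Linux):

```shell
# Current maximum shared-memory segment size, in bytes; after the
# change above it should match the value written to sysctl.conf.
cat /proc/sys/kernel/shmmax
```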
Using 2MB-sized HugePages also helps. The procedure here is a bit complicated. Add a hugepages user-group, and add postgres to it:
sudo groupadd hugepages
sudo gpasswd -a postgres hugepages
Then you need to find out what the group id (gid) was:
id postgres
Suppose that this shows group id 1234 for hugepages. This needs to be added to /etc/sysctl.conf as well. So edit, and add:
vm.nr_hugepages = 16384 # 32GB of hugepages, 25% of RAM.
vm.hugetlb_shm_group=1234
Don't forget to sudo sysctl -p /etc/sysctl.conf
again.
Finally, grant the ability to use those pages. Add to /etc/security/limits.conf:
@hugepages soft memlock unlimited
@hugepages hard memlock unlimited
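Once the group, sysctl, and limits changes are in place and postgres has been restarted, two read-only commands can verify the setup; the gid lookup assumes the hugepages group created above exists:

```shell
# Huge-page counts as the kernel sees them (Total/Free/Reserved):
grep -i '^HugePages' /proc/meminfo

# The gid to put in vm.hugetlb_shm_group; empty if the group is missing:
getent group hugepages | cut -d: -f3
```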
After following my own tuning suggestions, I shaved off about 10% to 20% in execution time:
Start 56: LargeFlatUTest
56/135 Test #56: LargeFlatUTest ................... Passed 207.30 sec
Start 57: LargeZipfUTest
57/135 Test #57: LargeZipfUTest ................... Passed 237.48 sec
Great! Just by adding the above parameters to $PGDATA/postgresql.conf, I'm getting these times:
Start 56: LargeFlatUTest
56/127 Test #56: LargeFlatUTest ................... Passed 120.99 sec
Start 57: LargeZipfUTest
57/127 Test #57: LargeZipfUTest ................... Passed 142.62 sec
Updated to:
1adc208abb925b7bc168be4af47bf73e6f096d4d
887eac16689bf3b94195d64abe7823115dab1431
This prints out indefinitely for LargeFlatUTest:
[DEBUG] async_buffer overfull set; had to sleep 2649 millisecs to drain!