medic / cht-sync

Data synchronization between CouchDB and PostgreSQL for the purpose of analytics.
GNU General Public License v3.0

Increase cht-sync throughput #80

Closed · njuguna-n closed 7 months ago

njuguna-n commented 7 months ago

Cht-sync is currently running against the https://brac-clone-for-phil.dev.medicmobile.org instance, but document syncing is very slow. When I checked a few hours ago there was a backlog of 41,168,339 documents waiting to be synced (see screenshot below).

Image

njuguna-n commented 7 months ago

I looked into increasing PostgREST connections but, after reading the documentation, decided against it and turned to Logstash performance instead. Pipeline stats for the active Logstash pipeline showed a flow rate of about 9 docs per second, so I am looking into how to increase this. There are two main levers for performance: worker count and batch size. I tried adding an environment variable called PIPELINE_WORKERS, setting it to 100, and restarting the Logstash container, but that did not change the actual worker count, which remained at 4. I will continue with the performance tuning and investigation tomorrow.
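For reference, and this is an assumption on my part rather than something verified against the cht-sync setup: with the official Logstash Docker image, per-pipeline settings in a mounted pipelines.yml take precedence over values derived from environment variables like PIPELINE_WORKERS, which could explain why the worker count stayed at 4. A minimal sketch of setting both levers directly in pipelines.yml (the pipeline id and config path are illustrative, not the real cht-sync values):

    # Hypothetical pipelines.yml entry; the id and path are illustrative.
    - pipeline.id: main
      path.config: "/usr/share/logstash/pipeline/*.conf"
      pipeline.workers: 100      # worker threads; defaults to the number of CPU cores
      pipeline.batch.size: 500   # events each worker pulls before running filters and outputs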

CC: @lorerod in case you are able to look into this during your work day.

Image

lorerod commented 7 months ago

@njuguna-n do you have a PR for this, or a branch?

njuguna-n commented 7 months ago

@lorerod no specific branch or PR yet. I was playing around with existing config to see what works.

njuguna-n commented 7 months ago

Using Redis has increased throughput to around 1,000 docs per second with 250 Logstash workers. I'll try adding more workers to see if that can be improved further. Data is not yet being copied over to Postgres due to an error in the worker I created; I'm debugging that now. Draft PR of current work here.
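For context, this is the rough shape of the setup being tested, as a docker-compose sketch; the service names, image tags, and values here are illustrative and not taken from the draft PR:

    # Illustrative compose fragment: Logstash pushes changes into Redis, and a
    # separate worker drains Redis and writes to Postgres.
    services:
      redis:
        image: redis:7
      logstash:
        image: docker.elastic.co/logstash/logstash:8.11.1   # tag illustrative
        environment:
          PIPELINE_WORKERS: "250"   # worker count used in this test
        depends_on:
          - redis
      worker:
        build: ./worker             # the custom Redis-to-Postgres worker
        depends_on:
          - redis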

Image

Image

njuguna-n commented 7 months ago

The issue was that Redis and the worker were not on the same Docker network. Replication is now happening; the speed has dropped, but it is still ten times faster than before (see screenshot below). I'll leave this running as is over the weekend and clean up my draft PR if it holds up well.

Image
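For anyone hitting the same thing later, the fix boils down to attaching both containers to the same user-defined network so the worker can resolve Redis by service name; a sketch with illustrative names:

    # Illustrative compose fragment: redis and the worker share one network.
    networks:
      cht-net:
    services:
      redis:
        image: redis:7
        networks:
          - cht-net
      worker:
        build: ./worker
        environment:
          REDIS_HOST: redis   # hypothetical variable; resolvable only on a shared network
        networks:
          - cht-net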

njuguna-n commented 7 months ago

There was an issue with Logstash logs growing too large and filling the disk space on the server, leading to failures. After fixing that and restarting, we now get errors from Postgres about insufficient resources, which we'll have to sort out before continuing with this test. The good news is that the Redis solution seems to be working well, and throughput went up to around 250 docs per second, which is good enough for now. We can test the upper limits later when doing load testing.
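On the log growth, one common way to keep container logs from filling the disk is to cap the json-file logging driver; this is a sketch of that approach, not necessarily the exact fix applied here:

    # Illustrative compose fragment: rotate and cap Logstash container logs.
    services:
      logstash:
        logging:
          driver: json-file
          options:
            max-size: "100m"   # rotate each log file at 100 MB
            max-file: "3"      # keep at most three rotated files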

Image

njuguna-n commented 7 months ago

The database issue was fixed with a reboot, but I'm still not sure of its root cause. I restarted the sync today and throughput was again around 1,000 documents per second, but the Logstash container fails with the error below after about an hour.

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Ruby-0-Thread-6@puma srv threadpool reaper: /usr/share/logstash/vendor/bundle/jruby/3.1.0/gems/puma-6.4.0-java/lib/puma/thread_pool.rb:345"
[2024-04-15T07:31:00,380][FATAL][org.logstash.Logstash    ] uncaught error (in thread Ruby-0-Thread-1: /usr/share/logstash/vendor/bundle/jruby/3.1.0/gems/logstash-output-elasticsearch-11.19.0-java/lib/logstash/outputs/elasticsearch/http_client/pool.rb:230)
java.lang.OutOfMemoryError: Java heap space
    at java.lang.Long.valueOf(Long.java:1211) ~[?:?]
    at org.jruby.RubyThread$SleepTask2.run(RubyThread.java:1698) ~[jruby.jar:?]
    at org.jruby.RubyThread$SleepTask2.run(RubyThread.java:1682) ~[jruby.jar:?]
    at org.jruby.RubyThread.executeTask(RubyThread.java:1751) ~[jruby.jar:?]
    at org.jruby.RubyThread.executeTaskBlocking(RubyThread.java:1725) ~[jruby.jar:?]
    at org.jruby.RubyThread.sleep(RubyThread.java:1599) ~[jruby.jar:?]
    at org.jruby.RubyKernel.sleep(RubyKernel.java:739) ~[jruby.jar:?]
    at org.jruby.RubyKernel$INVOKER$s$0$1$sleep.call(RubyKernel$INVOKER$s$0$1$sleep.gen) ~[jruby.jar:?]
    at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:825) ~[jruby.jar:?]
    at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:220) ~[jruby.jar:?]
    at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:242) ~[jruby.jar:?]
    at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:61) ~[jruby.jar:?]
    at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:301) ~[jruby.jar:?]
    at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66) ~[jruby.jar:?]
    at org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:128) ~[jruby.jar:?]
    at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:115) ~[jruby.jar:?]
    at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:452) ~[jruby.jar:?]
    at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:92) ~[jruby.jar:?]
    at org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:103) ~[jruby.jar:?]
    at org.jruby.ir.instructions.CallBase.interpret(CallBase.java:558) ~[jruby.jar:?]
    at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:367) ~[jruby.jar:?]
    at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66) ~[jruby.jar:?]
    at org.jruby.ir.interpreter.Interpreter.INTERPRET_BLOCK(Interpreter.java:116) ~[jruby.jar:?]
    at org.jruby.runtime.MixedModeIRBlockBody.commonYieldPath(MixedModeIRBlockBody.java:136) ~[jruby.jar:?]
    at org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:66) ~[jruby.jar:?]
    at org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:58) ~[jruby.jar:?]
    at org.jruby.runtime.Block.call(Block.java:143) ~[jruby.jar:?]
    at org.jruby.RubyProc.call(RubyProc.java:352) ~[jruby.jar:?]
    at org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:110) ~[jruby.jar:?]
    at java.lang.Thread.run(Thread.java:840) [?:?]

andrablaj commented 7 months ago

In case the Java heap space error persists, I found this article with some possible solutions for configuring Logstash.
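If it does come back, one of the usual knobs (an assumption here, not something tested on this instance) is giving the Logstash JVM more heap, since more workers and larger batches mean more in-flight events held in memory:

    # Illustrative compose fragment: raise the Logstash heap via LS_JAVA_OPTS.
    services:
      logstash:
        environment:
          LS_JAVA_OPTS: "-Xms2g -Xmx2g"   # fixed 2 GB heap; the size is illustrative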

njuguna-n commented 7 months ago

Reducing the number of workers from 200 to 100 seems to have kept the Logstash container stable. The current throughput is 438 docs per second, and all docs should be synced in about 20 hours if it remains stable 🤞
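For the record, the setting that is currently holding up, expressed as a compose fragment (everything other than the worker count is illustrative):

    # Illustrative compose fragment reflecting the current configuration.
    services:
      logstash:
        environment:
          PIPELINE_WORKERS: "100"   # reduced from 200 to keep the container stable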

Image

medic-ci commented 2 months ago

:tada: This issue has been resolved in version 1.0.0 :tada:

The release is available on GitHub release

Your semantic-release bot :package::rocket: