jemc / jylis

A distributed in-memory database for Conflict-free Replicated Data Types (CRDTs).
https://jemc.github.io/jylis
Mozilla Public License 2.0

TLOG experiences performance issues with large recordset #9

Open · amclain opened 6 years ago

amclain commented 6 years ago

After restarting a single-node Jylis server with disk persistence, GETting a large TLOG key (>100k items) causes resource starvation: CPU goes to 100%, Jylis can't process other queries, and the executing query doesn't seem to return. The same query took about 1000 ms before the restart. A GET limited to 100 items succeeds, as does adding new items with INS.

Jylis at rest after starting up and ingesting the TLOG from disk:

(screenshot)

Jylis after sending the TLOG GET:

(screenshot)
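
For reference, here is a rough sketch of the three operations described above, written against the same client wrapper shown under Additional Notes below. The count-limited GET and the INS arguments follow my reading of the TLOG docs, and the inserted value and timestamp are just placeholders.

```ruby
# Full GET of the large TLOG -- this is the query that doesn't return:
Connection.current.write ["TLOG", "GET", "temperature"]
result = Connection.current.read

# GET limited to 100 items -- this returns quickly:
Connection.current.write ["TLOG", "GET", "temperature", "100"]
result = Connection.current.read

# Inserting a new item still works (placeholder value and timestamp):
Connection.current.write ["TLOG", "INS", "temperature", "72.1", Time.now.to_i.to_s]
Connection.current.read
```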

I tried stepping up the item count and it had interesting results. I did this to try to keep a batch of items "hot" in memory, to see if it would speed up the next query. 10k items worked fine. Then I bumped the count up to 20k and that was fine too. I worked up to 80k items this way before Jylis became excessively unresponsive.
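
A rough sketch of that ramp-up, again using the client wrapper from Additional Notes (the actual step sizes were ad hoc; these are illustrative):

```ruby
# Request progressively larger batches and watch where responsiveness drops off.
[10_000, 20_000, 40_000, 80_000].each do |count|
  Connection.current.write ["TLOG", "GET", "temperature", count.to_s]
  Connection.current.read
  puts "fetched #{count} items"
end
```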

Then I restarted Jylis and the web app and tried a 45k-item batch. It took 57 seconds to return from the database, which was close to the 60-second app timeout limit. I pulled the same 45k batch again and Jylis responded in 431 ms. If I wait several minutes with the processes still running, the long request happens again, followed by shorter requests after that. I'm not sure if this is intentional caching in Jylis or a side effect of the garbage collector, but I thought I should point out the behavior.
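
The timings above were measured at the application level. A minimal sketch of how a single round trip could be timed, using the same client wrapper and Benchmark from the Ruby standard library:

```ruby
require "benchmark"

# Time one round trip for the 45k-item batch described above.
elapsed = Benchmark.realtime do
  Connection.current.write ["TLOG", "GET", "temperature", "45000"]
  Connection.current.read
end
puts "TLOG GET (45k items) returned in #{(elapsed * 1000).round} ms"
```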

Additional Notes

```ruby
# Full TLOG GET issued by the app for the "temperature" key:
Connection.current.write ["TLOG", "GET", "temperature"]
result = Connection.current.read
```

I realize this use of Jylis (>100k items in a TLOG) may be abusive to Jylis' ideal use case, pushing it beyond its intended limits; the intent of this test was to find Jylis' breaking point. If that's the case, I have no problem trimming the TLOG at this point (or setting a limit until the cursor is implemented). However, if you would like to make Jylis performant in this scenario, I can certainly share the TLOG data so that the issue can be reproduced.
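
For the trimming workaround, I'm assuming the TLOG TRIM function from the docs (trim the log down to a given number of items); if that's right, it would look roughly like this:

```ruby
# Assumed workaround: keep only the most recent 10,000 items in the log.
Connection.current.write ["TLOG", "TRIM", "temperature", "10000"]
Connection.current.read
```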

I'm a little concerned that the extreme delays started happening after a deploy (a new Jylis image may have been pulled in addition to the service restarting). I'm not sure if this is a regression or if I just got lucky by running Jylis for so long without a restart.

This also exposes a couple of issues I'm concerned about that are not related to a large data set: one running query seems to tie up the Jylis connection, so no other queries can be made and no clients can connect while a query is running.
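
A minimal sketch of how that could be reproduced, assuming Jylis is reachable on localhost:6379 and using the plain redis gem (Jylis speaks the Redis wire protocol) instead of the app's Connection wrapper:

```ruby
require "redis"

# First connection issues the large, slow TLOG GET.
slow = Thread.new do
  conn = Redis.new(host: "localhost", port: 6379)
  conn.call("TLOG", "GET", "temperature")
end

sleep 1  # give the slow query time to start

# While it runs, a second client tries to connect and issue a tiny query.
# In the behavior described above, this hangs as well.
probe = Redis.new(host: "localhost", port: 6379)
puts probe.call("TLOG", "GET", "temperature", "1").inspect

slow.join
```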