superscriptjs / superscript

A dialogue engine for creating chat bots
http://superscriptjs.com
MIT License

Performance issue (parsing and db import) #362

Open rensdewolf opened 7 years ago

rensdewolf commented 7 years ago

Having successfully tested and trialled the superscript engine with smaller samples, we felt it was time to create a real "production" bot with something more to say. Our real bot comprises:

During development and testing of the real bot, the time to parse the topic files increased, as expected. Unfortunately, the time required to import the resulting JSON into mongodb grew even more: spinning up a new/clean instance of the real bot now takes about 20 seconds on average, compared to a couple of seconds with our smaller samples. (One interesting thing I noticed is that parsing many smaller topic files is faster than parsing all topics combined in one large file. I assume the files are parsed individually (in parallel) and then combined into a single JSON file?)

Of course this is no problem once the real bot is finished and running in production, but making changes and testing them becomes an issue because of the delays involved in parsing and importing. So my question is where you feel the code could be improved, and whether there are any known bottlenecks I could address. Before going through the code myself, I wanted to understand what you think is wise and feasible:

  1. One thing I noticed is that the parser produces a combined JSON file, which is then read and processed by the db import function. Each item appears to be imported individually? Would it be possible to have the parser produce import data for mongodb that can be imported as a single batch instead?

  2. Would it be possible to have an "in-memory" chatbot instead of having to write everything to mongodb? I understand the engine requires mongodb at the moment, but would it be possible to rewrite the db interface functions to work with an in-memory instance instead? (Of course the bot and its data would then be volatile and lost when the bot is stopped.)

Maybe you and others have some other suggestions that could increase performance. Happy to hear about them and discuss (or even implement).

bensalilijames commented 7 years ago

Hey @rensdewolf, thanks for the issue and getting a conversation going on performance!

I believe the current bottleneck is the normalisation step that we do on triggers when parsing .ss scripts. The triggers themselves need to be turned from SuperScript syntax, `* (tell me|crack) a joke *`, into a normal regex, `(?=^|\s)\s*(tell me|crack)(?=\s|$)\s*a joke`, for runtime matching.
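As a rough illustration (not SuperScript's actual parser internals), this is the kind of per-trigger work involved: every trigger authored in a .ss file ends up compiled into a RegExp for runtime matching, and that work is repeated for every trigger in every topic file.

```js
// Rough illustration only -- not SuperScript's actual parser code.
// The trigger `* (tell me|crack) a joke *` from a .ss file ends up
// compiled into the regex quoted above, and that per-trigger work
// (plus the bot-lang cleaning that precedes it) is what adds up.
const compiled = new RegExp('(?=^|\\s)\\s*(tell me|crack)(?=\\s|$)\\s*a joke');

console.log(compiled.test('please tell me a joke')); // true
console.log(compiled.test('crack a joke'));          // true
console.log(compiled.test('sing me a song'));        // false
```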

The step that takes the most time here (I think, but I haven't profiled in a while!) is the stuff that goes on in the https://github.com/bot-ai/bot-lang repo, which does some initial cleaning/normalisation. It's only tens of milliseconds per trigger, but it adds up pretty fast when you have 100+ triggers. I'm not sure how much there is to do on this front, but that is really a profiling question.

Regarding the importer, there is definitely work that could be done turning it into a bulk import to see a reasonably quick win. You'd have to be careful about conversations (need to import parent triggers before child triggers) but that definitely could be a good way to speed that step up!
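As a hedged sketch of what a bulk import could look like (the collection and field names here are assumptions for illustration, not SuperScript's verified schema), the parsed JSON could be split into arrays and written with one ordered insertMany per collection instead of one insert per item:

```js
// Sketch only: collection/field names are assumptions, not SuperScript's
// verified schema. The idea is one ordered insertMany per collection,
// with parents written before children so references resolve.
const { MongoClient } = require('mongodb');

async function bulkImport(uri, data) {
  const client = await MongoClient.connect(uri);
  const db = client.db('superscript');

  // data.topics / data.gambits / data.replies would come from the
  // parser's combined JSON output, already ordered parent-first.
  await db.collection('topics').insertMany(data.topics, { ordered: true });
  await db.collection('gambits').insertMany(data.gambits, { ordered: true });
  await db.collection('replies').insertMany(data.replies, { ordered: true });

  await client.close();
}
```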

I think the in-memory chatbot is possible, and would fit in with a design goal for v2 which is to allow swappable back-ends (e.g. Postgres, other SQL variants).
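To make the in-memory idea concrete, here is a minimal sketch of what a swappable storage adapter might look like. None of this is an existing SuperScript interface; it just shows that if the engine only talks to a small set of db operations, those can be backed by MongoDB, Postgres, or a plain in-process Map.

```js
// Hypothetical storage adapter, not an existing SuperScript API.
// Anything the engine reads/writes would go through collection(), so the
// backing store could be MongoDB, Postgres, or this in-process Map.
class MemoryStore {
  constructor() {
    this.collections = new Map();
  }

  collection(name) {
    if (!this.collections.has(name)) this.collections.set(name, new Map());
    const docs = this.collections.get(name);
    return {
      save: (id, doc) => { docs.set(id, doc); return Promise.resolve(doc); },
      findOne: id => Promise.resolve(docs.get(id) || null),
      all: () => Promise.resolve([...docs.values()]),
    };
  }
}

// Everything lives in process memory, so the bot's data is lost on restart.
const store = new MemoryStore();
store.collection('topics').save('random', { name: 'random', system: false });
```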

hailiang-wang commented 7 years ago

As for database options, I suggest using leveldb, since it is lightweight and fast and has many compatible modules built on levelup: https://github.com/Level/levelup/wiki/Modules.
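For reference, a minimal sketch of that idea using the `level` package (which bundles levelup and leveldown); the key naming scheme here is invented purely for the example:

```js
// Minimal sketch using the `level` package (levelup + leveldown).
// The key layout is made up for illustration, not part of SuperScript.
const level = require('level');

const db = level('./bot-store');

// Persist one gambit under a namespaced key, then read it back.
db.put('topic:random:gambit:1', JSON.stringify({ trigger: 'hello', reply: 'Hi there!' }), (err) => {
  if (err) return console.error(err);
  db.get('topic:random:gambit:1', (err, value) => {
    if (err) return console.error(err);
    console.log(JSON.parse(value)); // { trigger: 'hello', reply: 'Hi there!' }
  });
});
```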

Since the knowledge is already persisted in leveldb, it would be nice if the history, topics, and other data shared the same strategy. Some other open source projects, like NodeBB, use leveldb as a backend.

On another note, I think the query model could be enhanced too if we leverage Lucene to analyse topics and history.

silentrob commented 7 years ago

I think we could see some wins if we enforce one topic per file, hash the file contents, and only re-import a file when its hash has changed. This would improve the authoring experience, and we would only need to re-generate/re-order the single changed topic.
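A small sketch of that hashing idea (the cache file name and layout are made up for the example): hash each topic file and only hand the changed ones to the parser/importer.

```js
// Sketch of per-file change detection; the cache file name/format is
// invented for this example, not part of SuperScript.
const crypto = require('crypto');
const fs = require('fs');

const fileHash = path =>
  crypto.createHash('sha1').update(fs.readFileSync(path)).digest('hex');

function changedTopicFiles(paths, cachePath = '.topic-hashes.json') {
  const previous = fs.existsSync(cachePath)
    ? JSON.parse(fs.readFileSync(cachePath, 'utf8'))
    : {};
  const current = {};
  const changed = [];

  for (const p of paths) {
    current[p] = fileHash(p);
    if (current[p] !== previous[p]) changed.push(p);
  }

  fs.writeFileSync(cachePath, JSON.stringify(current, null, 2));
  return changed; // only these topics need re-parsing / re-importing
}
```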

rensdewolf commented 7 years ago

Thank you for the feedback. Hopefully I will find some time to implement and test the suggestions. I will let you know my experiences.