superscriptjs / superscript

A dialogue engine for creating chat bots
MIT License

Performance issue (parsing and db import) #362

Open rensdewolf opened 7 years ago

rensdewolf commented 7 years ago

Having successfully tested and trialled the superscript engine with smaller samples, we felt it was time to create a real "production" bot with something more to say. Our real bot comprises:

During the development and testing of the real bot, the time to parse the topic files increased (as expected). Unfortunately, the time required to import the JSON into mongodb increased even more. On average it now takes about 20 seconds to get a new/clean instance of the real bot, compared to a couple of seconds with our smaller samples. (An interesting thing I noticed is that parsing many smaller topic files is faster than parsing all topics combined in a single large file. I assume they are all parsed individually (in parallel) and then combined into a single JSON file?)

Of course this is no problem once the real bot is finished and running in production, but making changes and testing them will become an issue due to the delays involved with parsing and importing. So the question is where you feel there could be an improvement in the code and if there are any known bottlenecks that I could address. Before going through the code I wanted to understand what you think is wise and feasible:

  1. One thing I noticed is that the parser results in a combined JSON file. This file is then read and processed by the db import function. Each item is then imported individually? Would it be possible to have the parser create the import data for mongodb so it can be imported as a single batch instead?

  2. Would it be possible to have an "in-memory" chatbot instead of having to write everything to mongodb? I understand the engine requires mongodb at the moment, but would it be possible to rewrite the db interface functions to work with an in-memory instance instead? (Of course the bot and its data would be volatile and lost when the bot is stopped.)

Maybe you and others have some other suggestions that could increase performance. Happy to hear about them and discuss (or even implement).

bensalilijames commented 7 years ago

Hey @rensdewolf, thanks for the issue and getting a conversation going on performance!

I believe the current bottleneck is the normalisation step that we do on triggers when parsing .ss scripts. The triggers themselves need to be turned from SuperScript syntax, `* (tell me|crack) a joke *`, into a normal regex, `(?=^|\\s)\\s*(tell me|crack)(?=\\s|$)\\s*a joke`, for runtime matching.
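To make the kind of translation described above concrete, here is a minimal sketch. It is not SuperScript's actual parser: `triggerToRegex` and `escapeRegex` are hypothetical names, only `*` wildcards and `(a|b)` alternations are handled, and the regex it emits is simpler than the one quoted above.

```javascript
// Hypothetical, simplified trigger translator (NOT SuperScript's real parser).
// Supports only `*` (zero or more words) and `(a|b)` alternation groups.

// Escape characters that have a special meaning inside a regex.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function triggerToRegex(trigger) {
  // Tokens are: a parenthesised group, a lone `*`, or a plain word.
  const tokens = trigger.match(/\([^)]*\)|\*|[^\s()*]+/g) || [];
  let out = '^';
  let hasPrev = false;         // have we emitted a literal piece yet?
  let pendingWildcard = false; // did we just see a `*`?

  for (const tok of tokens) {
    if (tok === '*') {
      pendingWildcard = true;
      continue;
    }
    let piece;
    if (tok.startsWith('(')) {
      const alts = tok.slice(1, -1).split('|').map((a) => escapeRegex(a.trim()));
      piece = `(${alts.join('|')})`;
    } else {
      piece = escapeRegex(tok);
    }
    if (hasPrev) {
      // Pieces are separated by whitespace, optionally with extra words between.
      out += pendingWildcard ? '(?:\\s+.*)?\\s+' : '\\s+';
    } else if (pendingWildcard) {
      out += '(?:.*\\s+)?'; // leading wildcard: optional words before the match
    }
    out += piece;
    hasPrev = true;
    pendingWildcard = false;
  }
  if (pendingWildcard) {
    out += hasPrev ? '(?:\\s+.*)?' : '.*'; // trailing (or lone) wildcard
  }
  return new RegExp(out + '$', 'i');
}
```

With this sketch, `triggerToRegex('* (tell me|crack) a joke *')` matches inputs such as "please tell me a joke now" and "Crack a joke", but not "tell me a story". Even a small translator like this does several string passes per trigger, which is consistent with the cost adding up across many triggers.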

The step that takes the most time here (I think, but I haven't profiled in a while!) is the stuff that goes on in the repo, which does some initial cleaning/normalisation. It's only tens of milliseconds, but it adds up pretty fast when you have 100+ triggers. I'm not sure how much there is to do on this front; that is really a profiling question.

Regarding the importer, there is definitely work that could be done turning it into a bulk import to see a reasonably quick win. You'd have to be careful about conversations (need to import parent triggers before child triggers) but that definitely could be a good way to speed that step up!
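One way to respect the parent-before-child constraint while still importing in bulk is to group records by their depth in the conversation tree and insert one batch per depth level. The sketch below is only an illustration under assumptions: the `id` and `parentId` fields, `batchByDepth`, `importInOrder`, and the `insertBatch` callback are all hypothetical and do not reflect SuperScript's actual schema or importer. It also assumes the parent links contain no cycles.

```javascript
// Hypothetical sketch: order records so parents are always imported first.
// Assumes each record has `id` and an optional `parentId` (illustrative names),
// and that parent links form a tree or forest (no cycles).

function batchByDepth(records) {
  const byId = new Map(records.map((r) => [r.id, r]));
  const depthCache = new Map();

  const depthOf = (record) => {
    if (depthCache.has(record.id)) return depthCache.get(record.id);
    // A missing or unknown parent is treated as "no parent" (depth 0).
    const parent = record.parentId == null ? undefined : byId.get(record.parentId);
    const depth = parent ? depthOf(parent) + 1 : 0;
    depthCache.set(record.id, depth);
    return depth;
  };

  const batches = [];
  for (const record of records) {
    const depth = depthOf(record);
    (batches[depth] = batches[depth] || []).push(record);
  }
  return batches; // batches[0] = roots, batches[1] = their children, ...
}

// `insertBatch` stands in for whatever bulk write the backend offers,
// called once per depth level instead of once per record.
async function importInOrder(records, insertBatch) {
  for (const batch of batchByDepth(records)) {
    await insertBatch(batch);
  }
}
```

The number of database round trips then depends on the depth of the deepest conversation rather than on the number of triggers, which is usually a much smaller number.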

I think the in-memory chatbot is possible, and would fit in with a design goal for v2 which is to allow swappable back-ends (e.g. Postgres, other SQL variants).
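As a rough idea of what a swappable back-end could look like, here is a minimal in-memory store sketch. The `MemoryStore` class and its `insertMany`, `find`, and `clear` methods are hypothetical names chosen for illustration; they are not the engine's existing db interface. The methods are kept synchronous for brevity, whereas an interface shared with MongoDB or Postgres would most likely return promises.

```javascript
// Hypothetical in-memory back-end (illustrative interface, not SuperScript's API).
// All data lives in process memory and is lost when the process stops.
class MemoryStore {
  constructor() {
    this.collections = new Map(); // collection name -> array of documents
  }

  // Return the array backing a collection, creating it on first use.
  collection(name) {
    if (!this.collections.has(name)) this.collections.set(name, []);
    return this.collections.get(name);
  }

  // Store shallow copies so later changes by the caller do not leak in.
  insertMany(name, docs) {
    this.collection(name).push(...docs.map((doc) => ({ ...doc })));
    return docs.length;
  }

  // Return all documents for which `predicate` is truthy.
  find(name, predicate = () => true) {
    return this.collection(name).filter(predicate);
  }

  // Drop everything, e.g. before re-importing a freshly parsed bot.
  clear() {
    this.collections.clear();
  }
}
```

A usage example under the same assumptions: `store.insertMany('triggers', parsedTriggers)` followed by `store.find('triggers', (t) => t.topic === 'jokes')`. Any persistent back-end would then need to offer the same small set of operations.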

hailiang-wang commented 7 years ago

As for the database options, I suggest using leveldb, since it is lightweight and fast, also has the different up db,

As the knowledge is already persisted in leveldb, it would be nice if the history, topics, and other things shared the same strategy. Some other open source software, like NodeBB, uses leveldb as a backend.

On the other hand, I think the query model could be enhanced too if we leverage lucene to analyze topics and history.

silentrob commented 7 years ago

I think we could see some wins if we force one topic per file, hash the file contents, and only re-import when a file's hash has changed. This would help improve the authoring experience, and we would only need to re-generate/re-order the single topic that changed.
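A minimal sketch of the hashing idea, assuming one topic per file and that the hashes from the previous run are kept somewhere (for example in a small JSON file). `hashContent` and `findChangedTopics` are hypothetical helper names, and file contents are passed in as strings to keep the example self-contained.

```javascript
// Hypothetical sketch: detect which topic files changed since the last import.
const crypto = require('crypto');

// SHA-256 digest of a file's contents, as a hex string.
function hashContent(content) {
  return crypto.createHash('sha256').update(content, 'utf8').digest('hex');
}

// `files` maps file name -> file contents.
// `previousHashes` maps file name -> hash recorded on the previous run.
function findChangedTopics(files, previousHashes = {}) {
  const hashes = {};
  const changed = [];
  for (const [name, content] of Object.entries(files)) {
    hashes[name] = hashContent(content);
    if (hashes[name] !== previousHashes[name]) changed.push(name);
  }
  // Persist `hashes` after a successful import and pass it in next time.
  return { changed, hashes };
}
```

On the first run every file counts as changed; on later runs only the files in `changed` would need to be re-parsed and re-imported. Files that were deleted since the previous run are not detected by this sketch and would need separate handling.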

rensdewolf commented 7 years ago

Thank you for the feedback. Hopefully I will find some time to implement and test the suggestions. I will let you know my experiences.