Refactoring LDAModel

LDAModel.js is over 1200 lines long and continues to grow. It has tons of methods, and every method has tons of side effects. For me at least, this makes it really hard to reason about what is going on inside it. I think it would benefit hugely from some refactoring. After a year or so of trying to figure out how best to refactor LDAModel, I have some thoughts on what the new LDAModel might look like.

New class structure

My main observation is that LDAModel already maintains a data structure but doesn't properly utilize it. If we created more classes to fill out this data structure, we would end up with something that is easier to reason about, build on, test, and maintain. I have created the following diagram to broadly show how I would lay these classes out.

[Diagram: LDAModel_Refactor]

It's not quite UML (my editing software didn't have the correct symbols), but each arrow implies a has-a or has-many relationship. Note that when I say list in this diagram I do not mean wordList = []. I mean a class WordList {}. This structure means that you have to think carefully about where functions go, but it also avoids any circular dependencies and, as far as I can tell, lets you do anything you might need to do. (P.S. From what I've heard, Node.js lets you include circular dependencies, but unless you're very careful it will probably break stuff and/or slow your code down.) You will probably need to modify this structure as you plan and implement functions, but I would suggest continuing to avoid circular dependencies.

Here are some specific examples of how the structure above would work.

Iterating: LDAModel calls on DocumentList to assign new topics. DocumentList calls on all its documents to assign new topics. Each document iterates over its tokenList, using WordList to look up each word's topic distribution and StatisticalFunctions to do all the statistical calculations, and updates the topic of each token accordingly.
Topic Correlations: LDAModel gets the topic distributions of every document by calling this.documentList.topicDists() and then uses StatisticalFunctions to find any correlations.
Resetting a model: Every class has a reset function that deals with what is inside that class. LDAModel calls reset on its members, and they call it on theirs.
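
To make the three examples above concrete, here is a minimal sketch of how the calls might flow. It is not the actual jsLDA code. The class names, reset, and this.documentList.topicDists() come from the description above. Everything else (Document, assignNewTopics, countsFor, sampleIndex, topicDist, the constructor arguments) is a hypothetical placeholder, and the sampling weights are a stand-in rather than the real LDA update.

```javascript
// Sketch only: names other than the classes, reset(), and
// documentList.topicDists() are hypothetical placeholders.

// Stateless helpers, so nothing below has to point back "up" the tree.
const StatisticalFunctions = {
  // Draw an index with probability proportional to its (unnormalized) weight.
  sampleIndex(weights, rand = Math.random) {
    const total = weights.reduce((sum, w) => sum + w, 0);
    let threshold = rand() * total;
    for (let i = 0; i < weights.length; i++) {
      threshold -= weights[i];
      if (threshold < 0) return i;
    }
    return weights.length - 1;
  },
};

class WordList {
  constructor(numTopics) {
    this._numTopics = numTopics;
    this._topicCounts = new Map(); // word -> array of per-topic counts
  }
  countsFor(word) {
    if (!this._topicCounts.has(word)) {
      this._topicCounts.set(word, new Array(this._numTopics).fill(0));
    }
    return this._topicCounts.get(word);
  }
  reset() {
    this._topicCounts.clear();
  }
}

class Document {
  constructor(words, numTopics) {
    this._numTopics = numTopics;
    this._tokens = words.map((word) => ({ word, topic: -1 })); // -1 = unassigned
  }
  // WordList is passed in, so Document never needs a reference to its owner.
  assignNewTopics(wordList) {
    for (const token of this._tokens) {
      const counts = wordList.countsFor(token.word);
      if (token.topic >= 0) counts[token.topic] -= 1;
      // Placeholder weights (count + 1), NOT the real LDA sampling weights.
      const weights = counts.map((c) => c + 1);
      token.topic = StatisticalFunctions.sampleIndex(weights);
      counts[token.topic] += 1;
    }
  }
  topicDist() {
    const dist = new Array(this._numTopics).fill(0);
    for (const { topic } of this._tokens) {
      if (topic >= 0) dist[topic] += 1 / this._tokens.length;
    }
    return dist;
  }
  reset() {
    for (const token of this._tokens) token.topic = -1;
  }
}

class DocumentList {
  constructor(documents) {
    this._documents = documents;
  }
  assignNewTopics(wordList) {
    for (const doc of this._documents) doc.assignNewTopics(wordList);
  }
  topicDists() {
    return this._documents.map((doc) => doc.topicDist());
  }
  reset() {
    for (const doc of this._documents) doc.reset();
  }
}

class LDAModel {
  constructor(texts, numTopics) {
    this.wordList = new WordList(numTopics);
    this.documentList = new DocumentList(
      texts.map((text) => new Document(text.split(/\s+/), numTopics))
    );
  }
  iterate() {
    this.documentList.assignNewTopics(this.wordList);
  }
  topicDists() {
    return this.documentList.topicDists();
  }
  // Each class resets only what it owns; LDAModel just fans the call out.
  reset() {
    this.wordList.reset();
    this.documentList.reset();
  }
}
```

In this sketch every call flows downward (LDAModel to DocumentList to Document), and shared state such as the WordList is handed down as an argument, which is one way to keep the structure free of circular references.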

Other thoughts

Have a section in the code for functions that are meant to be the external interface of the class. One of the overwhelming parts of the current LDAModel is that you have to weed through internal functions to find what you're looking for. It would be really nice to avoid that. This also means continuing to use _this notation to indicate "private" class members. Also, if you use my class structure, I would suggest making all of LDAModel's member classes "private". This might seem redundant, since it leads to functions that just call a function of a member class, but some functions will be in weird spots (see topic correlations above), and putting all the external functions in LDAModel makes everything more consistent.
More functions are better than larger functions. Every single LDA implementation I have seen uses functions that are way too long and nested. I'm guilty of this too. I don't know what it is about LDA, but it just lends itself to functions with something like 5 levels of indentation. Every time I've made a function like this for LDAModel (or any class), I've regretted it later. Just don't do it. Functions like that are much harder to reason about and reuse, and if we ever try to make a testing suite, long functions are going to make it so much harder.
Legibility is extra important in this project. Something I'm realizing right now, as I am about to hand this off to a whole new set of developers for the summer, is that people are going to come and go from this project very frequently. That is the nature of a project where most of the coding is done by people who sign on for one summer and almost all of it is done by undergraduates. Legibility is always super important, but given the context of this project, it is even more important here. LDAModel.js is the most complex part of the entire project, so legibility is triply important in this class.
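
As a rough illustration of the first two points, here is one way the layout could look. This is only a sketch under my own assumptions: DocumentList is a stub, the helper names (mean, correlation, _topicColumns, topicCorrelations) are hypothetical, and Pearson correlation is just a stand-in for whatever StatisticalFunctions ends up providing.

```javascript
// Sketch only: DocumentList is a stub, and every name except
// topicDists() is a hypothetical placeholder.

const StatisticalFunctions = {
  mean(xs) {
    return xs.reduce((sum, x) => sum + x, 0) / xs.length;
  },
  // Pearson correlation of two equal-length arrays (NaN if either is constant).
  correlation(xs, ys) {
    const meanX = StatisticalFunctions.mean(xs);
    const meanY = StatisticalFunctions.mean(ys);
    let cov = 0;
    let varX = 0;
    let varY = 0;
    for (let i = 0; i < xs.length; i++) {
      cov += (xs[i] - meanX) * (ys[i] - meanY);
      varX += (xs[i] - meanX) ** 2;
      varY += (ys[i] - meanY) ** 2;
    }
    return cov / Math.sqrt(varX * varY);
  },
};

// Stub standing in for the real DocumentList member class.
class DocumentList {
  constructor(dists) {
    this._dists = dists;
  }
  topicDists() {
    return this._dists;
  }
}

class LDAModel {
  constructor(documentList) {
    // Leading underscore marks the member as "private" by convention only.
    this._documentList = documentList;
  }

  // ---------- External interface ----------
  // Everything a caller should use lives in this section.

  // Thin delegate, so callers never reach into _documentList directly.
  topicDists() {
    return this._documentList.topicDists();
  }

  // Reads as a short list of steps; each step is its own small function.
  topicCorrelations() {
    const columns = this._topicColumns(this.topicDists());
    return columns.map((a) =>
      columns.map((b) => StatisticalFunctions.correlation(a, b))
    );
  }

  // ---------- Internal helpers ----------

  // Turn per-document rows into per-topic columns.
  _topicColumns(dists) {
    const numTopics = dists.length > 0 ? dists[0].length : 0;
    return Array.from({ length: numTopics }, (_, topic) =>
      dists.map((dist) => dist[topic])
    );
  }
}
```

Someone using the class only has to read the external interface section, and each helper is small enough to test on its own.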

hmc-whisk / jsLDA

Refactor LDAModel.js #172
