probcomp / bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.
http://probcomp.csail.mit.edu/software/bayesdb
Apache License 2.0

Documentation for newcomers to the code #635

Open jar398 opened 4 years ago

I don't know what the plans are for this software, but if there is a reasonable chance that it will require bug fixes or new features in the future, I would recommend that someone experienced with the system spend maybe 1-3 hours on basic developer documentation - just enough to orient a new person to the code.

If no more work will ever be done, this issue is moot and should be closed as a 'wontfix'.

If work is needed, it is likely to be done by someone who doesn't know how the code works. This is because everyone who knows about it is now too valuable to be doing maintenance, and has moved on to more important projects. When there is work to do, a new developer will almost certainly be recruited.

The first thing would be a list of documentation sources: the CrossCat paper, the API documentation on the probcomp web site, and anything relevant in other repositories. It should also include a pointer to how to start up a Jupyter notebook so that a newcomer can read and play with the tutorial, and things like that. There is no need to reproduce all this information if it's already written down.

Next, pointers to descriptions (or the descriptions themselves) of some of the basic internal data types and concepts in the code: table, population, generator, model, variable, view, cluster, category, backend, and so on. Maybe an overview of how the SQLite tables relate to the Python classes (when they do), and of when data object references are used versus table indexes (and maybe why). Also any modularity principles that someone should know about (e.g. what the difference is between cgpm_backend.py and cgpm.py, and how you know what is supposed to go in each).
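To illustrate the kind of orientation I mean, here is a minimal sketch of how a newcomer might start exploring the SQLite side using only Python's standard `sqlite3` module. The helper name `list_tables`, the `bayesdb_` prefix, and the table names are assumptions for illustration only; the real overview should say which tables exist and which Python classes, if any, they correspond to.

```python
import sqlite3


def list_tables(conn, prefix=""):
    """Return the sorted names of the tables visible on an open connection.

    If prefix is given, only tables whose names start with it are returned.
    """
    cursor = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in cursor if name.startswith(prefix)]


if __name__ == "__main__":
    # In-memory stand-in for a real database file.
    # Both table names below are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bayesdb_population (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE my_data (x REAL)")
    print(list_tables(conn, prefix="bayesdb_"))
```

Pointing the same helper at an actual database file (via `sqlite3.connect(path)`) would give a new developer a concrete list of tables to look up in the documentation.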

Setup instructions for debugging would be useful. (For me, since I dislike Docker, this included finding the list of dependencies, pip-installing them in a virtualenv, and figuring out how to run either all the tests or just the tests I cared about.)
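As a sketch of what the "run only the tests I care about" part could look like, the snippet below builds a pytest invocation for the interpreter of the active virtualenv. It assumes the test suite runs under pytest; the helper names and the example path and keyword are hypothetical, not taken from the repository.

```python
import subprocess
import sys


def pytest_command(paths=(), keyword=None):
    """Build an argument list that runs pytest with the current interpreter.

    paths: test files or directories to run (empty means pytest's default).
    keyword: optional expression passed to pytest's -k option.
    """
    cmd = [sys.executable, "-m", "pytest"]
    if keyword:
        cmd += ["-k", keyword]
    cmd += list(paths)
    return cmd


def run_tests(paths=(), keyword=None):
    """Run the selected tests and return pytest's exit code."""
    return subprocess.call(pytest_command(paths, keyword))


if __name__ == "__main__":
    # Hypothetical selection: one test file, filtered by a keyword.
    print(pytest_command(["tests/test_example.py"], keyword="simulate"))
```

Calling `run_tests()` with no arguments would run whatever pytest discovers by default, which covers the "all tests" case.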

I'm not talking about detailed documentation on methods, parameters, result types, and that kind of thing; any programmer can get that out of the code. I'm talking about introducing someone to the concepts at a high level. When they find code that's relevant to what they're doing, and it deals in things that are not obvious (e.g. 'what is the difference between a column and a variable'), they should be able to get the answer from the documentation, which is likely to be easier than reverse engineering the code by tracing data and control flow across methods and files.

The CONTRIBUTING file might be a place to put this information, or a pointer to it, although some or all of it could go in the source files. Separate documentation should be attributed ('signed' by its author) and dated ('originally prepared on'...).

These recommendations apply to every piece of software, of course, not just this one. I am not saying I always do this myself, but I know that when I come into a project and find developer documentation I am grateful, and when I don't find it I am annoyed.