Closed conradsnicta closed 4 months ago
We need to hire someone to take care of our docs. Docs are killing every part of this library that we have spent years working on. Having bad docs can send the user away, even if the code is nice and perfect.
I hate Doxygen and have been against it for years; a better alternative is the ensmallen doc style or a gitbook style.
The issue we have is that no one has the bandwidth to handle it, and it is not easy. And yes, we should remove all the old docs.
One important fact we forget is that mlpack is header-only now. So there is no need to maintain the old versions, not even for old systems. Anyone can pull the headers and build an app, as long as it compiles easily.
I opened https://github.com/mlpack/mlpack/pull/3331 to fix references to documentation in the code, and pushed 3725ce2 to fix the issues @conradsnicta pointed out.
However, @shrit is right: our user-facing documentation is in poor shape, and parts of the API are not very discoverable. For instance, does anyone know we have a hyperparameter tuner? It's pretty cool! But a little hidden. (That said, I will argue that the documentation in our code is pretty nice, but, that's not very helpful to a user.)
We did pull Doxygen out of our documentation workflow: https://github.com/mlpack/mlpack/pull/3265; this is a really good thing because the Doxygen-generated documentation was actually uglier than the source code, even though I tried a lot to make the formatting better.
I have always been a little intimidated by mlpack's documentation problem, because it seemed like a very hard problem, especially due to the complexity of the interfaces. But while I was reading this, I thought, maybe it is possible to have an easier revamp of the documentation. The only thing that is missing is a vision for how it should be. So, a couple questions and notes for the purpose of brainstorming:
What to do about the C++ interface vs. the bindings interface? This is a big issue. We have these auto-generated bindings that wrap C++ functionality to other languages---but they're fundamentally different than the user interface in C++ itself. How do we reconcile this, and make it clear to users that in other languages, the interface is uniform, but in C++, there's much more you can do? One way might be to actually generate C++ bindings, so that e.g. you have a mlpack::knn() function just like Python's mlpack.knn(), but then the problem is that there's also the mlpack::KNN<> class, which differs from the model type you'll get back from mlpack::knn(). These inconsistencies could be problematic.
What should we do with the old versions of the documentation? I think I agree here that simply nuking them is fine. @conradsnicta is right that these have caused more trouble and confusion than utility.
What should the structure of our new documentation be? I was fairly happy with the Markdown restructuring of the user-facing tutorials and developer documentation, which you can browse here. How do we extend this to mlpack C++ classes? Do we write, say, an adaboost.md file with basic documentation about the AdaBoost<> C++ class and its template parameters? What level of detail should it be? (Aside: a long time ago, I thought that any method-specific documentation had to have significant explanation of what the algorithm itself did. That's obviously way too much work.)
How do we keep code snippets from going out of date? Maybe this is a problem we don't solve, I'm not sure. Maybe we restrict all snippets to be complete C++ programs? Given the boilerplate of C++ though, I'm not sure that's realistic.
What's the user entrypoint to the documentation? Seems like we need hooks in the README, on the website, and in the code itself.
There are some parts of the library that will be pretty easy to document in a style that is like ensmallen and its Markdown documentation. For example, the QDAFN class for furthest neighbor search has basically two methods (Train() and Search()) and the only template parameter is the matrix type. So that will be very trivial to document. On the other hand, the FFN class will be very intimidating, since it has many template parameters and we would also need to document all of the layers.
If we can come up with a basic vision of how the documentation should be, we can approach the simple classes quickly, and then if we do hire someone through GSoD or similar for the documentation, what they will need to do will be clear.
I don't have answers to the questions above, but I'll spend some time thinking about it. If anyone has any suggestions, please, I'm all ears---let's figure out some lightweight way we can make things better.
I do not see it as a very complex problem, to be honest; in fact, it is simpler than we think.
For each of the methods / bindings (C++ and Python initially, other bindings later), just provide an example of how to use the train function and the prediction function (public API), and that is it.
This will allow two things:
That is all a user is looking for; we do not have to make it too complex. A simple example with basic parameters is enough, and the user will be able to take off from there.
To structure them it is simple:
Start point
Method name A:
Method name B:
Method name C:
etc.
We can ignore examples from other bindings at this point, or keep them separated (e.g., CLI); they can be added in the future.
All tutorials should be removed because they are useless; no one has the time to spend all day trying to read what we write. Instead, go to the point directly. There is no need to explain what collaborative filtering is, because if the user is looking for it, then they know about it already.
Once the structure is established, it will be much easier to add to it. We do not have to start with complex template parameters and complex function calls, since most of them are not used at first; instead, just start with the defaults and then let the contributors add more if necessary.
I also do not think that it is hard to maintain, because the public API will not change until mlpack 5, which is far off.
To give an example, no one wants to understand what water is; instead, people want to know how to use it and what it is useful for.
Ill-formed documentation is making this hard enough to use... I downloaded this library to use for my hyperspectral medical image processing, and I am failing to get any idea how this toy works... There are too many ambiguities in the use cases of the classes...
@rcurtin I agree with @shrit here -- there is no need to make this more complicated than it has to be. Right now the documentation appears all over the place and lacks focus (ie. mixes CLI and C++ docs, jumps between mlpack.org/docs.html and pages on github).
It would be really helpful to have one main point of contact, from which all documentation can be found. This should preferably be hosted on one site (say mlpack.org), so it can be easily indexed and found by search engines. Even having a dedicated section on readthedocs.io would work.
The current situation is suboptimal, at least from the C++ perspective. It's almost as if C++ is a second class citizen (compared to the other language bindings), which is ironic given the library is primarily written in C++.
As an example, the README.md file (within the mlpack repo) points to "https://www.mlpack.org/docs.html" under the "Quick links" list. On that page, clicking on C++ gives us: "Sample C++ ML App for Windows", "File formats and loading data in mlpack", etc. Where is the link to, say, k-means? Or neural networks? These are confusingly (without mentioning C++) provided further down under the "Tutorials" section, so a C++ user is already lost at this point.
The problems then continue. Within the k-means tutorial, there is a detailed description of what a general kmeans algorithm does, which is not necessary (as per @shrit's comment, we already know that water is wet). Then the tutorial goes on about the CLI interface to kmeans, which is distracting (and arguably not useful -- see last paragraph below). After about a third of the page we finally get to a section titled "The KMeans class", which shows an example that doesn't really run (ie. "what are all those extern variables doing there?").
It would be far more digestible (ie. easier for end users) to have a very brief description of the kmeans class and then jump quickly into a working C++ example, avoiding all the detailed introductory and CLI related material. Purely as a point of contrast, Armadillo's documentation for its kmeans function is succinct: https://arma.sourceforge.net/docs.html#kmeans
Another point of contrast is ensmallen documentation, say for the ADAM optimiser: http://ensmallen.org/docs.html#adam
Lastly, I really don't see a compelling use case for the CLI "bindings"/programs. The vast majority of use cases would be that mlpack algorithms would be part of a bigger machine learning pipeline, say in C++ or Python. On the command line the "pipeline" would be a shell script, which is really poor (inefficient + limited) from an ML-pipeline perspective. This is why scripting languages exist (eg. Python), where objects (that reside in memory) can be easily passed between functions, instead of being written to and read from files all the time. The corollary is that if CLI bindings are to be complete, then low-level stuff like SVD should also be a separate program. To my understanding, the core focus of mlpack is that it is a library/provider of machine learning algorithms (primarily accessible from C++, but with reduced-functionality bindings accessible from other languages) that can be used in bigger programs. The core functionality is not the provision of CLI programs -- that can be a separate project, possibly written by someone else.
I would argue instead of bringing everything under one roof, neural networks, clustering, regression, neighbor search, etc. we would have it easier if we split mlpack into different sub-projects. If users are interested in clustering, they will not find documentation for linear regression or anything else. We use the armadillo and ensmallen documentation as an example for a better way to do documentation, but I would argue that is because those projects focus on one thing only.
If we split things up, would it be a huge issue to have handwritten documentation? (I would rather spend time figuring out how we could test the documentation than auto-generating the documentation itself.)
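As a rough illustration of what "testing the documentation" could mean, here is a minimal sketch that pulls the C++ code blocks out of a Markdown page so that a CI job could try to compile them. Everything in it is an assumption made for illustration: the function names are hypothetical, the fence is assumed to be tagged `c++` or `cpp`, and the "has a `main()`" check is only a heuristic for deciding which snippets could be compiled on their own. It is not part of mlpack's tooling.

```python
import re

# Three backticks, built at runtime so this listing can itself sit inside a fence.
FENCE_MARK = "`" * 3

# Matches a fenced block whose info string is "c++" or "cpp" (an assumed convention).
CPP_BLOCK = re.compile(
    r"^" + FENCE_MARK + r"(?:c\+\+|cpp)[ \t]*\n(.*?)^" + FENCE_MARK + r"[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_cpp_snippets(markdown_text):
    """Return the body of every C++ fenced block, in document order."""
    return [match.group(1) for match in CPP_BLOCK.finditer(markdown_text)]


def is_complete_program(snippet):
    """Heuristic only: treat a snippet as standalone if it defines main()."""
    return re.search(r"\bint\s+main\s*\(", snippet) is not None
```

A CI step could write each snippet for which `is_complete_program()` is true to a temporary file and hand it to the compiler; the partial snippets would need a wrapper or would simply be skipped.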
@zoq I am convinced that everything should be under one roof. The reason is that we already know some working examples (e.g., armadillo, ensmallen, JSON for modern C++, cpp-httplib, etc.), and most of the header-only libraries today follow this logic. This matches the developer's mindset, as they will expect to have everything in the same place. Starting to cut the pie into pieces will make it harder for anyone to follow and for us to maintain.
I would only cut it into pieces if the above method turns out to be a huge failure and people are still lost. Otherwise, we should follow the common logic.
@conradsnicta you're completely right with all the problems you point out. The reason things are this way is that we have never had a good solution for C++ documentation, and we have a mediocre solution for the binding documentation. So it ends up looking like the bindings are the "first class" of the API, and, yeah, it's not great.
What you provided was an example "user flow" through the documentation, which you pointed out is 'suboptimal' (I think that is maybe an understatement :)). But users come to mlpack for multiple reasons: they might want bindings for another language (including the command line bindings, which were featured in Fedora Magazine), or they might want C++. It should be clear to users how to get to what they want---but then also it should be clear how to transition from using bindings to using C++, and vice versa, since we advertise the ability to take code you wrote in one language and easily reposition it into another language. What I wrote in the first bullet point of my last message was a half-baked idea on how to actually provide that transition in a nice way. I have baked this idea more, but let's not consider it in this issue. I think it is somewhat orthogonal. When the idea is fully baked I'll propose it in a separate issue or discussion.
The last three bullet points I left still need to be solved though. If we choose to document all the 'user-facing' C++ classes with Markdown documentation like we do for Armadillo and ensmallen, that's a huge step forward (and I think we should do that!), but it is not clear to me yet how to organize these on a website in a way that users can navigate. As @zoq pointed out, the ensmallen and Armadillo documentation are easier to organize because their functionality is more restricted. But mlpack is general-purpose and it is a little harder to organize. Plus, we have the issue of the bindings, which neither ensmallen nor Armadillo has (the bindings are separate projects).
I guess what I am hoping for is some vision of the way the user navigates the documentation. My mental model is a tree, where the root is the homepage of the documentation. We could further add links between branches in this tree. Maybe this is not the best mental model, so I am open to other ideas. If someone has a 'tree' to propose, I'm all ears; if not, I'll try and come up with one and we can see what we think (it could be a little while until I feel inspired). Whether we eventually break mlpack up into different tools or not---or whether the documentation is all on one page or not---is actually a separate concern, because in either case prospective users still navigate through the documentation hierarchy in a similar way.
@rcurtin Short answer first: the documentation for pybind11 is a good model to adapt: https://pybind11.readthedocs.io/
Long answer. A tree structure is doable, as long as at any point in the tree we can see links to the rest of the documentation. Say someone enters one of the documentation pages based on a search result (eg. via Google). It should be trivial to see the links to the rest of the docs from each page of the docs. This can be done by a fixed menu on the left.
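To make the "tree plus fixed menu" idea concrete, here is a minimal sketch that stores the documentation as a tree and renders the same full menu for every page, marking only the page currently being viewed. The page titles, paths, and the `render_menu` name are hypothetical placeholders, not a proposal for the actual layout.

```python
# Hypothetical documentation tree: each node is (title, path, children).
DOC_TREE = ("mlpack documentation", "index.html", [
    ("C++ quickstart", "quickstart/cpp.html", []),
    ("Python quickstart", "quickstart/python.html", []),
    ("Methods", "methods/index.html", [
        ("KMeans", "methods/kmeans.html", []),
        ("AdaBoost", "methods/adaboost.html", []),
    ]),
])


def render_menu(node, current_path, depth=0):
    """Return Markdown list lines for the whole tree.

    Every page is listed; the current page is shown in bold instead of as a link.
    """
    title, path, children = node
    label = f"**{title}**" if path == current_path else f"[{title}]({path})"
    lines = ["  " * depth + "- " + label]
    for child in children:
        lines.extend(render_menu(child, current_path, depth + 1))
    return lines
```

Calling `render_menu(DOC_TREE, "methods/kmeans.html")` yields the complete menu with the KMeans entry in bold, so a reader who lands on that page from a search result still sees links to every other part of the documentation.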
I suggest that all of the documentation is placed on one site, rather than being spread out between mlpack.org and github. Github is actually pretty awful for hosting documentation (as markdown pages) due to all the extraneous chrome and a bazillion distracting links; lack of navigation is also a big problem.
With the new website deployed, I think we can (finally!) call this one solved.
@rcurtin @zoq @shrit The online documentation seems to be malformed or missing C++ docs. There are also further oddities that are very confusing to new users.
On the mlpack docs page, clicking on Binding documentation under Python Bindings naturally leads to https://mlpack.org/doc/mlpack-4.0.0/python_documentation.html
However, this page then states "The C++ interfaces of mlpack are carefully documented and doxygen is used to provide automatically-generated searchable documentation". The provided link to https://mlpack.org/doc/mlpack-4.0.0/doxygen/index.html results in "Not Found" (ie. the doxygen page doesn't exist).
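Dead internal links like this one can be caught mechanically before a site is published. The following is a minimal sketch that assumes the generated pages are available as a mapping from URL to HTML; the `example.org` URLs, the `LinkCollector` class, and the `find_dead_links` function are hypothetical and only illustrate the idea. External links are deliberately ignored.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_dead_links(pages, site_prefix):
    """pages maps page URL -> HTML text.

    Returns sorted (source, target) pairs where target lies under site_prefix
    but is not one of the known pages.
    """
    dead = []
    for source, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links against the source page and drop "#fragment".
            target = urldefrag(urljoin(source, href))[0]
            if target.startswith(site_prefix) and target not in pages:
                dead.append((source, target))
    return sorted(dead)
```

Run over the built site, a check of this kind would flag a page that still points at a removed Doxygen index.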
There are also further inconsistencies on the Python documentation page, as well as on all the binding documentation pages: in the menu on the left, all the languages are listed but not C++. This is confusing even for me.
Say someone found the page directly via Google, and now they can't find the C++ documentation for mlpack. At the very least there should be a stub page for C++, pointing the user to, for example, https://github.com/mlpack/mlpack/blob/master/doc/quickstart/cpp.md
Related to mlpack docs found via Google, folks can end up at the documentation for old versions of mlpack (eg. see issue https://github.com/mlpack/mlpack/issues/3326), and end up getting really confused. I suggest removing all old versions of the documentation and keeping only the current one. Keeping old versions leads to problems like the above and is also a maintenance nightmare. Old documentation shouldn't be visible unless old versions of mlpack are actively maintained.
(As a side note, Armadillo used to have documentation based on Doxygen very early in its life. I abandoned that very quickly after realising the Doxygen documentation is too difficult, ornate and cumbersome for typical users; it's very easy to get lost in it; even if one does find the right function, it's still difficult to grok. I think it would be very helpful for the uptake of mlpack if there was one simplified "go-to" documentation page similar to Armadillo and ensmallen.)