piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1
Improving Tutorials #1964

Open steremma opened 6 years ago

steremma commented 6 years ago

Documentation in general

In general, well-written and well-maintained documentation can be divided into four concrete elements, as explained in this talk:

  1. Reference
  2. Discussions
  3. Tutorials
  4. How to Guides

In our case, 1 is covered by the docstrings and the HTML that Sphinx builds from them. We already have a strong reference base, and several people (myself included) are working on increasing coverage.

Regarding 2, we already have Gitter, mailing lists, and issue-specific discussions on the GitHub issues and pull request pages.

There is some overlap between 3 and 4, since we use Jupyter notebooks for both and it is not clear which notebooks are tutorials and which are guides. For example, I would say the sklearn_api notebooks are tutorials because they only show the basics (holding the user by the hand), while model-specific notebooks such as word2vec IMDB/Wikipedia are more like guides because they solve a very specific problem. Perhaps we should split them into two categories (folders) named tutorials and guides.

The problem

It's extremely important that the tutorials run always (new releases do not break the notebooks) and everywhere (on every OS, Python version, distribution, etc.). This is very hard to guarantee at the moment.

Solution

We need to test the notebooks. By testing I mean making sure that all cells run without raising any error. (Can we also test for exact outputs?) Most Google results show bad solutions, but these two seem promising (although a bit hacky):
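Separately from the two linked approaches, here is a minimal sketch of what "all cells run without raising" could look like. It uses only the standard library and a plain `exec` instead of a real Jupyter kernel, so IPython magics and shell escapes are simply skipped. The function name `run_notebook_cells` and the demo notebook are hypothetical, made up for illustration.

```python
import json


def run_notebook_cells(notebook_json):
    """Execute every code cell in order; return a list of (cell_index, exception)."""
    notebook = json.loads(notebook_json)
    namespace = {"__name__": "__main__"}  # shared across cells, like one kernel session
    failures = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell["source"]
        # The notebook format stores source as one string or a list of line strings.
        if isinstance(source, list):
            source = "".join(source)
        # Plain exec cannot run IPython magics or shell escapes, so drop those lines.
        kept = [line for line in source.splitlines()
                if not line.lstrip().startswith(("%", "!"))]
        try:
            exec(compile("\n".join(kept), f"<cell {index}>", "exec"), namespace)
        except Exception as exc:  # record the failure and keep going
            failures.append((index, exc))
    return failures


# Hypothetical three-cell notebook: markdown, a passing cell, a failing cell.
demo = json.dumps({
    "nbformat": 4, "nbformat_minor": 2, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["%matplotlib inline\n", "x = 1 + 1"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["y = x / 0"]},
    ],
})
failures = run_notebook_cells(demo)
for index, exc in failures:
    print(f"cell {index} failed: {exc!r}")
```

A real check would need an actual kernel to handle magics and rich outputs; this sketch only shows the pass/fail bookkeeping.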

Advantages

Disadvantages

Alternative

As discussed with @menshikh-iv, maybe we should migrate from notebooks to Sphinx gallery. With this approach, our tutorials are plain Python scripts.
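For illustration, a tutorial written as a script might look roughly like the sketch below. It assumes the Sphinx-Gallery layout of a module docstring with a reStructuredText title, followed by comment blocks introduced by `# %%` that are rendered as text between code blocks. The corpus and variable names are hypothetical, and it uses only the standard library so that it runs anywhere; a real tutorial would call gensim instead.

```python
"""
Counting tokens in a toy corpus
===============================

A hypothetical tutorial script. The docstring becomes the page title and
introduction; each comment block after a ``# %%`` marker becomes prose.
"""

# %%
# Start with a tiny in-memory corpus (made up for this example).
from collections import Counter

documents = [
    "human machine interface",
    "graph of trees",
    "graph minors survey",
]

# %%
# Count how often each token occurs across all documents.
token_counts = Counter(token for doc in documents for token in doc.split())
print(token_counts.most_common(3))
```

Because the file is ordinary Python, it can be linted, diffed, and run directly with the interpreter like any other module.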

Advantages

Disadvantages

Extra thoughts

Could we also add visualizations to the tutorials? This is easy with both alternatives, but I am not sure we can come up with meaningful graphs.

menshikh-iv commented 6 years ago

Thanks for posting, @steremma :+1: Guys @yurkai @anotherbugmaster @CLearERR, let's discuss this point!

menshikh-iv commented 6 years ago

Some thoughts:

About 2: we almost ignore Stack Overflow, which doesn't look like a good idea (it is a very popular Q&A service, significantly more popular than our mailing lists).

About 3, 4 (general view): we didn't split them; we have many notebooks that contain:

It's all in one heap, and it's very difficult to find what you need. We don't have any kind of "index".

About the "testing" proposed by @steremma: I have already tried these solutions, and they don't work, because

  1. Most of our notebooks take several hours to run
  2. In notebooks, users import arbitrary packages (even ones that are not among our test dependencies)
  3. It is impossible to check code style correctly this way (thanks to "IPython magics" =/)
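Points 2 and 3 can be made concrete with a small static check. The sketch below is an assumption-laden illustration, not something from this thread: `audit_cell` and the allow-list `ALLOWED` are hypothetical names. It lists the magic lines in a cell (which are not valid Python, so linters choke on them) and any top-level packages imported outside the allow-list.

```python
import ast

# Hypothetical allow-list; a real one would come from the project's test requirements.
ALLOWED = {"gensim", "numpy", "scipy"}


def audit_cell(source, allowed):
    """Return the magic lines in a cell and its imports outside `allowed`."""
    lines = source.splitlines()
    # Magics and shell escapes are not valid Python, so separate them first.
    magics = [line.strip() for line in lines if line.lstrip().startswith(("%", "!"))]
    code = "\n".join(line for line in lines
                     if not line.lstrip().startswith(("%", "!")))
    imported = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            imported.add(node.module.split(".")[0])
    return {"magics": magics, "unexpected_imports": sorted(imported - allowed)}


cell = (
    "%matplotlib inline\n"
    "import numpy as np\n"
    "from sklearn.manifold import TSNE\n"
    "import gensim.models\n"
)
report = audit_cell(cell, ALLOWED)
print(report)
```

Note that standard-library modules would also be reported unless they are added to the allow-list; the sketch keeps that out for brevity.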

Also, notebooks cause many problems, such as

For this reason, @anotherbugmaster and I propose the https://github.com/sphinx-gallery/sphinx-gallery approach for "tutorials" and "how to" guides (instead of notebooks). For notebooks that demonstrate a large end-to-end example (large in terms of data size, running time, or memory consumption), we can create a new repository for notebooks only and move them there.

Also, we get some nice features "for free":

About the disadvantages of this approach mentioned by @steremma:

Requires new dependencies that have to do with plotting because this is mostly a plotting framework.

This is not a problem, because it is a dependency for building the documentation only (not a "core" dependency).

Will take more time to implement since I need to get familiar with it.

This isn't hard, really. Also, @anotherbugmaster made several examples of how to use it in #1809.

It is not entirely obvious to me how we will be testing it.

It's very simple: when you build the documentation, all of these "gallery" examples are run (this is one CLI option).
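As a rough sketch of the configuration side, the `conf.py` fragment below enables Sphinx-Gallery and asks it to stop the build when an example raises. The directory names are assumptions chosen for illustration, and which CLI option the comment above refers to is not stated, so none is shown here.

```python
# conf.py (fragment). Directory names are hypothetical.
extensions = [
    "sphinx_gallery.gen_gallery",
]

sphinx_gallery_conf = {
    # Where the tutorial scripts live (assumed path).
    "examples_dirs": "gallery",
    # Where the generated pages are written (assumed path).
    "gallery_dirs": "auto_examples",
    # By default only files whose names start with "plot" are executed;
    # this pattern asks for every .py file to be run.
    "filename_pattern": r"\.py",
    # Fail the build as soon as an example raises.
    "abort_on_example_error": True,
}
```

With a setup along these lines, a broken tutorial shows up as a failed documentation build rather than as a stale page.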

steremma commented 6 years ago

I agree with the points raised, especially the difficulty of keeping track of notebooks, as I have experienced the same issues at my workplace. The linked PR will be a useful reference for working with Sphinx gallery. I will start looking into it as soon as I complete my previously assigned tasks (probably one week from now).