documentation pedantry - Githubissues

thunder-project / thunder

scalable analysis of images and time series

Apache License 2.0

814 stars, 184 forks

Hey, I started trying this out and went through this tutorial: http://thunder-project.org/thunder/docs/install_ec2.html

There were a few dumb nuances which confused me slightly (as a total novice). I figured I'd state these explicitly in case there is any desire to clarify them in the documentation. Note: Jeremy was sitting next to me so these were resolved quickly...

1) In the opening section "Setting up an Amazon account", the file called .bash_profile may instead be .bashrc on some systems (e.g. on Ubuntu). For emphasis, the docs could state explicitly that this file is edited on the local computer, before logging in to the cluster.

2) It is mentioned when talking about setting up the ipython notebook, "If this is the first time you are logging in, you must run a script that configures iPython notebook to run in SSL mode." The nuance is that after a stop and start this needs to be done again. In retrospect this is obvious because a new cluster is instantiated, but to me as a novice this was slightly non-obvious.

3) Along with the ipython notebook details, a mention of the availability of http://.....amazonaws.com:8080 would be helpful. A trick such as 8080 is admittedly not just a thunder thing -- perhaps just a link to documentation on this would be useful. The docs could also mention that when RAM is full, objects spill to disk (or maybe this is not thunder-specific enough).

4) A presumably temporary problem is that before running the ipython notebook, one needs to run pip install jupyter. Jeremy suggested this will be solved at some point. But for novices it might be worth mentioning how pip install works on master/slave nodes. Jeremy sent me: pssh -h /root/spark-ec2/slaves 'source ~/.bash_profile && pip install bolt-python' when I needed to install bolt on the slaves. Had he not been here, the nuances of this would have been non-obvious. This could be documented for other users who need packages. If this is outside the scope of the thunder documentation, a relevant link would at least be useful.
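
To make the master/slave point concrete, here is a rough sketch of how the two install commands could be generated for any package. The helper name install_commands and its default arguments are my own invention for illustration, not part of thunder or spark-ec2; the hosts file path and the profile path are simply the ones from the command Jeremy sent me. It assumes the package name is a plain name with no shell metacharacters.

```python
import shlex


def install_commands(package,
                     hosts_file="/root/spark-ec2/slaves",
                     profile="~/.bash_profile"):
    """Return the shell commands to pip-install a package cluster-wide.

    Hypothetical helper: it only builds command strings, it does not run them.
    """
    # On the master node a plain pip install is enough.
    master_cmd = f"pip install {package}"
    # On the slaves, pssh runs the same command on every host in hosts_file;
    # the profile is sourced first so the right Python environment is on PATH.
    remote = f"source {profile} && pip install {package}"
    slaves_cmd = f"pssh -h {hosts_file} {shlex.quote(remote)}"
    return [master_cmd, slaves_cmd]


commands = install_commands("bolt-python")
for cmd in commands:
    print(cmd)
```

The second printed line reproduces the pssh command above; both lines would then be run from a shell on the master node.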

Also, some slightly advanced documentation might be useful. For example, I would benefit from a brief layout of a few things like ec2.py being the script that controls the installations on thunder -- maybe this already exists somewhere or is considered irrelevant, but a few-paragraph description of the layout of the source might make it more accessible for developers (or maybe they're all smart enough to figure it out without this).

In going through tutorials on the new documentation for Thunder 1.0.0 (docs.thunder-project.org), I noticed a few similar issues that could be confusing. They are mostly things that existed from the text of the earlier documentation that didn't get switched to match the code for the new 1.0.0 structure. Following the code is always correct and works as expected, but some of the discussion between code blocks wasn't changed appropriately. Here are a few I noticed:

Basis Tutorial, 3rd paragraph, after the line of code that loads a series fromexample('fish'): "data is a Series object, which is a generic collection of one-dimensional array data sharing a common index. We can inspect it to see its shape, dtype, and the fact that it's currently in local mode." However, data isn't defined anywhere above; the variable is called series in this example.
Registration Tutorial, Generating Data section, 2nd paragraph, after inspecting data: "There are 500 images (corresponding to 500 time points), and the data are two-dimensional, so we'll want to generate 500 random shifts in x and y. We'll use smoothing functions from scipy to make sure the drift varies slowly over time, which will be easier to look at." It states there are 500 images, though with the new example data there are only 20 images (at least in local mode).
Registration Tutorial, Generating Data section, 3rd paragraph, after the plot of dx and dy: "Now let's use these drifts to shift the data. We'll use the apply method on our data, which applies an arbitrary function to each record; in this case, the function is to shift by an amount given by the corresponding entry in our list of shifts." It says to use the apply method, though that is no longer used. A map call is used along with the shift function to shift the frames.
Registration Tutorial, Registration section, 3rd paragraph, after viewing the reference image: "Now we use the registration method reg and fit it to the shifted data, returning a fitted RegistrationModel". It references a registration method named reg, but in this tutorial the object that is fitted to the data is called algorithm.
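
For anyone unsure what "a map call along with a shift function" means in the third item, here is a plain-Python sketch of the pattern. It is not the Thunder API and not the tutorial's code: shift_frame is a helper I made up, the frames are tiny nested lists rather than images, and the shifts are whole-pixel circular shifts chosen only to keep the example small.

```python
def shift_frame(frame, dx, dy):
    """Circularly shift a 2D list by dx columns and dy rows (hypothetical helper)."""
    height = len(frame)
    width = len(frame[0])
    return [
        [frame[(r - dy) % height][(c - dx) % width] for c in range(width)]
        for r in range(height)
    ]


# Three identical 2x3 "frames" and one (dx, dy) shift per frame.
frames = [[[1, 2, 3], [4, 5, 6]] for _ in range(3)]
shifts = [(0, 0), (1, 0), (0, 1)]

# Pair each frame with its own shift, then map the shift function over the pairs.
shifted = list(map(lambda pair: shift_frame(pair[0], *pair[1]), zip(frames, shifts)))

for frame in shifted:
    print(frame)
```

The point is only that each record gets its own entry from the list of shifts, which is the idea the paragraph describes regardless of whether the method is named apply or map.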

Again, following the code is clear enough to understand what is happening, but it might be confusing for some individuals coming from matlab and not as familiar with Python. I noticed some of these things didn't change and recognized the old versions from the previous tutorials, so I thought I'd document it here for the next time somebody updates the documentation.

documentation pedantry #233