mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

documentation #37

Open pavlis opened 4 years ago

pavlis commented 4 years ago

It is way past time to make a decision about how we are going to do the documentation for mspass. There are at least three components we need to address sooner rather than later.

  1. How to document the C++ API? This is the most important component, but also the one that is essentially solved except for how the pages get autogenerated. The unambiguous standard is doxygen, particularly since all the code I've developed so far uses doxygen syntax to document the API.
  2. How to document the Python API? I have no experience with this issue. I've seen some hints on the web, but you (Ian) need to make a decision on this, and we need to start generating documentation similar to ObsPy's soon. The mix of C++ and pure Python code likely complicates this issue.
  3. We need to start a user manual that documents design concepts as we finalize them. Otherwise an old guy like me will forget them, or we will both get distracted with something else and it won't get written down. I think we should start by writing an outline in this issues section. The outline can provide a skeleton to be fleshed out as this evolves. The main decision point here is what format and approach we should use to create this manual. I don't think the wiki on this github site is appropriate. I would suggest simple html pages under version control in this github site.

Let me know what you think.

wangyinz commented 4 years ago

I think we had a conversation about the Python documentation thing before, but I could not find a record in this repo. I guess that must have been in a telecon. Anyway, I think the guides from ObsPy are pretty good, and we should follow their practice, especially the coding style guide that shows how they use docstrings to document the API. Not sure how to handle the mix of C++ and Python yet, but I do believe there has to be a clean way since you already have the docstrings in the Python bindings.
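For the record, ObsPy's style guide follows the reST docstring convention that Sphinx renders directly. A minimal sketch of what that convention looks like (the function and its fields are hypothetical, not part of mspasspy):

    def scale(d, factor=1.0):
        """
        Scale the sample data of a waveform-like object by a constant.

        :param d: object with a ``data`` sequence of samples (hypothetical)
        :param factor: multiplier applied to every sample
        :type factor: float
        :return: the same object with its samples scaled
        """
        d.data = [x * factor for x in d.data]
        return d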

For the user manual, I do think we need something better, but I have not dug into that yet. I know a lot of projects use Read the Docs, which could be the one to go with. Also, you might not be aware that GitHub's wiki is actually under version control. You can see the commit history and even clone it locally with git clone https://github.com/wangyinz/mspass.wiki.git.

pavlis commented 4 years ago

VERY useful sources there. I really like this sermon that I found as a link in one of those pages. The author is right on about this issue. If you haven't seen it, you must read it.

A few followups, I think:

  1. I need to complete all the docstring sections in the mspasspy wrappers and experiment with Sphinx, which I believe is what ObsPy uses. What I don't see, which is perhaps what you meant by "Not sure how to handle the mix of C++ and Python yet", is how to extract the docstring data in the wrapper file to build consistent Python pages. I strongly suspect something exists to handle this that we can find with a web search. We'll see. I hope we don't have to do a custom development - we have more important problems to solve.
  2. We need to start with an outline for a user's manual and a tutorial. We need to start the first immediately to document key concepts we are building this package upon. The second should probably be postponed and made in the form of a Python notebook. Might we consider developing it in combination with one of our initial workflows?
  3. I have no opinion about where and how we maintain this documentation. I'll go with what you decide on that issue. My point about GitHub's wiki, which may be wrong, is that the wiki doesn't seem to support the range of document types we will likely need for documentation. To be specific, algorithms developed by us or others will almost certainly need a document format that allows both equations and bibliographic references. One needs at least pdf support, and likely a whole list of other formats (e.g., tex source files).
pavlis commented 4 years ago

I have produced a prototype home page that is only a raw table of contents for the package documentation. I'm going to check this into the master branch under a new directory docs/html. The file is index.html. If you think that is a bad file organization, change it as you see fit and handle such changes with git. If you have any suggestions on organization or content in my prototype, I request that we discuss them here before making changes, to preserve the history of our thinking on this matter.

pavlis commented 4 years ago

This could have been in the previous comment, but it is a slightly different issue. I suggest we do all the documentation in a web-oriented form like html. At one point I thought we might want some pdfs to handle technically oriented documentation of algorithms that require a lot of equations. We probably can still do that, but I had forgotten until this morning about the existence of things like MathJax (I think that is the name), which allows embedding TeX descriptions of equations in html documents.

Are we agreed that we should plan on html as the core format for documentation?

wangyinz commented 4 years ago

Yes, I think html should be all we need. It seems we could just host our documentation on readthedocs. The only issue is to figure out how to put doxygen-generated pages there, as that site is designed mainly for Python projects.

pavlis commented 4 years ago

If you think readthedocs is a good choice, we could just serve the doxygen pages from Indiana or Texas and have a link in the appropriate pages. I suggest IU may be better: as an emeritus faculty member I will have an account there until I die, which we hope isn't too soon.

wangyinz commented 4 years ago

Found this that shows it might not be that hard to have doxygen on readthedocs.

btw, do you have a good reference for doxygen? I have never compiled anything with it before, and I need to learn all of this first to build our own documentation site.
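For what it is worth, the common pattern I have seen (a sketch of one possible approach, not necessarily what the page linked above does) is to have Sphinx's conf.py invoke doxygen when the build runs on Read the Docs. The Doxyfile name and the cxx/ path below are assumptions:

    # sketch for docs/source/conf.py; the cxx/ path and Doxyfile name are assumptions
    import os
    import subprocess

    if os.environ.get("READTHEDOCS") == "True":
        # Read the Docs sets this variable; generate the C++ API pages
        # before Sphinx assembles the rest of the site.
        subprocess.run(["doxygen", "Doxyfile"], cwd="../../cxx", check=True)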

pavlis commented 4 years ago

That looks pretty easy. The only issue I'd see with that is that it will require doxygen to be installed for the cmake build. That build is already getting pretty long, although I don't think doxygen is that huge. I also have no idea if it can be autoinstalled. Worst case, we'd have to have the install documentation say the user needs to install doxygen if they want a private copy of the C++ api pages.

wangyinz commented 4 years ago

I don't think we should include doxygen as an autoinstalled package; instead we could add a make doc target when doxygen is found on the system. Probably something like this would work.

pavlis commented 4 years ago

I like what you did to build the documentation on GitHub automatically as we do updates. It's a spectacularly useful way to make sure the documentation stays current with the code. Brilliant.

The subject here, though, is adding the documentation for the schema. I found some examples online for different ways to build tables with rst. This one looks the most promising to me. It would be very easy to create a csv file from the mspass.yaml file (well, actually, creating it from MetadataDefinitions in Python is how I'd do it - see the sketch after the notes below). I thought of this when I was perusing the new documentation and remembered we needed a way to document the schema. This fits perfectly with the dynamic update model, as adding a new attribute to the mspass.yaml file would (ideally) cause it to appear in the table(s) created by that mechanism.

Note also:

  1. The csv files would be useful to people anyway
  2. The documentation could and should show the attributes in multiple tables. Besides the obvious one with everything sorted alphabetically by name, there are some group tables that are not just useful but essential. That includes: site collection, source collection, obspy required, ccore required, elog collection, history collection, and probably others.
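To make the csv idea concrete, here is a rough sketch of generating one table directly from the yaml file. The structure assumed for mspass.yaml (attribute name mapped to a dict with "type" and "concept" keys) is a guess; the real script would go through MetadataDefinitions as noted above.

    import csv
    import yaml  # PyYAML

    # Assumed layout of mspass.yaml: { attribute_name: {type: ..., concept: ...}, ... }
    with open("mspass.yaml") as f:
        schema = yaml.safe_load(f)

    with open("all.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "type", "concept"])
        for name, attr in sorted(schema.items()):
            writer.writerow([name, attr.get("type", ""), attr.get("concept", "")])
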
pavlis commented 4 years ago

I started to seriously explore Jupyter notebooks the past few days. It seems to be the unambiguous solution for creating tutorials for mspass. Do you concur? If so, I think I will start creating one for running the deconvolution code with the test python codes I was writing last week. "Kills two birds with one stone" as the saying goes: it provides a test program and a tutorial all in one.

I'd like to start a dialogue here on what tutorials should be developed and if jupyter is the right medium. Look forward to your response.

wangyinz commented 4 years ago

Yes, we should definitely go with Jupyter. Practically, the tutorials would be better put in a different repository, as they are neither source code nor documentation.

Another issue is that we will want to have Spark included in the tutorial. Although we could still use Jupyter for that, the setup will be different and the code won't work properly in a common Jupyter setup. Currently, I think the solution is to have Jupyter in the container, but that might unnecessarily inflate the image. Maybe we should release two different container images down the road: one with only the core components and one with everything including Jupyter.

pavlis commented 4 years ago

I'm not sure it would be wise to split up the repository for documentation. Jupyter isn't that large a package and should be a marginal add-on to an already large container. Further, I found this site that argues it is good to put notebooks in docker containers. A good tutorial will need a complete setup to be effective, which is why having it run under docker would be helpful.

In any case, we concur that jupyter should be the way we structure tutorials.

pavlis commented 4 years ago

Over the past several days I've had time in the morning, thanks to the time zone skew, to work on the documentation pages for the MongoDB schema. I wrote a small Python program that builds a series of csv files that can be used to build pretty tables, as noted in an earlier section of this issue. There is then a master rst file that uses a "files" directive to read the set of csv files and build a readable document. To be specific, the current set of csv and rst files is the following:

  1. MsPASS_Schema.rst - the rst document
  2. The following set of csv files that the rst document references:
    3Cdata.csv
    MongoDB.csv
    aliases.csv
    all.csv
    files.csv
    obspy_trace.csv
    phase.csv
    site.csv
    sitechan.csv
    source.csv

    The csv files are intended to be automatically generated by running the python program in the same directory. The name of the program is irrelevant at this point. There are multiple files because all.csv lists all the attributes while the others are used to build smaller tables that have a logical or required relationship.

The issue this brings up is that I have no idea how to use this to provide automatic updates of the documentation when we update the schema definitions. Eventually this should stabilize, but for the near term the set of attributes that define the schema is likely to change a lot. It may be appropriate to just say we'll manually update the csv files whenever the mspass.yaml file is modified. However, the text of the rst file will change much more slowly, I suspect, than the tables. At a minimum we will need a way for the csv files and a python script to live in harmony with the rst files in the documentation source directory.

Wait for me to check this in if that is too confusing. I am writing this from the Phoenix airport and should be home later this afternoon.

wangyinz commented 4 years ago

I still have not seen the actual files, so I am probably not understanding this correctly. I think all we need is a Python script to parse the mspass.yaml file and generate a number of rst files to be used to generate the documentation. If you already have a Python script that can do a similar conversion, then we should be pretty close to making it completely automatic. I should be able to add that into our current sphinx setup.
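One hedged sketch of what hooking this into the sphinx setup could look like: conf.py can register a callback on Sphinx's builder-inited event so the csv tables are regenerated at the start of every build (the script name and location below are assumptions):

    # possible addition to docs/source/conf.py; the script path is an assumption
    import subprocess

    def regenerate_schema_tables(app):
        # Rebuild the csv tables from mspass.yaml at the start of every Sphinx
        # build so the generated tables cannot drift from the schema definition.
        subprocess.run(["python", "build_metadata_tbls.py"], cwd=app.srcdir, check=True)

    def setup(app):
        app.connect("builder-inited", regenerate_schema_tables)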

pavlis commented 4 years ago

Yeah, that would have been impossible without the material I just checked into master. Here is the procedure I've been doing manually. We need to either put this somewhere so that we can repeat it with a few commands when we need to make a schema change, or automate it. I suggest you first make sure you can repeat this procedure in a scratch area before trying to figure out how to automate it.

  1. For testing, create a scratch directory.
  2. Copy these files to that scratch directory (all paths are relative to the root directory of mspass): python/bin/build_metadata_tbls.py, python/bin/build_metadata_tbls.pf, ~/docs/source/MsPASS_Schema.rst. The first is the python program that will be run. The second is a parameter file used to construct the secondary group tables. The last is the rst file that builds a readable document.
  3. Because we don't have it automated, copy data/yaml/mspass.yaml to $MSPASS_HOME/data/yaml.
  4. Make sure MSPASS_HOME is defined.
  5. Run the python script: python build_metadata_tbls.py. You should see it generate a set of csv files. These are hard coded into the rst file with lines you will see if you poke around that file.
  6. Run something to convert the rst file to html. I have been using the low-level rst2html that is part of docutils, but you will get something prettier if you run it through Sphinx.
wangyinz commented 4 years ago

I am still in the middle of making all of this run with sphinx. One issue I realized while doing it is that $MSPASS_HOME is not included in our Python setup. I am thinking of creating some kind of hard-coded default alternative in the code so that we don't need to worry about the env variables within Python being messed up somehow. It is probably not something to worry about for now, so I copied that file into the mspasspy package. I still need to figure out a robust way to define $MSPASS_HOME. Anyway, I will get all of this resolved eventually...
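A minimal sketch of the kind of hard-coded fallback I have in mind (the default path here is purely an assumption, e.g. wherever the container image installs the data files):

    import os

    # Prefer the environment variable, but fall back to a path that is known
    # to exist inside our container image (the default below is an assumption).
    mspass_home = os.environ.get("MSPASS_HOME", "/usr/local/share/mspass")
    schema_file = os.path.join(mspass_home, "data", "yaml", "mspass.yaml")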

pavlis commented 4 years ago

I thought about putting the topic of this comment in a new issue, but decided it mostly fits under the topic of documentation. The problem is that it could equally be put in a discussion of test programs, but the tougher issue is documentation, so I'll put it here.

The problem I want to discuss comes from testing the new graphics module I've been developing. The only way I know to test graphics code is to make it draw something and visually verify that it worked. Graphics-generating test programs are nearly guaranteed to break Travis, or so I suspect. What I propose to do is enhance the test program I've been using a bit and make it a jupyter notebook tutorial on mspass graphics. The notebook provides a convenient way for one of us to verify the graphics module is working correctly and at the same time builds a valuable tutorial. Do you concur, or do you have some other mechanism to test graphical code? Independent of testing, I do think a jupyter tutorial on the graphics module is an important addition. There is no better way to get most scientists hooked than to have a simple graphics system where they can get a pretty picture quickly.

This brings up a couple issues related to documentation.

  1. I think most of our tutorial material should be aimed toward jupyter notebooks. I presume the best place to put the notebook files is in docs/tutorials like the one we have now for metadata. I wonder, however, if jupyter files shouldn't be in a new directory under the tutorial directory? You are the keeper of the build system, so I don't know if that would help or add unnecessary complexity.
  2. Perhaps the biggest point of this comment is how to assemble ancillary functions used only in tutorials. That is, for testing both the deconvolution and now the graphics code I had to create a bunch of small, special functions utilized by the test program. They are needed for those programs, but are not suitable for mspass processing. A crude way to handle this is to paste these functions into a notebook that uses them and tell the reader to run that set of code first. The trouble with that is that some readers could easily dive into the code without reading the instruction and get lost in the forest of auxiliary code that is a side issue for the educational point of the tutorial. The solution, I think, is to reduce the setup to imports, i.e. the first block of code for the notebook would give the instruction to run a small block of code that would look something like this:
    import matplotlib.pyplot
    import numpy as np
    import tutorials

    where tutorials is a python module containing the not-for-public-consumption ancillary code needed to drive the tutorials.

First, do you concur that putting this kind of stuff in a special module(s) is the way to go with this?

If you concur there are two issues I see: (1) where to put this module and (2) how to set it up so its components don't get posted in the documentation?

I think the solution is simplified by adopting the habit of only running jupyter tutorials from the docker image. That way we can put tutorial.py (or whatever we call it) in a common place outside the mspass tree, and the notebooks can reference it and be assured it will be found. If you concur this is the right model, maybe you can judge better than I how to structure the tutorial area with that in mind. I'm going to look into running jupyter from docker - I know it is a standard approach, as previous web searches yielded long lists of how-tos on the subject.
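For concreteness, the reference from a notebook could be as small as this first cell; the path used here is purely an assumption about where the docker image would put the shared module:

    import sys

    # Path is an assumption; wherever the docker image installs the shared
    # tutorial support module, the notebook only needs it on sys.path.
    sys.path.append("/opt/mspass_tutorial")
    import tutorials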

pavlis commented 4 years ago

Followup: found this useful page on jupyter and docker

wangyinz commented 4 years ago

Well, I think you are getting at exactly the reason why people always host a separate repo for tutorials - there can be a lot of unrelated code that only makes sense for the tutorials. At the end of the day, documents and tutorials are two distinctly different things, and we probably shouldn't put them together just for convenience. Since you are preparing the tutorials now, maybe it is time for us to open up that new repo. Yes, and I think we can use docker with Jupyter in that new repo.

wangyinz commented 4 years ago

There we go: https://github.com/wangyinz/mspass_tutorial

wangyinz commented 4 years ago

btw, for testing graphics, I think you are correct that nothing works better than a human eye. I don't think there is a good way to test the visual correctness of the plot itself. As discussed here, we probably could use the method in the accepted answer there to partially test it.
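I have not copied the linked answer here, but the general idea of partially testing a plot without human eyes is to assert on properties of the artists the plotting call returns. A sketch under that assumption (the plotting call below is a stand-in, not our graphics module):

    import matplotlib
    matplotlib.use("Agg")  # headless backend so the test can run under CI
    import matplotlib.pyplot as plt
    import numpy as np

    def test_simple_plot():
        # Stand-in for a call into our graphics module; just plot a sine wave.
        fig, ax = plt.subplots()
        t = np.linspace(0.0, 1.0, 100)
        ax.plot(t, np.sin(2.0 * np.pi * t))
        # Things we can check without looking: one line drawn, with 100 points.
        assert len(ax.lines) == 1
        assert len(ax.lines[0].get_xdata()) == 100
        plt.close(fig)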

JiaoMaWHU commented 4 years ago

Hey folks, it seems I missed a lot of discussion here... What is this tutorial used for? For future mspass users?

wangyinz commented 4 years ago

Yes, it will be for future users. I think you can mostly ignore this part for now. Tutorials are different from documentation, and you only need to write the latter for the code you develop.