naobservatory / mgs-workflow

Adding a description of the pipeline to the repo #5

Open mikemc opened 1 month ago

mikemc commented 1 month ago

For understanding the pipeline, it would be helpful to have an up-to-date description in the README or in the Wiki; even a very barebones one to start would be great. Right now the main thing I know of is the flow diagram in the Feb 4 post. Is this based on an SVG file somewhere? Perhaps you could check that into a docs/ folder, keep it up to date, and include it in the README?

Beyond that, it would be handy to have a section of the README with a subsection for each step that describes methodologically how that step works and points to the relevant configuration files where one could find the commands/parameters.

Ideally these would be updated when changes are made.

I'm happy to contribute, but would want to get pointed to a barebones up-to-date description to make sure I'm on the right track; even an up-to-date flow diagram is probably enough for me to start drafting this.

mikemc commented 1 month ago

I've now spent enough time looking at the repo that I can figure out what steps are implemented in the primary workflow by reading main.nf, so I no longer personally need a barebones description or updated flow diagram to figure the pipeline out.

A good basic set of documentation could have one subsection for each sub-workflow, in the order in which they are called in the primary workflow in main.nf. Each subsection could give a few-sentence description of that sub-workflow and link to the line in main.nf where that workflow is defined, for more info.
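
As a rough sketch of how such a skeleton could be generated rather than written by hand, the Python below scans a Nextflow file for named workflow definitions and emits one README subsection per workflow with a GitHub-style `#L<n>` line link. Everything here is an assumption for illustration: the sample source and its workflow names are made up (not taken from the real main.nf), the function names are hypothetical, and sections come out in definition order rather than call order, which is a simplification.

```python
import re

# Hypothetical stand-in for main.nf; workflow names are invented for illustration.
SAMPLE_MAIN_NF = '''\
workflow RAW {
    take:
    reads
    main:
    reads.view()
}

workflow CLEAN {
    take:
    reads
    main:
    reads.view()
}

workflow {
    RAW(Channel.of("a"))
    CLEAN(Channel.of("a"))
}
'''

# Matches "workflow NAME {" but not the unnamed entry workflow "workflow {".
NAMED_WORKFLOW = re.compile(r"^\s*workflow\s+(\w+)\s*\{")


def workflow_index(source):
    """Return (name, line_number) for each named workflow, in file order."""
    index = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = NAMED_WORKFLOW.match(line)
        if match:
            index.append((match.group(1), number))
    return index


def readme_skeleton(index, path="main.nf"):
    """Render one markdown subsection per workflow, linking to its definition line."""
    sections = []
    for name, number in index:
        sections.append(
            f"### {name}\n\n"
            "(Description to be written.)\n\n"
            f"Defined at [{path}, line {number}]({path}#L{number}).\n"
        )
    return "\n".join(sections)


if __name__ == "__main__":
    print(readme_skeleton(workflow_index(SAMPLE_MAIN_NF)))
```

Because the line numbers are recomputed on every run, the links would not go stale when main.nf changes; only the descriptions would still need a human to write and maintain.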

I'm interested in ways we could incorporate documentation directly into the Nextflow source code as comments that can be postprocessed by a script to make nice documentation; I think this would streamline the process of keeping docs up to date. My inspiration comes from Roxygen2 in R and doc comments in Rust. The idea is that each process and workflow definition would have documentation in a comment above the associated process or workflow definition, with a special symbol to mark the comment as a documentation comment. (E.g., in Rust, lines that start with "///", instead of the basic comment indicator "//", are considered doc comments.)

Nextflow does not have something like this already supported, but implementing a basic version of doc comments similar to those in Rust would be easy enough.

  1. Above process and workflow definitions, add a description of the function in GitHub markdown format, with each line starting with "///". We might also agree on some useful formatting rules, such as that the first line should give a one-line summary of the function.
  2. A script looks for doc comments and associates them with the function defined in the line below them. It outputs markdown files in the repo's docs/ folder, which are nicely formatted when viewed in the browser on GitHub.
  3. If we end up splitting out sub-workflows into their own .nf files, we could create one markdown file for each workflow, and that file would show the docs for that workflow and the processes defined in its associated file.
  4. We could use doc comments to autogenerate the overall pipeline documentation by having some keyword to indicate what portion of the doc string should be used.
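
A minimal sketch of steps 1 and 2, assuming the "///" marker proposed above and a doc comment that sits directly above a `process NAME {` or `workflow NAME {` line. Nextflow itself has no such feature; the sample source, the process and workflow names, and the helper functions `extract_docs` and `to_markdown` are all hypothetical and only illustrate the idea.

```python
import re

# Hypothetical Nextflow source; process and workflow names are made up.
SAMPLE_NF = '''\
/// Trim adapters from raw reads.
///
/// Input: paired FASTQ files. Output: trimmed FASTQ files.
process TRIM_READS {
    script:
    "echo trim"
}

// An ordinary comment, ignored by the extractor.
process UNDOCUMENTED {
    script:
    "echo nothing"
}

/// Run all preprocessing steps in order.
workflow PREPROCESS {
    TRIM_READS()
}
'''

# Matches the opening line of a named process or workflow definition.
DEFINITION = re.compile(r"^\s*(process|workflow)\s+(\w+)\s*\{")


def extract_docs(source):
    """Return (kind, name, doc_lines) for each definition preceded by doc comments."""
    docs = []
    pending = []  # doc-comment lines collected since the last non-doc line
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("///"):
            # Drop the marker and at most one following space.
            text = stripped[3:]
            pending.append(text[1:] if text.startswith(" ") else text)
            continue
        match = DEFINITION.match(line)
        if match and pending:
            docs.append((match.group(1), match.group(2), pending))
        pending = []  # any other line (even a blank one) breaks the association
    return docs


def to_markdown(docs):
    """Render the extracted docs as one markdown section per definition."""
    parts = []
    for kind, name, lines in docs:
        parts.append(f"## {kind} `{name}`\n\n" + "\n".join(lines).strip() + "\n")
    return "\n".join(parts)


if __name__ == "__main__":
    print(to_markdown(extract_docs(SAMPLE_NF)))
```

Writing the returned markdown to a file under docs/ would cover the rest of step 2. Steps 3 and 4 (one file per workflow, and a keyword that selects text for the pipeline overview) would need conventions this sketch does not attempt.
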

jeffkaufman commented 1 month ago

> I think this would streamline the process of keeping docs up to date

At least for now, when the number of people interacting with the code is low and they're all quite close to the code, I'm not convinced we need docs separate from the code? Instead, we could keep documentation comments in-line with the code.

mikemc commented 1 month ago

> At least for now, when the number of people interacting with the code is low and they're all quite close to the code,

The above doesn't seem true to me (I think @willbradshaw is the only one who has a good sense of what the workflows and processes are doing and how they fit together, but @simonleandergrimm and I are using the pipeline output and may potentially be modifying/extending it), but I more or less agree with

> I'm not convinced we need docs separate from the code? Instead, we could keep documentation comments in-line with the code.

This is essentially step 1 of my suggestion above, without worrying about the formatting of the doc strings. Currently, processes and workflows have no documentation. I suggest starting by adding documentation for each in a comment above its definition (rather than by documenting things in the README or another external file), covering things that will help me or a person like me understand the code: a basic description of what the process/workflow does, what its key inputs and outputs are, and where to find any scripts or config files it depends on.

I'd still be interested in helping with steps 2 and 4 later, both because this is the sort of thing that I (and therefore users similar to me) find helpful and because I find it personally satisfying to have nice docs, but I agree it isn't essential.

I think @willbradshaw is best suited to write the initial function docs (even just a sentence or two for each function would be a useful starting point), but it's also something I can write piecemeal as I dig into specific pieces, then have Will verify.