cytograph needs some basic documentation

gioelelm commented 7 years ago

cytograph is heavily underdocumented. This has recently been a problem both for @kasiletti, who is trying to get started with the library, and for me, coming back to code I wrote a while ago.

There is no easy way to visualize the task tree (and no, the built-in graph viewer is not an option; it is only meant to work for running tasks). Docstrings are short and not very descriptive. We also really need documentation of the options that can be passed to the command line interface, and of which tasks each option affects.

With this issue I suggest that we join efforts to document cytograph a bit while we work with it over the next months. It does not have to be production-ready, just a documentation layer that spares us the work of reading the source over and over again! @kasiletti and I could add new docs/docstrings while we modify/read the code. @slinnarsson, since you understand the luigi internals better, maybe you could help with some ideas on how to build a task tree inspector/visualizer.

Regarding how to build the docs, I did not find any Sphinx integration or automatic documentation tools for luigi (i.e. nothing like this), so it looks like we will have to write the CLI part of the docs manually. The API of each task can be extracted automatically, and that is probably the best place to document what each task is doing. I suggest that we start directly with the reStructuredText format, which would allow us to build neater HTML documentation with Sphinx later on.

gioelelm commented 7 years ago

Regarding drawing a graph: using a simple loop and Graphviz might be the way. As a reference, see the first cells of this notebook: https://blog.jakuba.net/2017/05/30/tensorflow-visualization.html

slinnarsson commented 7 years ago

Agree that we should try to document the code better. Also, we need to agree on which parts are project-specific and which parts are intended to be generic.

Drawing a generic tree of task dependencies (independent of a particular running task graph) is not so easy, since requirements are generated programmatically, and can depend on parameters. Furthermore, in order to list the dependencies of a task, you need to make an instance of it, but some tasks have required arguments. In those cases, you would need to figure out what argument to pass, since there is no default. We also have dependencies that are generated from files (e.g. the list of samples).
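A minimal sketch of the "simple loop" idea, under the constraint described above: it starts from an already-constructed task instance, so it sidesteps rather than solves the required-argument problem. It only assumes that each task has a requires() method returning nothing, a single task, a list, or a dict of tasks, and optionally a task_id attribute. The function names and the three stand-in classes are illustrative and are not part of cytograph or Luigi.

```python
from typing import Any, List, Set, Tuple


def flatten_requirements(reqs: Any) -> List[Any]:
    """Flatten the return value of requires(): None, one task, a list/tuple/set, or a dict."""
    if reqs is None:
        return []
    if isinstance(reqs, dict):
        reqs = list(reqs.values())
    if isinstance(reqs, (list, tuple, set)):
        flat: List[Any] = []
        for item in reqs:
            flat.extend(flatten_requirements(item))
        return flat
    return [reqs]


def task_graph_to_dot(root: Any) -> str:
    """Walk the dependencies of an instantiated task and return Graphviz dot text.

    Edges point from the required task to the task that requires it.
    """
    edges: Set[Tuple[str, str]] = set()
    seen: Set[Any] = set()
    stack = [root]
    while stack:
        task = stack.pop()
        # Skip tasks already visited when a unique task_id is available.
        key = getattr(task, "task_id", None)
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        for dep in flatten_requirements(task.requires()):
            edges.add((type(dep).__name__, type(task).__name__))
            stack.append(dep)
    lines = ["digraph G {"]
    lines += [f'"{src}" -> "{dst}";' for src, dst in sorted(edges)]
    lines.append("}")
    return "\n".join(lines)


# Stand-in classes (NOT the real cytograph tasks) so the sketch runs on its own.
class ClusterLayoutDev:
    def requires(self):
        return []


class TrinarizeDev:
    def requires(self):
        return ClusterLayoutDev()


class AutoAnnotateDev:
    def requires(self):
        return [TrinarizeDev()]


if __name__ == "__main__":
    print(task_graph_to_dot(AutoAnnotateDev()))
```

Nodes are labelled by class name only, so two instances of the same task class with different parameters collapse into one node. Dependencies that are generated from files would still need those files to be present when requires() is called.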

Anyway, for now I added an option to the task-specific logger that will generate a dependency graph for that task, in Graphviz-compatible dot format. You can dump the current task graph to the task-specific log by passing True when creating the logger in your task's run() method:

def run(self) -> None:
    # The second argument enables dumping the task graph to the task-specific log
    logging = cg.logging(self, True)
    # ... rest of the code ...

This will log something like this:

digraph G {
"TrinarizeDev" -> "AutoAnnotateDev";
"SplitAndPoolAa" -> "ClusterLayoutDev";
"ClusterLayoutDev" -> "ExpressionAverageLineage";
"ClusterLayoutDev" -> "ExpressionAverageTime";
"PlotClassesLineage" -> "LineageAnalysis";
"PlotCVMeanLineage" -> "LineageAnalysis";
"MarkerEnrichmentLineage" -> "LineageAnalysis";
"PlotGraphDev" -> "LineageAnalysis";
"ExpressionAverageLineage" -> "LineageAnalysis";
"PlotGraphAgeLineage" -> "LineageAnalysis";
"ClusterLayoutDev" -> "MarkerEnrichmentLineage";
"ClusterLayoutDev" -> "PlotCVMeanLineage";
"ClusterLayoutDev" -> "PlotClassesLineage";
"AutoAnnotateDev" -> "PlotGraphAgeLineage";
"ClusterLayoutDev" -> "PlotGraphAgeLineage";
"AutoAnnotateDev" -> "PlotGraphDev";
"ClusterLayoutDev" -> "PlotGraphDev";
"ExpressionAverageTime" -> "TimeAnalysis";
"ClusterLayoutDev" -> "TrinarizeDev";
}

You can visualize this using Graphviz. Write the graph specification to a file (e.g. test.graph), then do:

dot -Tpdf -o test.pdf test.graph

The output will be something like:

[image: rendered task graph]
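The copy-and-render step can also be scripted. The following is a rough sketch, assuming the digraph block appears in the log verbatim (no per-line logger prefix inside the block) and that the Graphviz dot executable is on the PATH. The helper names extract_digraph and render_pdf are made up for illustration.

```python
import re
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def extract_digraph(log_text: str) -> Optional[str]:
    """Return the first 'digraph G { ... }' block found in log_text, or None."""
    match = re.search(r"digraph G \{.*?\}", log_text, flags=re.DOTALL)
    return match.group(0) if match else None


def render_pdf(dot_source: str, stem: str = "test") -> Optional[Path]:
    """Write <stem>.graph and run 'dot -Tpdf -o <stem>.pdf <stem>.graph'.

    Returns the PDF path, or None if the dot executable is not installed.
    """
    graph_path = Path(f"{stem}.graph")
    graph_path.write_text(dot_source + "\n")
    if shutil.which("dot") is None:
        return None
    pdf_path = Path(f"{stem}.pdf")
    subprocess.run(
        ["dot", "-Tpdf", "-o", str(pdf_path), str(graph_path)],
        check=True,
    )
    return pdf_path
```

If the logger does add a prefix to every line, the prefix would have to be stripped before calling extract_digraph.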

slinnarsson commented 7 years ago

Actually, maybe it would make sense to split out the project-specific Luigi tasks to their own Python projects (and their own GitHub repos), and let cytograph contain only the shared parts. This would solve a couple of significant problems:

- Changes to some early luigi tasks (level 1, the classifier, etc.) currently affect multiple projects. It makes sense to have them be project-specific. For example, I have introduced the BalancedKNN in levels 2 and 3 for the adolescent project, but now I want to add it to level 1. This may affect other projects.
- We could of course make separate tasks for each project, but still keep them in cytograph. However, we would then have to be very careful about task names. E.g. prepare_tissue_pool.py does not indicate which project it belongs to. We could name it prepare_tissue_pool_Adol.py and prepare_tissue_pool_Dev.py, but this will be hard to enforce and will create bugs.
- Bugs in tasks that prevent the code from running would only affect one project, not all projects.

gioelelm commented 7 years ago

I am totally in favor of this. I think all of the luigi part should be project-specific, while the shared classes/functionality should constitute cytograph proper. Cytograph classes and functions should not contain internal calls to the luigi API (I think this is almost true right now: there should be fewer than a dozen occurrences of Luigi parameter/method calls). Instead, these values should be passed directly to constructors, methods and functions at the Luigi task level.


linnarsson-lab / cytograph-dev

cytograph needs some basic documentation #3