Refactoring: use a template library to create formatted output

xrotwang commented 8 years ago

Rather than stringing e.g. HTML tags together in long python functions, we should use a template library to separate format from data structure, and improve testability of output generating code. My recommendation would be Mako, because

it has almost no other dependencies
it is pretty fast
it allows python code in templates
it can be used to format non-XML, non-HTML formats equally well
it's used in all clld apps :)
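
As a rough illustration of what separating format from data buys us, here is a minimal sketch. It uses the standard library's string.Template as a stand-in so that it runs without extra dependencies; it is not Mako and not LingPy code, and the names ROW, PAGE, and render_alignment are made up for this example. With Mako, the loop over rows and cells could move into the template as well.

```python
from html import escape
from string import Template

# The markup lives in the templates; the function below only prepares data.
# ROW, PAGE and render_alignment are hypothetical names, not LingPy API.
ROW = Template("<tr><th>$taxon</th>$cells</tr>")
PAGE = Template('<table class="alignment">\n$rows\n</table>')


def render_alignment(taxa, alignment):
    """Render aligned sequences (one list of segments per taxon) as an HTML table."""
    rows = []
    for taxon, segments in zip(taxa, alignment):
        cells = "".join("<td>{}</td>".format(escape(s)) for s in segments)
        rows.append(ROW.substitute(taxon=escape(taxon), cells=cells))
    return PAGE.substitute(rows="\n".join(rows))


html_table = render_alignment(
    ["German", "English"],
    [["h", "a", "n", "t"], ["h", "æ", "n", "d"]],
)
print(html_table)
```

Because the function returns a string instead of writing a file, the output can be checked in a unit test without touching the file system.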

LinguList commented 8 years ago

Agreed, now I also see what you mean by the template system.

Just as a short overview here, regarding the different formats that are currently produced:

Phylip distances format (used as input for Phylip to carry out neighbor joining etc., but also convenient for SplitsTree, although two different versions of the format exist, with Phylip being restrictive and SplitsTree permissive)
Newick format for output of one or more trees
Nexus format for output of presence/absence patterns to be used in MrBayes and other software (basically, nexus should/could also include trees, distance matrices, and all the stuff)
GML format for graphs; GML is useful because most graph software (Cytoscape, Gephi, etc.) can read it, and it is also supported by networkx and igraph, which is not true of other graph formats
json for various purposes, which is probably clear, although it is not often used in LingPy at the moment
scoring functions: there is currently no real format, but handling the scoring of segments is important, so a lingpy-specific format is defined that is used to store scorers and can also be parsed (convert.strings.scorer2str() and read.phylip.read_scorer()).
msa and psa, two formats for multiple alignments, psa being rather strict, but MSA having some variants, with and without IDs, etc., which make parsing a bit annoying
html, as certain outputs are offered for writing directly to html, especially aligned strings in cognate sets, etc., but also alignments, and even wordlists in very flexible formats, using the wordlist.Wordlist.export function (this again is based on a dictionary output of the wordlist, defined in basic.ops.wl2dict)
tex, there is also tex output for alignments, and it can theoretically also be created using the export function mentioned above
csv (!!!), I almost forgot this...
tsv/qlc as the lingpy-wordlist format

Apart from that, we have all the plots. But here, I guess we need to distinguish between a plotter and a writer, since the plots visualize the data, while the writer still displays the data, although, with tex as one of the formats, the borders are not really strict here...

xrotwang commented 8 years ago

@LinguList thanks for this list! Makes for a nice task description for thursday :)

LinguList commented 8 years ago

@xrotwang, I'll probably still add more things to the list later (just realized that I had forgotten csv, and also the basic wordlist format...)

xrotwang commented 8 years ago

@LinguList A first example of using Mako to create output can be seen here: https://github.com/xrotwang/lingpy/commit/069b71d84527f036864c21ede2257e21f9311d7e I think it's worth the additional requirement (i.e. the mako library). What do you think?

LinguList commented 8 years ago

I agree that it's worth the additional dependency. First, the templates will be much easier to handle now. Second, the documentation will also be easier (one can say that there are basic types with args and keywords for the classes, and users with high ambitions will need to turn to mako for creating their own templates). Third, it will be much easier to customize things like, e.g., different nexus styles, different distance matrix outputs, etc. Although we have only a few of those variants in the library at the moment, there are many more out there (also for the handling of scoring functions, depending on biopython or other libraries), and handling them by hard-coding will just be a pain.

LinguList commented 8 years ago

One thing I was just thinking about is the question: when would we use templates for writing, and when would we need to go for other stuff? The point is the following:

we have simple stuff, like csv-writing, where using Python's csv module is probably the best
we have json as a format that also has automatic support for writing / rendering
we have the more complex or user-defined things for which we need templates, like phylip.dst-format, nexus, newick, etc.
we have plots where we pass to matplotlib, since they cannot be handled in a template
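
For the first two cases in the list above, a minimal sketch of what "no template needed" could look like, using only the standard library. The column names and the helper names to_csv and to_json are made up for illustration and are not part of LingPy.

```python
import csv
import io
import json

# Hypothetical wordlist rows; the column names are illustrative only.
header = ["ID", "DOCULECT", "CONCEPT", "IPA"]
rows = [
    [1, "German", "hand", "hant"],
    [2, "English", "hand", "hænd"],
]


def to_csv(header, rows, delimiter=","):
    """Serialize rows with the csv module, which takes care of quoting."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(header, rows):
    """Serialize rows as a list of objects keyed by column name."""
    records = [dict(zip(header, row)) for row in rows]
    return json.dumps(records, ensure_ascii=False, indent=2)


print(to_csv(header, rows, delimiter="\t"))
print(to_json(header, rows))
```

Passing a tab as delimiter gives a tsv variant from the same function, so these formats probably do not need a template at all.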

The borders between, say, "text-file" and "plot" are, however, not completely clear-cut when it comes to html and the latex support for MSA files: one could see the TEX export as a plot, and HTML is supposed to be treated as a plot, that is, as something stable that is meant for looking at, not for modifying manually.

So I'm asking myself, how to best think of these things, that is, text-export, hybrid-html-tex-export, and plots. Should we officially treat the hybrid exports as text-export (also in the documentation), or should we make a distinction between file output and, say, html-output?

There are three possibilities:

text-file + html/tex/etc. as "text-export" and plot as separate export (wordlist.output, wordlist.plot)
textfile, hybrid, and plot as three separate things (wordlist.output, wordlist.export, wordlist.plot)
textfile as separate and plot and export as one thing (wordlist.output, wordlist.plot)

This might be useful also for documentation purposes to have it somehow fixed and used similarly across all methods. I would tend to go for the distinction between output/export/plot, with output pointing to formats that can be read in again, export to formats for presentation, and plots to graphics. Does that make sense?

xrotwang commented 8 years ago

I think the best unit for pluggability when it comes to output is the adapter, i.e. a piece of code defined by the kind of object it adapts and a mimetype it adapts to, e.g. text/tex or maybe image/png. Whether it does so using a template is secondary.
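
The adapter idea could be sketched roughly as follows: a registry keyed by the adapted type and the target mimetype, with plain functions as adapters. This is an illustration under assumptions, not existing LingPy or clld code; the names ADAPTERS, adapter, render, and DistanceMatrix are hypothetical, and "text/x-phylip-dst" is a made-up mimetype label. The Phylip-style writer pads names to ten characters, which is my understanding of the strict variant.

```python
import json

# Registry mapping (adapted type, mimetype) to a rendering function.
# All names in this sketch are hypothetical.
ADAPTERS = {}


def adapter(cls, mimetype):
    """Decorator registering a function as the adapter for (cls, mimetype)."""
    def register(func):
        ADAPTERS[(cls, mimetype)] = func
        return func
    return register


def render(obj, mimetype):
    """Find an adapter for obj (or one of its base classes) and apply it."""
    for cls in type(obj).__mro__:
        func = ADAPTERS.get((cls, mimetype))
        if func is not None:
            return func(obj)
    raise LookupError(
        "no adapter for {} -> {}".format(type(obj).__name__, mimetype))


class DistanceMatrix:
    def __init__(self, taxa, matrix):
        self.taxa = taxa
        self.matrix = matrix


@adapter(DistanceMatrix, "application/json")
def matrix_to_json(dm):
    return json.dumps({"taxa": dm.taxa, "matrix": dm.matrix})


@adapter(DistanceMatrix, "text/x-phylip-dst")  # made-up mimetype label
def matrix_to_phylip(dm):
    # Taxon count first, then one row per taxon with a 10-character name field.
    lines = [" {}".format(len(dm.taxa))]
    for taxon, row in zip(dm.taxa, dm.matrix):
        values = " ".join("{:.2f}".format(d) for d in row)
        lines.append("{:<10} {}".format(taxon[:10], values))
    return "\n".join(lines)


dm = DistanceMatrix(["German", "English"], [[0.0, 0.25], [0.25, 0.0]])
print(render(dm, "text/x-phylip-dst"))
```

Whether an adapter uses a template, the json module, or matplotlib internally stays hidden behind render, so a plot adapter for image/png could be registered in the same way.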

LinguList commented 8 years ago

I agree.

lingpy / lingpy

Refactoring: use a template library to create formatted output #211