pythontex for web publishing

TheChymera commented 7 years ago

Hey, I believe we had a similar discussion way back, on a tangent, but here it is in its own issue.

Long story short, I believe standardized self-publishing of single observations, pilot results, or research drafts is one of the next big things in scientific information-sharing. It would be really great if pythontex could help provide the infrastructure to do so; also, I believe it would be easy for us to provide support for more powerful features than one would have with any existing publisher.

In conjunction with a static site generator such as Pelican, I could envision figures and/or dynamic text being recompiled if dependencies have changed when the site is rebuilt. With a bit of tweaking, Pelican could also archive past versions of each page, so that research could be improved post-publication in a well-documented fashion. Basically the website source could be managed on GitHub and edited collaboratively. Save for the PythonTeX part, I have already implemented a clunky version of the above in Octopress, and it would be easy to make it even cleaner in Pelican.

I think the first thing to consider, however, is pythontex Markdown support. What's the status on that?

Also, do you know any hackathons that would focus on publishing? If there is a lot of work to do (which I guess there is), it would be a great opportunity to get together and attract more people to help out. I was at BrainHack a few months ago, and people seemed to be really into the idea of a reexecutable publication - sadly, I did not spend much time trying to convince them to use PythonTeX, as I was working on package management. Still, I think the issue is better addressed in an environment dedicated to publishing.

gpoore commented 7 years ago

standardized self-publishing of single observations, pilot results, or research drafts is one of the next big things in scientific information-sharing.

I agree.

I'm not aware of hackathons focused on this sort of publishing, but they may very well exist.

In the short term, something like Pweave might give you something to work with.

In terms of PythonTeX for markdown or other text-based lightweight markup, I've been working in that direction on and off for at least the last 3 years. Here's a summary of what has happened and where things are.

PythonTeX now supports 7 languages. To make further extension and refinement viable, I need to transition from having all the templates as text in pythontex_engines.py to a system based on configuration files. I've started work on that.
Since PythonTeX currently works with only LaTeX, all settings are specified using the (more or less) standard LaTeX key-value syntax. This has some significant downsides, but it's what people expect and it's the easiest option. However, once code and related settings are in markdown instead, it no longer makes sense to specify settings using LaTeX syntax. So some sort of configuration language is needed.
Currently, the LaTeX side saves all code in the .pytxcode file, which uses a custom format. This usually works fine, but there are a few issues. If the code and settings were saved in a more standardized configuration or data format, it would be easier to add new features and solve some existing issues. It would also be easier to separate the core that executes code from the LaTeX-specific parts.

So all the different things I've been working on with PythonTeX, as well as some other projects, have ended up needing some sort of configuration. Sweave and Pweave use a very simple key=value system for settings, but there is a very small set of keys and values, so that's too restrictive for my purposes. In knitr, the values can be R expressions, which is undesirable for working with a range of languages. JSON is also too limited, because it doesn't allow multiline strings with literal newlines.
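
The JSON limitation is easy to check. A minimal sketch using Python's standard json module (the setting name `code` and the helper `try_parse` are only illustrative): a string value containing a literal newline is rejected, while the same value with an escaped newline parses.

```python
import json

# A JSON document whose string value contains a literal (unescaped) newline.
literal = '{"code": "x = 1' + "\n" + 'print(x)"}'
# The same document with the newline written as the two-character escape \n.
escaped = '{"code": "x = 1' + "\\n" + 'print(x)"}'


def try_parse(text):
    """Return the parsed object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# The first document is rejected because control characters inside JSON
# strings must be escaped; the second one parses normally.
print(try_parse(literal))
print(try_parse(escaped))
```

So multiline code embedded in JSON has to be written on one line with escapes, which is what makes it awkward for settings that carry code.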

From the markdown side, some flavors like pandoc use YAML for document metadata or other configuration. I've looked at YAML, and it isn't an option for what I need for multiple reasons. I've also looked at TOML. However, the restrictions on newlines in inline tables (inline key-value pairs) make it unsuitable for the sort of inline configuration I need. It also doesn't allow any non-ASCII characters in unquoted keys and is still marked as unstable (though practically speaking, probably not much will change).

So over the past year, I've created a configuration language. I'm hoping to release that in the next couple months, but of course it's hard to predict exact time frames on these things. The main thing that is left is implementing a few advanced features. There won't be a release until everything is stable and essentially complete. Any new format has to justify its existence (requisite XKCD reference), and that's a lot easier to do when everything works.

As soon as the configuration problem is solved, then I will separate PythonTeX's code execution core into a separate project that will work with markdown and other text-based formats. I think the end of summer 2017 might be realistic for a release. But again, these things are hard to predict.

TheChymera commented 7 years ago

It sure looks like there is a lot of work to go around. I notice you are the sole contributor. Is this by design (i.e. do you want to keep it your thing only) or just by circumstance? If just the latter, I think it would be great to get more people involved. Not to mention that - large an effort though it may be - pythonTeX is only one part of what would be required for standardized, reproducible, self-publishing.

I, for instance, am working a lot on package management ATM, and I'm still unsure how one might most seamlessly (implicitly, even?) integrate the list of required software into the body of the article. Dependencies of dependencies would be easy to handle via Portage.

We would also need a publishing framework (I suggested Pelican, but Pweave does look really nice) - and a way to toggle between the text view, and a source code view for commenting (collaborators/reviewers might also want to comment the invisible code, which makes the figures). Gitlab may be an option there?

Also, I met today with the CTO of Matters (@jonnyburger), and though the journal is by design following a non-distributed course, he also seems convinced of the huge advantages of a Git-model distributed publishing system.

I have googled a bit for hackathons dedicated to scientific publishing innovation, but found nothing. I suggest we might do something radical and organize one ourselves. I'm sure a lot of the Brainhack people would also be interested in participating. Basically the organizational overhead would be just determining the most suitable location, and convincing the university there (ideally one where at least one of us works) to grant us some space. Would you be on board? @obitus - how about you?

gpoore commented 7 years ago

I am the primary PythonTeX contributor largely by circumstance. Since it started out with a focus on Python and LaTeX, there was a limited audience to begin with. There are still a lot of features that could be added, but I think that I was able to start with a good enough feature set that there weren't a lot of people who were desperate for enough additional features that they were willing to create those themselves. (Parts of the project are also some of the earliest Python code I ever wrote, and so may not be the easiest to work with.) Early on, I would have wanted to keep the project sufficiently my thing so that I could get some publications out of it, but at this point I've gotten enough publications.

The other thing to keep in mind is just how much other software exists that is somewhat similar to PythonTeX. My first talk on PythonTeX was at SciPy 2012, which was also when the IPython (now Jupyter) notebook was really starting to take off. The R people have Sweave and more recently knitr, plus RStudio, etc. Then there's Pweave, which is like Sweave/knitr for Python. There's also the Beaker Notebook, which appears to be a derivative of Jupyter with some additional features. And of course Emacs org-mode has allowed some of these things for a long time.

I think PythonTeX offers superior integration with LaTeX, which was the objective, but the competition has many advantages, particularly when LaTeX isn't involved. I want to add support for other text-based formats like markdown, because that's something I would use myself and I think writing the code would be interesting. But at this point I'm really not sure how competitive a PythonTeX derivative will actually be in a non-LaTeX language. I won't know for sure till I build it.

I could be interested in some sort of meeting, but I can typically only manage 1-2 conferences or meetings per year, and already have tentative plans for SciPy 2017 in Austin, Texas. Although that might be a good place for some discussions.

TheChymera commented 7 years ago

We could maybe organise a sprint for a pythonTeX web-publishing draft, and let that be our test run? I think that would also help motivate us to finish the prerequisites until then.

I'll definitely look more into Pweave - and if nothing else, I'll have something to benchmark the new PythonTeX against. Also, I'll have to see how to best design a reproducible website workflow for Pelican (as I see it, both Pweave and PythonTeX would need to be integrated in some website for navigation). One thing I can already promise to have is a number of virtual machines with highly configurable package management and lots of scientific packages, so that we can test highly diverse USE cases.

Do you think you could set your course for June 2017?

TheChymera commented 7 years ago

@gpoore submissions for SciPy 2017 are open and close next week. Want to organize a pythontex/self-publishing sprint?

https://scipy2017.scipy.org/ehome/220975/493426/

gpoore commented 7 years ago

@TheChymera A sprint might be good, but I will have to think about my schedule a little more. Next week is the deadline for sprints that are on the calendar (which would be good), but isn't the deadline for actually organizing sprints. Also, topic-specific meetings are often organized at these conferences (I think they're often called "birds of a feather" or something like that), so a shorter topic-specific meeting could be an alternative to a sprint, depending on objectives.

Are you, or anyone else you know who uses PythonTeX or is interested in these kinds of things, planning to be there? Having some ideas on numbers could be useful.

What would you want to accomplish with a sprint? How much of that ties into LaTeX, versus other markup? My impression is that many people use Jupyter notebooks for this sort of thing, and some are using org-mode or Pweave or some of the other tools I've mentioned. So it would be important to be clear about the objectives of a sprint, and how the objectives relate to what already exists in other tools.

TheChymera commented 7 years ago

Next week is the deadline for sprints that are on the calendar (which would be good)

This could definitely bolster the numbers and raise the profile. I think adoption is a big issue in the success of a FOSS project, and I feel the playing field for reproducible publishing is still very fluid. Getting a lot of people on board could mean we get the reach and good-quality contributions (resulting in a better project, more contributions, etc.) that might otherwise go to other projects.

"birds of a feather" or something like that

According to the web page, these are more like panel discussions. I couldn't find any yt videos of those sessions, but I suspect that if we want to get some nice work done as opposed to brainstorm, philosophize - or anything else that may be better done after hours - this may not be the right format.

Are you, or anyone else you know who uses PythonTeX or is interested in these kinds of things, planning to be there?

Me definitely (I am also submitting two other projects, but even if only the sprint gets accepted, I'll join). Nobody else I know well. Maybe some brainhack people.

What would you want to accomplish with a sprint?

Introduce features to make pythontex (or whatever you want to call the refactored version) more usable for web-self-publishing, e.g.:

add (better) support for HTML output
introduce (better) separation of syntax between files
require fewer files to be maintained by the user
support more figure generation systems (e.g. those of biopython and graph-tool)
set up a demo version-tracked, reproducible, self-publishing website using pythontex to generate up-to-date scientific content whenever the website is regenerated.

I think the last point in particular is actually the most important. However cleanly (or not) pythontex will end up handling the issues associated with self-publishing, what will make the greatest difference in usability and adoption is if people can easily get a demo structure into which they can just plug their content.

gpoore commented 7 years ago

That all sounds pretty reasonable. I guess the main thing I need to look into is how far PythonTeX (or, really, the successor for general markup) can get by that point. I'll try to get back to you on this in the next couple days.

I'm currently finishing up the config language that will be part of refactoring PythonTeX and supporting non-LaTeX languages. That will be out next week, and I'll also be submitting an abstract to the SciPy conference.

TheChymera commented 7 years ago

next week, and I'll also be submitting an abstract to the SciPy conference.

You mean for a talk/poster or for the sprint?

gpoore commented 7 years ago

I should have been more clear about that. I will see what I can come up with in terms of sprint options/objectives. And I am submitting an abstract based on the config language for a talk/poster.

TheChymera commented 7 years ago

I really hope we can get this sprint together. Let me know if I can help with anything.

gpoore commented 7 years ago

I think I can stay for at least one day of sprints. I looked back at the conference page, and I'm not seeing a deadline for sprint submission, so I don't know if they changed the page or I was looking at the wrong thing. In any case, it looks like the sprints aren't posted until June, so it looks like we have some time to think about options, which is good.

My config language is out with basic features (https://bespon.org/), so things will soon be in place for refactoring PythonTeX and working toward supporting other markup languages. However, realistically, I probably won't be able to do any real work on that until at least the middle of May. So it's hard to predict how far things could be by the conference.

In terms of thinking about possible sprint topics, if you could list/describe some of the specific features you want for publishing (especially things that don't currently exist, or don't exist in a good form), that might help narrow topics or directions. For example, what do you want that doesn't already exist in something like Authorea?

gpoore commented 7 years ago

@TheChymera Are you planning on going to SciPy, and if so, are you still potentially interested in working on something during the sprints? I've finished grading final exams, so I can start thinking about programming again.

TheChymera commented 7 years ago

@gpoore I'm going! And I continue to be totally interested in doing some sci-publishing work during the sprints.

I haven't gotten back to your question in no small part because the project I'm presenting turns out to still need quite a bit of work, and that work turns out to require quite a bit of time (I find it hard to believe that I didn't foresee this....).

But I'll get on this after next week. Any idea when the sprints are? In the July 10-16 timeframe? I'm looking to buy my tickets soon.

gpoore commented 7 years ago

@TheChymera Sprints are July 15-16 (Saturday-Sunday). I will at least be there through part of Saturday, and possibly through part of Sunday...also working on travel plans.

More discussion later is fine. I'm also busy working on conference preparation.

TheChymera commented 7 years ago

@gpoore

In terms of thinking about possible sprint topics, if you could list/describe some of the specific features you want for publishing (especially things that don't currently exist, or don't exist in a good form), that might help narrow topics or directions. For example, what do you want that doesn't already exist in something like Authorea?

The issues (independent of any particular features, even) I see with Authorea or similar platforms are:

Their infrastructure is any combination of not-free, not-open, or not-reproducible. Ideally one would have a system which anybody can run entirely locally. This would not only give researchers more freedom and autonomy, but it would also allow the infrastructure to be used by any number of commercial service providers / publishers / universities and lead to higher adoption.
They are GUI-centric. While GUIs can be great for service providers, they are of little use to technology development, and generally don't age well. Ideally a truly distributable infrastructure would have a barebones API, which developers develop also/mainly for themselves. If anybody (including e.g. the original developer) wants to sell some service based on the infrastructure, they can of course write a beautiful GUI on top of it.

Now to some of the features:

What I really like about PythonTeX (and I have failed to find anywhere else) is that it offers a more-or-less uniform way to include dynamic plots in all manner of documents. One should be able to include a dynamic figure in the same way in a paper, book, presentation, website, or poster. Ideally the python function code would stay entirely the same for all outputs, and one could customize the appearance only in the .sty or .css files (depending on the output) and document-specific matplotlibrc.py files. Regarding the aforementioned example, we would need a way to not have to specify the figure sizes in the same place where we define the python analysis functions, but perhaps in the document, where we call them.
I would be looking to try and minimize the boilerplate code. Currently, PythonTeX seems to require quite a bit of it. All of the functions in functions.tex could be provided in a nicer way internally by pythontex.
There are a number of really nice plotting libraries (graph-tool, biopython, sadisplay, etc.) which behave rather differently than pure matplotlib. It would be nice if we could better support some of them.
Ideally one could separate e.g. LaTeX and Python code entirely across documents, perhaps it is possible to just define plotting functions (entirely layout-agnostic) in the Python code file, and call the functions by name and with size parameters in e.g. the LaTeX document.
There should be support for data dependency specification (also in the form of directories).
It would be really cool if python dependency management could be handled (more) automatically. Perhaps it is possible to just get the path from all import statements and automatically track those files as dependencies.
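
To make the last point concrete, here is a rough sketch of reading the import statements from a code chunk and resolving them to files that could be tracked. It is only an illustration, not anything PythonTeX provides: `imported_modules` and `dependency_files` are hypothetical names, and only the standard `ast` and `importlib` modules are used. Built-in and frozen modules have no file on disk and are skipped, and imports built dynamically at run time would not be seen.

```python
import ast
import importlib.util


def imported_modules(source):
    """Collect top-level module names from every import statement in source.

    ast.walk visits the whole tree, so imports nested inside functions
    (local or lazy imports) are found as well.
    """
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 excludes relative imports such as "from . import x".
            names.add(node.module.split(".")[0])
    return names


def dependency_files(source):
    """Map each imported module to the file it would be loaded from."""
    files = {}
    for name in sorted(imported_modules(source)):
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        # Modules that are missing, built-in, or frozen have no location.
        if spec is not None and spec.has_location:
            files[name] = spec.origin
    return files


chunk = "import json\nfrom os import path\n\ndef plot():\n    import colorsys\n"
for module, origin in dependency_files(chunk).items():
    print(module, origin)
```

For a package, the reported file is its `__init__.py`, so a real implementation would probably want to track the surrounding directory instead.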

I think what would really be a killer in terms of sprint output would be an example repository, where we provide some dummy data, a comprehensive (but if possible, small) list of deps, define plotting functions (once and only once for all outputs), and provide the entire source for creating a pdf poster, pdf paper, pdf presentation, and HTML paper based on them. If it's easy enough to just clone it and fill it in with your own content, I can see it getting a lot of traction.

TheChymera commented 7 years ago

In neuroscience in particular, reexecutable publication is getting a lot of traction, see this very recent paper https://f1000research.com/articles/6-124/v2 (succeeding the aforementioned brainhack efforts).

The reexecutable source code doesn't contain the paper text, but ideally all of the reproducible analysis could be wrapped and triggered by the document build system via pythonTeX.

gpoore commented 7 years ago

@TheChymera Thanks for the comments on other tools.

In terms of your comments on PythonTeX features:

Figures: Setting things like dimensions on the LaTeX side could be nice and should be doable. At the same time, at some point trying to separate layout and content may break down. For example, I often change fonts, font sizes, scale, etc. for plots depending on whether they are for print, web, or poster. I think the solution may be some sort of plugin system. A basic plugin for a plotting library could split simple things like dimensions off onto the LaTeX side, while a more sophisticated version would do much more automatically, potentially even adapting for output format.
Data dependency specification: the current add_dependencies() can be extended to support directories, and something similar can probably be created to work with remote data. Are you thinking about more beyond this?
Python dependency management would ideally use an external, pre-existing library. In the ideal case, it would be possible to get a list of non-builtin imports and track __version__ for all of them, but I've run across several non-builtin libraries recently that lack __version__, which could mean falling back on hashing. Also, there's always the possibility of local or lazy imports, so those would have to be handled correctly.
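
As a sketch of the "track __version__, fall back on hashing" idea for a single imported module (the function name `module_fingerprint` and the returned tuple layout are assumptions for illustration, not an existing PythonTeX feature):

```python
import hashlib
import json
from pathlib import Path


def module_fingerprint(module):
    """Return ("version", v) if the module has __version__, else a file hash.

    Modules with neither a version string nor a source file (for example
    built-in modules) are reported as ("unknown", "").
    """
    version = getattr(module, "__version__", None)
    if isinstance(version, str) and version:
        return ("version", version)
    source = getattr(module, "__file__", None)
    if source is None:
        return ("unknown", "")
    digest = hashlib.sha256(Path(source).read_bytes()).hexdigest()
    return ("hash", digest)


# hashlib defines no __version__, so it normally falls back to a file hash.
print(module_fingerprint(hashlib)[0])
print(module_fingerprint(json)[0])
```

For a package, `__file__` points at `__init__.py`, so this fallback only notices changes to that one file.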

Over the next week or so, I'm going to look into separating the code execution core (or at least a basic part of it) and getting it running with Markdown. I think that will probably be a faster way to move forward and add new features, because it will remove all the overhead on the LaTeX side. Then I can go back and update the LaTeX side once the new and improved code execution core has stabilized.

TheChymera commented 7 years ago

I often change fonts, font sizes, scale, etc. for plots depending on whether they are for print, web, or poster.

To use the example I gave above, where we set up a repository which can reproduce a poster, a presentation, and a paper with the same figures in appropriate styles: I'd say this can be adequately addressed by using a per-document matplotlibrc file and allowing the size to be set per figure. I assume that only the size may have to be controlled independently for each figure within one document, while fonts should rather be kept consistent across the entire document.
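
A toy sketch of that split, using plain dictionaries as a stand-in for per-document matplotlibrc files so that it runs without matplotlib. The keys `font.size` and `figure.figsize` are real matplotlib rcParams names, but `DOCUMENT_STYLES`, `figure_settings`, and all the numbers are made up for illustration.

```python
# Per-document defaults (illustrative values only), playing the role of
# one matplotlibrc file per output document.
DOCUMENT_STYLES = {
    "paper": {"font.size": 9, "figure.figsize": (3.3, 2.5)},
    "poster": {"font.size": 24, "figure.figsize": (10.0, 7.5)},
}


def figure_settings(document, figsize=None):
    """Combine a document-wide style with an optional per-figure size.

    Fonts always come from the document style; only the size can be
    overridden where the figure is called.
    """
    settings = dict(DOCUMENT_STYLES[document])  # copy, keep defaults intact
    if figsize is not None:
        settings["figure.figsize"] = figsize
    return settings


print(figure_settings("paper"))
print(figure_settings("poster", figsize=(5.0, 4.0)))
```

With matplotlib installed, a dictionary like this could be passed to `matplotlib.rc_context` around the unchanged plotting function.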

add_dependencies() can be extended to support directories, and something similar can probably be created to work with remote data.

This would be great!
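
For what directory support could look like from the user side, here is a small sketch of expanding a mixed list of files and directories into the individual files to watch. `expand_dependencies` is a hypothetical helper, not part of PythonTeX, and how the result would be handed to add_dependencies() is left open.

```python
from pathlib import Path


def expand_dependencies(paths):
    """Expand directories into the files they contain, recursively.

    Plain file paths are kept as they are (even if they do not exist yet,
    so that a missing dependency can be reported later).
    """
    expanded = set()
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            expanded.update(p for p in path.rglob("*") if p.is_file())
        else:
            expanded.add(path)
    return sorted(expanded)


# Example call with hypothetical paths; a directory that does not exist
# is simply kept as a single entry.
for dependency in expand_dependencies(["data", "analysis/params.json"]):
    print(dependency)
```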

I've run across several non-builtin libraries recently that lack __version__, which could mean falling back on hashing.

I was actually thinking of hashing all the way. Most python module files are small, so I don't see a major drawback associated with hashing. If anything it's a lot more robust to detecting changes. Regarding unavailable __version__, I myself also use live versions of packages I develop, and some of them lack that attribute. If PythonTeX could determine the path of import mypackage to be e.g. /usr/lib64/python3.4/site-packages/mypackage/, and hash that, that would be really great.
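
Roughly what I have in mind, as a sketch only (`package_root` and `hash_tree` are hypothetical names; nothing here is an existing PythonTeX function): locate the package without importing it, then hash every .py file under it together with its relative path.

```python
import hashlib
import importlib.util
from pathlib import Path


def package_root(name):
    """Return the package directory, or the module file for a single module."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.has_location:
        raise ModuleNotFoundError(f"no file location for {name!r}")
    origin = Path(spec.origin)
    # Packages have search locations; their origin is .../<pkg>/__init__.py.
    return origin.parent if spec.submodule_search_locations else origin


def hash_tree(root):
    """Hash all .py files under root (or root itself if it is a file)."""
    digest = hashlib.sha256()
    files = [root] if root.is_file() else sorted(root.rglob("*.py"))
    for path in files:
        # Include the relative path so that renames change the hash too.
        digest.update(path.relative_to(root.parent).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()


root = package_root("json")
print(root.name, hash_tree(root)[:12])
```

Compiled extension modules would need more than `*.py`, so the file pattern is the part I am least sure about.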

gpoore commented 5 years ago

This issue started with a focus on web publishing. I now have a new project, Codebraid (https://github.com/gpoore/codebraid/) that is basically PythonTeX for Pandoc Markdown. Codebraid doesn't yet have dependency tracking and some other PythonTeX features. It does have support for Jupyter kernels, which brings some nice automatic plotting capabilities.

Future discussion of PythonTeX-style web publishing will be more appropriate for Codebraid rather than PythonTeX, so I'm closing this issue.

TheChymera commented 4 years ago

@gpoore understood, might I ask what the rationale was for splitting them up? Just curious from the environment design and dependency management (for the software itself, not for the scripts run by it) point of view.

gpoore commented 4 years ago

@TheChymera PythonTeX is on CTAN and accessible through LaTeX package managers, which is good as long as it is a LaTeX package, but would not be ideal if it also had Markdown capabilities. If you want to run code in a Markdown document and produce HTML, being forced to install LaTeX or do a manual software install without some sort of package manager isn't ideal. Having a standalone Python package available through PyPI and Conda Forge is a better option.

The goal is for Codebraid to become a general-purpose library for running code, manipulating code (for example, extract snippet based on regex), tracking dependencies, etc. It currently provides an interface to Pandoc Markdown. Eventually I hope to add a direct LaTeX interface so that PythonTeX and minted (or something like them) can use the improved capabilities.

gpoore / pythontex

pythontex for web publishing #96