thecapitalistcycle / covid19model

Code for modelling estimated deaths and cases for COVID19.

Pipeline to Explorable Explanations and Data Journalism #2

dentarthur opened this issue 4 years ago

dentarthur commented 4 years ago

Pipeline

Hi again @harrisonzhu508 and hi to @am5113

Some hasty notes re:

https://github.com/ImperialCollegeLondon/covid19model/issues/11

and

https://github.com/ImperialCollegeLondon/covid19model/issues/6

See the warning at the end regarding my (lack of) competence. I am only linking to these notes from the original issues, to keep out of the way.

The number of forks of the index repository is growing daily at rates comparable to COVID-19 transmission, even before counting second- and later-generation forks.

It won't be necessary for the Imperial College team and close contacts to work directly on data visualization. There will be floods of people doing that.

A higher priority is to rapidly set up a pipeline from the underlying Imperial College response team releases through the various stages between that and data journalism with "Explorable Explanations". This will need some organization of curation, and even "triage", to avoid overwhelming staff at Imperial College while providing some guidance to people who will certainly gallop off in all directions whether guided or not.

My guess is that the pipeline will run fastest through these stages:

  1. Python/PyStan code with no dependency on R. This enables the large numbers of people unfamiliar with R to start setting up Jupyter notebooks using the SciPy tools they already know, and to get to grips with PyStan, the recommended texts on Bayesian models from the Stan docs, and the epidemiological concepts understood by the audience to whom the Imperial College papers are addressed. This is already enabled by the current release and hopefully only needs contact points to put people in touch with each other in order to speed things up. Epidemiologists and others in many countries and regions will be adapting what has already been released to their local situation and local data, as well as to updates on common parameters such as distributions for infectiousness from exposure. Far more people will be able to get to grips with the models in Python than can do so using the specialist tools dominant among epidemiologists.

A separate repository should be kept in sync with the Imperial College response team's work by unit tests that compare exact reproducible inputs and outputs between the R and Python versions (in addition to the regression tests on the R version).
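As a rough illustration, such a parity test might look like the sketch below; the fixture paths, column names and tolerance are all assumptions rather than the repository's actual layout.

```python
# test_parity.py - hedged sketch of an R/Python parity check.
# Fixture paths, column names and the tolerance are hypothetical.
import numpy as np
import pandas as pd

def test_python_run_matches_r_reference():
    """Compare a fixed-seed Python run against a committed R reference dump."""
    r_out = pd.read_csv("tests/fixtures/r_estimated_deaths.csv")
    py_out = pd.read_csv("tests/fixtures/py_estimated_deaths.csv")
    # Both dumps must describe the same countries and dates.
    assert list(r_out.columns) == list(py_out.columns)
    assert len(r_out) == len(py_out)
    # MCMC output will not match bitwise across languages even with pinned
    # seeds, so compare posterior summaries at a loose tolerance instead.
    np.testing.assert_allclose(r_out["estimated_deaths_mean"],
                               py_out["estimated_deaths_mean"], rtol=1e-2)
```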

The notes below are about subsequently reaching a much wider audience via the later stages of the pipeline. The aim is to avoid impeding the flow from one stage to the next.

  2. Python notebooks and dashboards that can run from servers, similar to what R does with Shiny. I think the bottleneck will be server bandwidth running PyStan, which could soon overload server capacity given that several billion people are already more than casually interested and stuck at home on the internet. I suggest encouraging use of Bokeh, because its direct static links between widgets on the JavaScript side are a step towards the "serverless" static web platform pages that will be needed next, with effectively unlimited bandwidth from content distribution networks (as with GitHub Pages etc.). Bokeh also runs from the same Tornado servers used by Jupyter notebook hubs.

Most Jupyter Python visualization software relies on JavaScript for display, but links between widgets and figures require server round trips. That should be avoided in a pipeline aimed at end-user browsers.
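As a minimal sketch of the client-side wiring I mean, here is a Bokeh slider driving a toy growth curve through a CustomJS callback, saved as a static HTML page that needs no server at all (the curve and parameter names are purely illustrative):

```python
# Hedged sketch: a slider wired to a plot entirely in the browser.
import numpy as np
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, CustomJS, Slider
from bokeh.plotting import figure, output_file, save

x = np.linspace(0, 60, 200)
source = ColumnDataSource(data=dict(x=x, y=np.exp(0.1 * x)))

plot = figure(title="Toy exponential growth")
plot.line("x", "y", source=source)

slider = Slider(start=0.01, end=0.5, value=0.1, step=0.01, title="growth rate")
# The callback runs client-side, so the saved HTML stays interactive when
# served from a static host such as GitHub Pages.
slider.js_on_change("value", CustomJS(args=dict(source=source), code="""
    const data = source.data;
    const r = cb_obj.value;
    for (let i = 0; i < data.x.length; i++) {
        data.y[i] = Math.exp(r * data.x[i]);
    }
    source.change.emit();
"""))

output_file("toy_growth.html")
save(column(slider, plot))
```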

There may also be something similar in Bloomberg's bqplot, and their team might be able to help:

https://github.com/bloomberg/bqplot

HoloViews sublates Bokeh and looks to me like the most productive starting point for visualization work:

https://github.com/holoviz/holoviews
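The same idea in HoloViews, as a hedged sketch with the same illustrative toy curve: a HoloMap over a small parameter grid embeds every frame in one static HTML page, so the exported slider works without any server.

```python
# Hedged sketch: a toy growth-rate explorer exported as static HTML.
import numpy as np
import holoviews as hv
hv.extension("bokeh")

x = np.linspace(0, 60, 200)

def growth_curve(r):
    return hv.Curve((x, np.exp(r * x)), "day", "cases")

# Each growth rate becomes one embedded frame behind a slider widget.
rates = np.round(np.arange(0.05, 0.30, 0.05), 2)
hmap = hv.HoloMap({r: growth_curve(r) for r in rates}, kdims="growth rate")
hv.save(hmap, "growth_explorer.html")
```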

BTW people should quickly find their way from HoloViews to the wider PyViz ecosystem, which connects to the still wider PyData ecosystem, which in turn connects via NumFOCUS to the ecosystems for HPC GPU clusters etc. (e.g. via Dask and PyArrow, SYCL, DPC++ and Kokkos).

https://towardsdatascience.com/pyviz-simplifying-the-data-visualisation-process-in-python-1b6d2cb728f1?gi=281f80ffb210

NumFOCUS et al. should be able to mobilize very skilled teams on request from the Imperial College team, UK SAGE, etc.

  3. Python-based "apps". People who don't already use Jupyter notebooks can get familiar via the nteract app (Electron-based), which installs like any other PC application on Windows, Mac or Linux. There is no need to understand pip, conda, docker etc.; one just needs to be a PC user. This gives the same access to Jupyter notebooks and dashboards using PyStan without any load on servers. Again, there is a team there that could do the work.

  2. "Serverless" static web platform pages. One route might be via pyodide which could actually compile PyStan to web assembly (it already runs numpy and pandas). But this is very bleeding edge alpha and carries the burden of much longer initial load times (entire cpython, numpy and pandas loaded from server to browser). Those that can do it will, while others should not wait for them to finish but proceed along more mainstream lines.

  5. The mainstream route is conversion from Jupyter Python notebooks to standard web platform JavaScript/HTML5, e.g. as at:

https://observablehq.com/collection/@observablehq/coronavirus

That collection is also growing rapidly (with independent visualizations, not just forks). It uses technology suitable for "Explorable Explanations", but I think most of the output so far is just visualization with some animation rather than being really explorable.

I suggest that particular collection is the main target to aim at, via the necessary intermediate pipeline stages from the "real" epidemiological models through Jupyter Python notebooks, dashboards and apps.

There is plenty of overlap between Pythonistas and web platform developers - far more than with R and epidemiology tools, despite Shiny having done dashboards earlier.

  6. That large and growing "Explorables" community strongly overlaps with data journalism, which can get visualizations with widgets, and perhaps also Explorable Explanations, into mainstream media, e.g. using simpler tools like Idyll markup:

https://github.com/idyll-lang/idyll

Bottleneck

My guess is that the main bottleneck will be separating the visualization/animation/exploration that can be done well within an end-user browser from the underlying data crunching in PyStan (not to mention the work that requires HPC GPU clusters).

I am assuming that, apart from the Pyodide route, the eventual explorables running in end-user browsers will have to manipulate an emulation of the effects of varying parameters in the model, rather than running the models themselves as regional epidemiologists and others will be doing.

There will be plenty of Python data scientists who know how to train simplified models to emulate the behaviour of a more complex system, using methods that I have no idea about. I suspect those methods will often be quite inadequate, e.g. compared with Imperial College Reports 9 and 13. There are far more complex-systems theorists who can emulate an SEIR model than who actually understand what they are emulating, and they often get epidemiological concepts wrong.
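To make the general idea concrete, here is one hedged sketch of such an emulator, with a stand-in function in place of the real model; everything here, from the parameter ranges to `run_full_model`, is hypothetical.

```python
# Hedged sketch: fit a Gaussian-process emulator to parameter-sweep output.
# `run_full_model` stands in for an expensive PyStan or HPC run.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def run_full_model(r0, intervention_day):
    """Placeholder for the real model; returns a toy peak-deaths figure."""
    return 1e4 * r0 ** 2 * np.exp(-0.05 * intervention_day)

rng = np.random.default_rng(0)
params = rng.uniform([1.5, 10.0], [4.0, 60.0], size=(200, 2))  # (r0, day)
targets = np.array([run_full_model(r0, day) for r0, day in params])

emulator = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
emulator.fit(params, targets)

# A browser-side explorable would query the cheap emulator, not the model.
print(emulator.predict([[2.4, 30.0]]))
```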

I assume the data needed for training such simplified models would consist of parameter-sweeping runs of either the PyStan models or the underlying Report 9 HPC cluster model, dumped to NetCDF/HDF5 files and imported with PyTables.
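A minimal sketch of such a dump with PyTables, using made-up parameter and output names:

```python
# Hedged sketch: dump a toy parameter sweep to HDF5 with PyTables and
# read it back as a structured NumPy array. All column names are made up.
import numpy as np
import tables

class SweepRow(tables.IsDescription):
    r0 = tables.Float64Col()
    intervention_day = tables.Float64Col()
    peak_deaths = tables.Float64Col()

with tables.open_file("sweeps.h5", mode="w") as h5:
    table = h5.create_table("/", "runs", SweepRow, title="parameter sweep")
    row = table.row
    for r0 in np.linspace(1.5, 4.0, 6):
        for day in np.linspace(10, 60, 6):
            row["r0"] = r0
            row["intervention_day"] = day
            row["peak_deaths"] = 1e4 * r0 ** 2 * np.exp(-0.05 * day)  # toy
            row.append()
    table.flush()

with tables.open_file("sweeps.h5", mode="r") as h5:
    runs = h5.root.runs.read()  # structured array ready for training
```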

I am guessing that it would be important to establish some QA tests so that the behaviour of an emulated model can be compared against the real model under the same parameter variations and given a score for accuracy. That could need some supervision from Imperial College staff.
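Building on the emulator sketch above, the scoring could be as simple as a worst-case relative error over a held-out parameter grid, gated in a unit test; the 5% tolerance here is an arbitrary placeholder.

```python
# Hedged sketch of a QA gate, reusing the hypothetical `emulator` and
# `run_full_model` from the sketch above.
import numpy as np

def qa_score(emulator, run_full_model, test_params):
    """Worst-case relative error of the emulator over held-out parameters."""
    truth = np.array([run_full_model(*p) for p in test_params])
    approx = emulator.predict(test_params)
    rel_err = np.abs(approx - truth) / np.maximum(np.abs(truth), 1e-9)
    return rel_err.max()

# e.g. in a unit test, with an arbitrary 5% tolerance:
# assert qa_score(emulator, run_full_model, held_out_grid) < 0.05
```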

Beyond the Bottleneck

The number of people currently locked up at home and Susceptible to Infection with a desire to produce data visualizations is probably comparable to the total number of GitHub users - millions.

Visualizations and data journalism so far have been rather limited, with excessive focus on current and local stats that do not actually help understanding much. What is needed is "Explorable Explanations".

The two one-hour media conferences, five days apart, in response to earlier models and then to Imperial College Report 9 following the Italian catastrophe, were rather spectacular. Rapidly shifting UK policy was explained by the Chief Medical Officer, Chief Science Officer and Prime Minister using hand-waving and gestures towards wall charts. If they had had an Explorable model they would have been using it to make a better attempt at conveying what they were talking about to the uncomprehending journalists.

Nearly a month later the same is true, from what I have seen of media conferences in the USA and Australia.

Focus on eliminating bottlenecks in the pipeline and on protecting the time of the core team, rather than on doing the work that many others can run with.

Warning: LARK'S VOMIT

I do not know what I am talking about and do not even have novice level "hands-on" competence to help in any of the areas mentioned.

These notes are entirely based on preliminary "breadth-first" reading for a future project on providing agent-based, stock-flow-consistent microfoundations for business cycle models based on Pavel Maksakovsky's work The Capitalist Cycle (and on dreaming of a massively multiplayer online simulation game using GPUs on gamer PCs).

My only indirectly relevant skill is "breadth-first", i.e. superficial, preliminary reading - aka procrastinating, or perhaps philosophizing.

Still, even the errors and misconceptions inevitably arising from that can sometimes be helpful for people busy fighting on the front lines. At the least it might help clarify what those who know better must tell others, or warn them against.