Open nokome opened 4 years ago
I support the overall goal you've outlined above @nokome and think this is an overall improvement for project ergonomics. I do have a couple of statements and questions though:
A bit meta, but this seems more of a Hub concern than an Encoda one to me, or were you thinking that changes were required from Encoda's side as well?
> However, it means that each time that Encoda's HTML encoding changes (e.g. a bug fix, or new meta data is added), or there is an upgrade to Thema or the components, it is necessary to create a new snapshot.
To be explicit, the proposal solves for changes with the output Codec. Enhancements to the JSON codec/Schema would still require regenerating the snapshot. Reason I bring this up is because I wonder if it's worthwhile storing the dependency version information as an (internal) meta file along with the snapshot. This would allow for any reproducibility or "version locking" features down the road. This would be on top of the hashed directory names.
> This approach would have the advantage that snapshots would not need to be regenerated just because there was a change in the dependencies, and updates to dependencies would automatically be reflected in the content served to users.
Something to be conscious of is that this will create an ever growing directory of snapshots. Even if a dependency was updated for something unrelated to the HTML snapshot, we will end up needlessly invalidating caches for every project on the Hub.
I think both of the above problems have technical solutions (deletion of old snapshots, content hash comparisons for de-duping, auto regeneration of project main file as JSON, etc.), but it depends on the extent to which we want to solve them.
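As a rough illustration of the "content hash comparisons for de-duping" idea, the sketch below compares two generated output files by digest. The function names are hypothetical and nothing here is taken from the Hub or Encoda code; it only assumes that outputs are ordinary files on disk.

```python
import hashlib
from pathlib import Path


def content_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def is_unchanged(old_output: Path, new_output: Path) -> bool:
    """True when a regenerated output is byte-identical to the previous one."""
    return content_digest(old_output) == content_digest(new_output)
```

If a dependency update produces byte-identical output, the new copy could be dropped (or replaced with a link to the old one) rather than stored twice.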
> A bit meta, but this seems more of a Hub concern than an Encoda one to me, or were you thinking that changes were required from Encoda's side as well?
Hah, no, I just mistakenly created the issue in the wrong repo. Have moved it to the right one now!
> Enhancements to the JSON codec/Schema would still require regenerating the snapshot. Reason I bring this up is because I wonder if it's worthwhile storing the dependency version information as an (internal) meta file along with the snapshot.
Yes. The JSON codec is unlikely to change much, if at all (it is so simple, which is why we will use it as the snapshot format), but the decoding codec (e.g. when an Rmd is converted to JSON) is indeed likely to improve and thus may necessitate a new snapshot. But by using JSON instead of HTML we at least halve the number of times that a new snapshot is needed just because a dependency is updated.
Regardless, I think it's a good idea to store the version information of Encoda used, the image in which the generation was done etc. I was thinking of doing that in the JSON and then creating a hash of the JSON as a "reproducibility certificate" / checksum / "fingerprint" for the snapshot.
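A minimal sketch of the fingerprint idea, assuming the version information is stored inside the snapshot JSON itself. The `generatedBy` field and its keys are placeholders made up for illustration, not part of any Stencila schema, and the version strings are dummy values.

```python
import hashlib
import json


def fingerprint(snapshot: dict) -> str:
    """Hash a canonical serialization so that key order does not matter."""
    canonical = json.dumps(
        snapshot, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


snapshot = {
    "content": {"type": "Article", "title": "Example"},
    # Hypothetical provenance block; field names and values are assumptions
    "generatedBy": {"encoda": "0.0.0", "image": "example/image:tag"},
}

print(fingerprint(snapshot))
```

Because the provenance is part of the hashed document, regenerating the same content with a different Encoda version or image yields a different fingerprint.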
> Something to be conscious of is that this will create an ever growing directory of snapshots.
Yes, I think a cron job to delete unneeded directories is required.
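The cleanup such a cron job might perform could look roughly like the sketch below. It assumes, purely for illustration, that hash directories are named with a fixed-length lowercase hex string, so that other folders in the snapshot are left alone; the function name and the 12-character length are hypothetical.

```python
import re
import shutil
from pathlib import Path
from typing import List

# Assumption: hash directories are named with 12 lowercase hex characters
HASH_DIR = re.compile(r"[0-9a-f]{12}")


def prune_stale_outputs(snapshot_dir: Path, current_hash: str) -> List[str]:
    """Delete hash-named subdirectories other than the current one."""
    removed = []
    for child in sorted(snapshot_dir.iterdir()):
        is_hash_dir = child.is_dir() and HASH_DIR.fullmatch(child.name)
        if is_hash_dir and child.name != current_hash:
            shutil.rmtree(child)
            removed.append(child.name)
    return removed
```

A scheduled job would call this for each snapshot directory with the hash of the currently deployed dependencies.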
> Even if a dependency was updated for something unrelated to the HTML snapshot, we will end up needlessly invalidating caches for every project on the Hub.
True, but the only solution to that I can think of is to manually update the Encoda version or hash when we think it is needed and that sort of defeats the purpose.
My comment in this ticket might be of some relevance here.
Not having a persistent snapshot would lead to published content being changed (fixed/added/changed as a result of bug fixes etc.), which in some cases would be helpful, but in others would not. So I think we would still need a mechanism to capture an ERA as it is at the time of QC by eLife staff.
Currently, when a snapshot is created we generate an `index.html` file from the "main" file in the project and store it in the snapshot. This `index.html` file is produced by a particular version of Encoda and is pinned to particular versions of `@stencila/thema` and `@stencila/components`. That is good from a reliability and page load speed perspective (no redirects to the latest version of those packages). However, it means that each time that Encoda's HTML encoding changes (e.g. a bug fix, or new meta data is added), or there is an upgrade to Thema or the components, it is necessary to create a new snapshot.

An alternative approach would be to:
- at time of snapshot, generate an `index.json` which contains the main file converted to JSON (with execution outputs embedded in it)
- generate a hash that represents the current versions of all the dependencies involved in generating and presenting content e.g.
- when a request is made for `index.html` (or other formats, e.g. "Download as Word") check for a file in the subfolder with the name of the hash e.g. `content/<project>/<snapshot>/<hash>/index.html` and serve that; if it does not exist then create it and serve it.

This approach would have the advantage that snapshots would not need to be regenerated just because there was a change in the dependencies, and updates to dependencies would automatically be reflected in the content served to users.
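The alternative approach can be sketched as follows. This is an illustration under assumptions, not the Hub's implementation: `dependency_hash`, `get_or_create`, and the `render` callback are hypothetical names, the dependency versions are passed in as a plain dictionary, and truncating the digest to 12 characters is an arbitrary choice.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict


def dependency_hash(versions: Dict[str, str]) -> str:
    """Hash the versions of the dependencies that generate and present content."""
    canonical = json.dumps(versions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def get_or_create(
    snapshot_dir: Path,
    versions: Dict[str, str],
    filename: str,
    render: Callable[[dict], str],
) -> Path:
    """Return <snapshot>/<hash>/<filename>, generating it from index.json if absent."""
    cached = snapshot_dir / dependency_hash(versions) / filename
    if not cached.exists():
        # The snapshot stores only the JSON; other formats are derived on demand
        node = json.loads((snapshot_dir / "index.json").read_text(encoding="utf-8"))
        cached.parent.mkdir(parents=True, exist_ok=True)
        cached.write_text(render(node), encoding="utf-8")
    return cached
```

A request handler would call `get_or_create` with the versions currently deployed; the first request after a dependency upgrade pays the conversion cost and later requests are served from the new hash directory.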