iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

Add GitHub Action to Generate a WARC of Hosted Site #66

Open ikreymer opened 4 years ago

ikreymer commented 4 years ago

Inspired by @anjackson's tweet, here's a github action that will generate a WARC of the github pages site after every commit to master. I figured having a WARC for every commit of the WARC specification might be a good test case for this idea!

This PR adds an action that builds the site via Jekyll, generates a WARC using warcit, and uploads it as an artifact to GitHub, like this: https://github.com/ikreymer/warc-specifications/actions/runs/170823529 (Note that due to a GitHub limitation, artifacts are always zipped, so the WARC file ends up inside a zip file. This can't be changed for now.)

This PR also adds:

(The GitHub API call that lists active issues and, of course, the active issues themselves are not included, which might be a nice future extension...)

ibnesayeed commented 4 years ago

It is worth noting that these artifacts will not be preserved forever. They expire automatically after 90 days. There might also be a disk quota associated with them. If we had external storage where these WARCs could be pushed as the next step after the artifacts are built, that would be great. Also, it would be better to add a timestamp to the filename.

ikreymer commented 4 years ago

Yeah, timestamp is a good idea.. Maybe there should be a separate workflow for turning the artifacts into releases, which would be permanent.. perhaps on a version change?

ibnesayeed commented 4 years ago

Maybe there should be a separate workflow for turning the artifacts into releases, which would be permanent.. perhaps on a version change?

I don't think this repository is tagged/versioned, but if we plan to do that every now and then after major changes, uploading workflow artifacts as release artifacts would be a good idea.

On the other hand, one can always recreate a WARC file of a prior state by checking the code out at a specific repo state, building the site, and running warcit on it.
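A minimal sketch of that manual route, just to make the three steps concrete. It only assembles the commands and does not run them. The function name is hypothetical, `_site` is Jekyll's default output directory, and the `bundle exec` step assumes a Gemfile exists at the chosen commit, which may not hold for older states of this repo.

```python
from typing import List


def rebuild_commands(commit: str, url_prefix: str,
                     site_dir: str = "_site") -> List[List[str]]:
    """Return, in order, the commands to rebuild a WARC for a past commit.

    Hypothetical helper: nothing is executed here, the caller decides
    how (and whether) to run each command.
    """
    return [
        # 1. Check the code out at the requested repo state.
        ["git", "checkout", commit],
        # 2. Build the site (assumes a Gemfile is present at that commit).
        ["bundle", "exec", "jekyll", "build"],
        # 3. Run warcit on the built directory: warcit <url-prefix> <dir>.
        ["warcit", url_prefix, site_dir],
    ]


if __name__ == "__main__":
    # "abc123" is a placeholder commit id, not a real one.
    for cmd in rebuild_commands(
        "abc123", "http://iipc.github.io/warc-specifications/"
    ):
        print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` to actually execute the step.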

ikreymer commented 4 years ago

Latest update uses [user]-[repo]-[timestamp].warc.gz as the filename: https://github.com/ikreymer/warc-specifications/actions/runs/170913379
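For illustration, here is one way such a name could be assembled. This is a sketch, not the code used in the PR: the function name is hypothetical, and the 14-digit UTC timestamp format is an assumption, since the thread does not say which format the workflow uses. The `owner/repo` string has the same shape as the `GITHUB_REPOSITORY` variable that GitHub Actions sets.

```python
from datetime import datetime, timezone


def warc_filename(repository: str, when: datetime) -> str:
    """Build '[user]-[repo]-[timestamp].warc.gz' from an 'owner/repo' string.

    The timestamp format (YYYYMMDDhhmmss in UTC) is an assumption.
    """
    user, repo = repository.split("/", 1)
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{user}-{repo}-{stamp}.warc.gz"


if __name__ == "__main__":
    print(warc_filename("ikreymer/warc-specifications",
                        datetime.now(timezone.utc)))
```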

ikreymer commented 4 years ago

This, unfortunately, will not work on prior commit states, as the Gemfile will be missing. However, if we do not mandate inclusion of a Gemfile in the repo and instead install all the Ruby dependencies inline in this workflow file, then it should work on historical versions as well.

I suppose you can check if Gemfile exists and, if not, create it on the fly.. I was thinking of this as a prototype for a more generic workflow that could be added to any repo, including non-Jekyll static sites. So probably it should check: 1) if there is no _config.yml, then it's not a Jekyll site, so just warcit the repo root 2) if there is a _config.yml but no Gemfile, try adding a default one and then building with Jekyll 3) if a Gemfile exists, just run Jekyll.
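The three checks above can be sketched roughly as follows. The function name and the returned labels are hypothetical, chosen only for illustration; a real workflow would map each label to its own build steps.

```python
import tempfile
from pathlib import Path


def choose_build_plan(repo_root: str) -> str:
    """Pick a build plan from the files present in the repo root.

    The returned labels are made up for this sketch.
    """
    root = Path(repo_root)
    if not (root / "_config.yml").is_file():
        # 1) Not a Jekyll site: archive the repo root as-is.
        return "warcit-root"
    if not (root / "Gemfile").is_file():
        # 2) Jekyll site without a Gemfile: add a default one, then build.
        return "add-default-gemfile"
    # 3) Jekyll site with its own Gemfile: just build.
    return "jekyll"


if __name__ == "__main__":
    # Demonstrate on a throwaway directory that only has a _config.yml.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "_config.yml").write_text("title: demo\n")
        print(choose_build_plan(tmp))
```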

There may be more variations too, like if the gh pages root is in the docs directory. Or maybe that should be a future PR/improvement.

ibnesayeed commented 4 years ago

I suppose you can check if Gemfile exists and, if not, create it on the fly..

If we were to do that, then it would be simpler to not rely on a Gemfile at all and instead install all the packages necessary to replicate the default GH Pages builder.

I was thinking of this as a prototype for a more generic workflow that could be added to any repo, including non-Jekyll static sites.

In that case you should be able to ask users to provide input variables to identify which category their site falls under while having a more sensible and common default. There are a handful of reusable actions to host static sites on GH Pages, built from many different static site generators.
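As a rough sketch of what such an input with a default might look like from the action's side: the input name `site_type`, the two categories, and the choice of `jekyll` as the default are all assumptions made for illustration. The sketch also assumes the input reaches the action as an `INPUT_SITE_TYPE` environment variable, which is how Docker-based actions receive inputs as far as I know.

```python
import os
from typing import Mapping, Optional

# Hypothetical categories; a real action would likely support more generators.
ALLOWED_SITE_TYPES = ("jekyll", "static")
DEFAULT_SITE_TYPE = "jekyll"  # assumed default, since GH Pages builds with Jekyll


def read_site_type(env: Optional[Mapping[str, str]] = None) -> str:
    """Read the (hypothetical) site_type input, falling back to the default."""
    env = os.environ if env is None else env
    value = env.get("INPUT_SITE_TYPE", "").strip().lower() or DEFAULT_SITE_TYPE
    if value not in ALLOWED_SITE_TYPES:
        raise ValueError(
            f"unsupported site_type {value!r}; expected one of {ALLOWED_SITE_TYPES}"
        )
    return value


if __name__ == "__main__":
    print(read_site_type())
```

Users who do nothing get the common case, and only those with a different kind of site need to set the input.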