Pipeline optimizations - GitHub issues

danopolan commented 1 month ago

With increasing development efforts, we are getting close to our limit of 3000 CI minutes per month (last month we used almost 80%), so I would like to optimize the pipelines to save some minutes.

This is a brainstorming issue to collect the ideas. It's not urgent for implementation.

Ideas:

1. By default, build only PDF files for commits.
2. Build all other files (EPUB, HTML, DOCX) only if enabled in metadata.yml under separate options like docx-output: false, which default to false.
3. Optimize execution times where possible (e.g. LibreOffice installation).
4. Employ more checks before the LaTeX build to prevent failures (MD syntax, references and links, Unicode chars) or suggest how to run some checks on a local machine.
5. Prepare a guide on running builds locally to prevent failed CI.

Witiko commented 1 week ago

By default build only PDF files for commits

Build all other files (EPUB, HTML, DOCX) if enabled in metadata.yml under separate options like docx-output: false and default to be false.

DOCX should not be a bottleneck, as the conversion to DOCX finishes in a couple of seconds, whereas the other steps can take minutes.

We currently don't build EPUB or HTML for ⟨document⟩.tex if a file named ⟨document⟩/NO_HTML exists in the repository. However, this is an opt-out mechanism, which is also quite obscure and unknown to most people except myself. Having an opt-in metadata field seems a better and more visible solution.

Optimize execution times where possible (e.g. LibreOffice installation)

It might make sense to create pre-built Docker images in this repository, which would include LibreOffice and would then be downloaded during CI. This image can also be significantly smaller than the image that we currently use.

Employ more checks before LaTeX build to prevent failures (MD syntax, references and links, Unicode chars) or suggest how to run some checks on a local machine.

Prepare a guide on running builds locally to prevent failed CI

There is a limit to how complex the code in the GitHub Actions YAML file can be before it becomes difficult to maintain. Extracting the CI code into scripts should raise this limit considerably and allow us to 1) perform more advanced checks on the source code, 2) react to values in metadata.yml from the CI (such as docx-output: false), and 3) run builds locally.
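As a rough illustration of point 2), a CI script could read the opt-in fields and decide which formats to build. This is only a sketch under assumptions: the helper names parse_flat_metadata and formats_to_build are hypothetical, the parser handles only flat top-level key: value lines (a real script would use a proper YAML parser), and PDF is always built while the other formats default to false, as proposed in the original ideas.

```python
# Hypothetical helper: decide which output formats a CI script should build.
OPTIONAL_FORMATS = ("docx", "epub", "html")


def parse_flat_metadata(text: str) -> dict[str, str]:
    """Parse top-level `key: value` lines; a stand-in for a real YAML parser."""
    fields = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].rstrip()  # drop trailing comments
        # Skip blank lines, nested (indented) lines, and lines without a key.
        if not line or line[0].isspace() or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def formats_to_build(metadata_text: str) -> list[str]:
    """PDF is always built; other formats are opt-in and default to false."""
    fields = parse_flat_metadata(metadata_text)
    formats = ["pdf"]
    for fmt in OPTIONAL_FORMATS:
        if fields.get(f"{fmt}-output", "false").lower() == "true":
            formats.append(fmt)
    return formats
```

Because the decision lives in a plain function rather than in the workflow YAML, the same code could be called both from CI and from a local build.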

Ad 1) As discussed in https://github.com/istqborg/istqb_shared_documents/issues/65, few tools enable the static analysis of Markdown files. However, I can write scripts that would collect all Markdown documents used in ⟨document⟩.tex, convert them to abstract syntax trees with Pandoc, and then ask questions such as:

If <#section:⟨identifier⟩> or [⟨link text⟩](#section:⟨identifier⟩) appears in a document, is there a corresponding section with attribute #⟨identifier⟩ in any document?

I can then skip the compilation if I find issues with the document.
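A minimal sketch of such a check is shown below. Note that it swaps the Pandoc abstract syntax tree for plain regular expressions over the Markdown text, purely to keep the example self-contained; the function name find_dangling_references, the two patterns, and the assumption that section attributes are written as {#⟨identifier⟩} at the end of a heading line are all illustrative, not the actual implementation.

```python
import re

# References written as <#section:id> or [text](#section:id).
REFERENCE = re.compile(r"(?:<|\]\()#section:([\w-]+)(?:>|\))")
# Assumed heading attribute syntax, e.g. `# Title {#id}`.
HEADING_ID = re.compile(r"^#{1,6}\s.*\{#([\w-]+)[^}]*\}\s*$", re.MULTILINE)


def find_dangling_references(documents: dict[str, str]) -> list[tuple[str, str]]:
    """Return (document name, identifier) pairs with no matching section.

    `documents` maps a file name to its Markdown text. A reference is
    satisfied if a heading with that identifier exists in *any* document.
    """
    defined = set()
    for text in documents.values():
        defined.update(HEADING_ID.findall(text))

    dangling = []
    for name, text in documents.items():
        for identifier in REFERENCE.findall(text):
            if identifier not in defined:
                dangling.append((name, identifier))
    return dangling
```

A CI script could fail early, before the LaTeX build starts, whenever the returned list is non-empty.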

Witiko commented 1 week ago

Build all other files (EPUB, HTML, DOCX) if enabled in metadata.yml under separate options like docx-output: false and default to be false.

While having many metadata fields like docx-output, epub-output, html-output, and line-numbers (from #54) makes it easy to configure the build for authors, it may still make sense to have sensible defaults based on whether the document is under review or released, as discussed in https://github.com/istqborg/istqb_product_base/issues/54#issuecomment-2154510982. For example:

If version: release, then set the following defaults:

docx-output: false
epub-output: true
html-output: true
line-numbers: false

Otherwise, set the following defaults:

docx-output: true
epub-output: false
html-output: false
line-numbers: true
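The defaulting rule above can be sketched as a small function. It assumes the metadata has already been parsed into a dictionary with boolean values; the names RELEASE_DEFAULTS, REVIEW_DEFAULTS, and resolve_options are hypothetical. Explicitly set fields always win over the version-based defaults.

```python
# Defaults taken from the two lists above.
RELEASE_DEFAULTS = {
    "docx-output": False,
    "epub-output": True,
    "html-output": True,
    "line-numbers": False,
}
REVIEW_DEFAULTS = {
    "docx-output": True,
    "epub-output": False,
    "html-output": False,
    "line-numbers": True,
}


def resolve_options(metadata: dict) -> dict:
    """Fill in unset build options based on the `version` field."""
    if metadata.get("version") == "release":
        defaults = RELEASE_DEFAULTS
    else:
        defaults = REVIEW_DEFAULTS
    # A value given explicitly in the metadata overrides the default.
    return {key: metadata.get(key, default) for key, default in defaults.items()}
```

For example, a document under review that sets html-output: true would still get an HTML build, while keeping the review defaults for the remaining fields.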

As an aside, we may want to add an extra section to the documentation that would describe the supported metadata fields, how they should be used, and how they impact the document. The schema keeps growing and it no longer seems intuitive. In the long term, we may also want to describe the other types of YAML documents such as language and question definitions.

danopolan commented 1 week ago

Regarding 1) and 2): You are right that DOCX is not a bottleneck, but it is not needed for regular output. Still, we can keep building it all the time for now. Adding some abstraction above output formats and line numbers is cleaner but less intuitive for users. This project is not reflected in ISTQB working processes yet, so I would like to keep full control over the users and add the abstraction later, after we decide on detailed processes.

One more thing: could we skip building files that we do not touch within the branch (e.g. Body of Knowledge, Accreditation Guidelines, Sample Exam)? In the TA, we have split the Syllabus and Sample Exam into two separate branches and PRs, so we need only the syllabus to be built in the syllabus branch and only the exam in the exam branch. Currently, however, we build everything, since the templates and the repos created from them contain all of these documents.

Regarding 3) Docker image in this repo is a good idea.

Regarding 4) and 5) Refactoring CI into scripts is a good idea with many benefits.
Static analysis should be discussed in a broader scope, since we want to add specific checks for ISTQB rules before building. We should agree on a solution that would allow this as well.

Witiko commented 1 week ago

One more thing is that if we could skip building of files, we do not touch (e.g. Body of Knowledge, Accreditation Guidelines, Sample Exam) within the branch.

Since building all files still seems useful for the main branch, perhaps we can use a different logic for CI triggered from a pull request and only build documents that have changed in the PR.
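One possible shape for this logic is sketched below. It assumes that each document consists of ⟨document⟩.tex plus a ⟨document⟩/ directory at the repository root, and that the list of changed paths comes from something like git diff --name-only against the base branch. The function name documents_to_build is hypothetical, and the rule that any other changed file (template, CI script, and so on) triggers a full rebuild is a deliberately conservative assumption.

```python
from pathlib import PurePosixPath


def documents_to_build(changed_paths: list[str], documents: list[str]) -> list[str]:
    """Select the documents affected by the files changed in a pull request."""
    selected = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        if not parts:
            continue
        top = parts[0]
        if len(parts) == 1 and top.endswith(".tex") and top[:-4] in documents:
            # The main file <document>.tex changed.
            selected.add(top[:-4])
        elif len(parts) > 1 and top in documents:
            # A file inside the <document>/ directory changed.
            selected.add(top)
        else:
            # Shared file: it may affect every document, so rebuild all.
            return sorted(documents)
    return sorted(selected)
```

On the main branch, the CI could skip this function and keep building every document.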

istqborg / istqb_product_base

Pipeline optimizations #49