Caching Proposal & Ideas

Over the last few weeks I've been looking at the issue of caching from different angles.

I've been thinking that caching will mainly have to deal with resources: ie, validating the resources, extracting their key-value pairs from the comments, and caching both the pairs and the HTML resume card produced from them.

It seems to me that it would make sense to hide the caching system inside the module that will deal with resources. Also, I think that any operation on a resource file should silently run all checks and extractions for the resource, and cache the result. This would be handled behind the scenes by the module, and the user shouldn't have to worry about it, but simply use the module functions as required.

Here is an example of how this could work...

Suppose the resources module (res::) exposed different procedures, something like:

res::Check( resfile.s ) (returns true/false)
res::ParseHeader( resfile.s ) (returns a list of key-val string pairs)
res::CreateCard( resfile.s ) (returns the HTML card)

Even though each of the above procedures has a different aim and returns a different result, the module should carry out the same basic cache operations for each of them:

Check if resfile is already cached:
- IS CACHED:
  - return expected result from cached copy
- IS NOT CACHED:
  - run all checks
  - auto-fix resource problems that can be handled automatically? (eg: strip away settings at end of a PB source or inc file)
  - extract key-vals list
  - create HTML card
  - save all results to cached file
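
A rough sketch of this flow, written in Python rather than PureBasic just to keep it short. Everything in it is an assumption made for illustration: the `; key: value` header format, the placeholder check, the in-memory dictionary standing in for the cache file, and the function names (which only mirror the res:: procedures above). Change detection and the on-disk layout are left out here.

```python
import html
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ResRecord:
    ok: bool      # did the resource pass all checks?
    pairs: dict   # key-value pairs extracted from the header comments
    card: str     # HTML resume card built from the pairs


_cache = {}           # resource path -> ResRecord (stand-in for the cached file)
stats = {"runs": 0}   # counts full analyses, only to make cache hits visible


def _analyze(resfile):
    """Run all checks and extractions in one pass (toy rules, assumed)."""
    stats["runs"] += 1
    pairs = {}
    for line in Path(resfile).read_text(encoding="utf-8").splitlines():
        # Assumed header format: "; key: value" comment lines.
        if line.startswith(";") and ":" in line:
            key, value = line[1:].split(":", 1)
            pairs[key.strip()] = value.strip()
    ok = bool(pairs)  # placeholder check: at least one header key was found
    rows = "".join(
        f"<tr><th>{html.escape(k)}</th><td>{html.escape(v)}</td></tr>"
        for k, v in pairs.items()
    )
    return ResRecord(ok, pairs, f"<table>{rows}</table>")


def _record(resfile):
    """Return the cached record, building and storing it on a cache miss."""
    key = str(resfile)
    if key not in _cache:
        _cache[key] = _analyze(resfile)
    return _cache[key]


# The three public procedures differ only in which part of the record they return.
def check(resfile):
    return _record(resfile).ok


def parse_header(resfile):
    return list(_record(resfile).pairs.items())


def create_card(resfile):
    return _record(resfile).card
```

Calling `check`, `parse_header` and `create_card` on the same file runs the analysis only once; the second and third calls are answered from the stored record.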

This approach means that caching is never a separate operation but is integrated in every single resource access. Also, it means that the user won't have to worry about carrying out integrity checks on a resource before using it, since these will always be implicitly handled by the cache system.

As for the caching implementation, I think that using the MD5 algorithm to detect whether a file has changed should be good enough. The cache should work as follows:

- Save cache files in .cache folder (git-ignored).
- Each module that needs caching will use an independent cache subfolder (because it will delete cache folders created by older versions of itself).
- Save resources cache in a subfolder that has the res module version number in its name (eg: res_1.0.2).
- The res module will delete any cache subfolders created by older versions of itself (because the criteria of its operations might have changed, and therefore also the results could be different even if the res file is the same). This is a safer approach, even if it means that every time the res module is updated it will have to rebuild its cache.
- A resource file is cached ONLY if it passes all integrity checks. When the res module doesn't find a cached file for a given resource, it means that either it wasn't yet examined (a new resource) or that it failed tests the last time it was checked.

Note that the last point also implies that if a cached file exists for every resource then it means that all resources are good to go — which provides a very fast way of checking if the project is ready for updating the website pages.
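
The cache rules above could look roughly like this, again as a Python sketch and not as the actual res module. The assumptions here are mine: cache entries are JSON files named after the MD5 digest of the resource contents, `RES_VERSION` and all function names are hypothetical, and the cleanup removes every res_* folder other than the current one (so it does not distinguish older from newer versions).

```python
import hashlib
import json
import shutil
from pathlib import Path

RES_VERSION = "1.0.2"  # hypothetical version, matching the res_1.0.2 example


def file_digest(resfile):
    """MD5 of the file contents: any edit to the file gives a different digest."""
    return hashlib.md5(Path(resfile).read_bytes()).hexdigest()


def open_cache(cache_root, module="res", version=RES_VERSION):
    """Create <cache_root>/<module>_<version> and drop the module's other folders."""
    current = Path(cache_root) / f"{module}_{version}"
    current.mkdir(parents=True, exist_ok=True)
    for folder in Path(cache_root).glob(f"{module}_*"):
        if folder.is_dir() and folder != current:
            # Results produced by another version of the module are not trusted.
            shutil.rmtree(folder)
    return current


def entry_path(cache_dir, resfile):
    """Cache entry for a resource, named after the digest of its contents."""
    return Path(cache_dir) / f"{file_digest(resfile)}.json"


def is_cached(cache_dir, resfile):
    return entry_path(cache_dir, resfile).exists()


def store(cache_dir, resfile, results):
    """Save the results; meant to be called only for resources that passed."""
    entry_path(cache_dir, resfile).write_text(json.dumps(results), encoding="utf-8")


def all_cached(cache_dir, resfiles):
    """True when every resource has an entry, ie every one passed its checks."""
    return all(is_cached(cache_dir, r) for r in resfiles)
```

With this layout, editing a resource changes its digest, so the old entry is simply never found again, and `all_cached` is the quick "ready to update the website" test mentioned above. Stale entries for edited files would still need pruning at some point, which this sketch does not do.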

As an example, let's say that we call res::CreateCard( resfile.s ), and that resfile has correct header comments but was saved with compiler settings at the end of the source. In this case the procedure will return a well-formed HTML card string, but the resource will not be cached because it failed the integrity check for the presence of source settings. The user doesn't know this (and doesn't need to), but it shows how the cache works in the background.
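
To make this example concrete, here is a small Python sketch of a card procedure that always returns the card but only stores it when the resource is clean. The `; IDE Options =` marker for the settings block, the `; Title:` header line, the card markup and the dictionary used as a cache are all assumptions for illustration; the lookup of existing entries is left out.

```python
import html

SETTINGS_MARKER = "; IDE Options ="  # assumed marker of the appended settings block

cache = {}  # resource name -> HTML card (stand-in for the cache folder)


def has_ide_settings(source):
    """Integrity check: does the source still carry a settings block?"""
    return any(line.startswith(SETTINGS_MARKER) for line in source.splitlines())


def extract_title(source):
    """Pull the title out of an assumed '; Title: ...' header comment."""
    prefix = "; Title:"
    for line in source.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return ""


def create_card(name, source):
    """Return the HTML card; cache it only if every check passed."""
    title = extract_title(source)
    card = f'<div class="card">{html.escape(title)}</div>'
    passed = bool(title) and not has_ide_settings(source)
    if passed:
        cache[name] = card
    return card  # the caller gets the card either way
```

A source with a valid header plus a settings block yields the same card as its clean counterpart, but only the clean one ends up in `cache`.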

Because any resource operation via the res module implicitly carries out all tests and extractions for that resource and then caches the results, the cache is gradually built in the background. Running checks on all resources should therefore never be too slow, since many of them will already be cached thanks to earlier operations.

I've also thought about what else could be cached, but I think that resources are the only bottleneck here. HTML pages are tricky to cache because a change in the HTML module, the pandoc template, the global YAML settings file, or even the renaming of a Category folder could affect the final HTML result. The HTML page creation process has too many factors to keep track of, so a cache for it might not be worth having: it could end up slowing the process down instead of speeding it up. After all, the HTML resume cards make up the bigger part of any page, and these would already be cached by the resources module.

In any case, if later on we need (or would like) to cache some other aspects of the workflow, I think that each module should have its own cache subfolder, and adopt a version number to prevent using caches created with older versions of the code.

What do you think of this approach to caching?

tajmone / PBCodeArcProto

Caching Proposal & Ideas #18