tajmone / PBCodeArcProto

PB CodeArchiv Rebirth Indexer Prototype

Caching Proposal & Ideas #18

Open tajmone opened 6 years ago

tajmone commented 6 years ago

Over the last few weeks I've been looking at the issue of caching from different angles.

I've been thinking that caching will mainly have to deal with resources, i.e. validating the resources, extracting their key-value pairs from the comments, and caching both the pairs and the HTML resume card produced from them.

It seems to me that it would make sense to hide the caching system inside the module that will deal with resources. Also, I think that any operation on a resource file should silently run all checks and extractions for the resource, and cache the result. This would be handled behind the scenes inside the module, and the user shouldn't have to worry about it, but simply use the module functions as required.

Here is an example of how this could work...

Suppose the resources module (res::) exposed different procedures, something like:

Regardless of the fact that each of the above procedures has a different aim and returns different results, the module should carry out some basic cache operations for each of them:

This approach means that caching is never a separate operation but is integrated in every single resource access. Also, it means that the user won't have to worry about carrying out integrity checks on a resource before using it, since these will always be implicitly handled by the cache system.

As for the caching implementation, I think that using the MD5 algorithm to detect whether a file has changed should be good enough. The cache should work as follows:

Note that the last point also implies that if a cached file exists for every resource then it means that all resources are good to go — which provides a very fast way of checking if the project is ready for updating the website pages.
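To make the MD5 idea more concrete, here is a minimal sketch of change detection plus the "is everything cached?" check. It is written in Python purely for illustration (the project itself is PureBasic), and every name in it (`md5_of`, `cache_entry`, `store_entry`, `is_cached`, `all_resources_ready`) as well as the one-file-per-resource layout is an assumption, not part of the actual proposal.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def cache_entry(cache_dir: Path, resource: Path) -> Path:
    # Hypothetical layout: one cache file per resource, named after the
    # resource file. A real layout would also need to handle equal file
    # names in different folders.
    return cache_dir / (resource.name + ".cache")


def store_entry(cache_dir: Path, resource: Path, payload: str) -> None:
    """Write a cache file: the resource's MD5 on line 1, then the payload."""
    text = md5_of(resource) + "\n" + payload
    cache_entry(cache_dir, resource).write_text(text, encoding="utf-8")


def is_cached(cache_dir: Path, resource: Path) -> bool:
    """True if a cache file exists and its stored MD5 matches the resource."""
    entry = cache_entry(cache_dir, resource)
    if not entry.is_file():
        return False
    lines = entry.read_text(encoding="utf-8").splitlines()
    return bool(lines) and lines[0] == md5_of(resource)


def all_resources_ready(cache_dir: Path, resources) -> bool:
    """Fast project-wide check: every resource has an up-to-date cache file."""
    return all(is_cached(cache_dir, r) for r in resources)
```

In this sketch a cache file only counts if its stored MD5 still matches the resource, so a resource edited after being cached is treated as not ready until it is processed again.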

As an example, let's say that we use res::CreateCard( resfile.s ), and that resfile has correct header comments but was saved with compiler settings at the end of the source. In this case, the procedure will return a well-formed HTML card string, but the resource will not be cached because it failed the integrity check for the presence of source settings. The user doesn't know this (and neither needs to nor cares), but this is an example of how the cache works in the background.
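The behaviour in this example could look roughly like the sketch below, again in Python only as an illustration. The `; key: value` header format, the `; IDE Options` marker used to detect settings appended to the source, the HTML layout of the card, and the in-memory `cache` dictionary are all assumptions made for the sake of the example; `create_card` only stands in for res::CreateCard.

```python
import html

# Assumed marker for settings appended at the end of a source file.
SETTINGS_MARKER = "; IDE Options"


def parse_header(source: str) -> dict:
    """Collect 'key: value' pairs from the leading block of ';' comments."""
    pairs = {}
    for line in source.splitlines():
        if not line.startswith(";"):
            break  # header ends at the first non-comment line
        key, sep, value = line[1:].partition(":")
        if sep and key.strip():
            pairs[key.strip()] = value.strip()
    return pairs


def create_card(source: str, cache: dict, name: str) -> str:
    """Always return the HTML card; cache it only if all checks pass."""
    pairs = parse_header(source)
    rows = "".join(
        f"<tr><th>{html.escape(k)}</th><td>{html.escape(v)}</td></tr>"
        for k, v in pairs.items()
    )
    card = f"<table>{rows}</table>"
    has_settings = any(
        line.startswith(SETTINGS_MARKER) for line in source.splitlines()
    )
    # Integrity checks: a non-empty header and no trailing settings block.
    if pairs and not has_settings:
        cache[name] = card
    return card
```

The caller gets the same card string in both cases; the only difference is whether an entry ends up in the cache.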

The fact that any type of resource operation via the res module will implicitly carry out all tests and extractions for that resource, and then cache the results, also means that the cache is gradually built in the background. Running checks on all resources should therefore never be too slow, since many of them should already be cached by earlier operations.

I've also thought about what else could be cached, but I think that resources are the only bottleneck here. HTML pages are tricky to cache because a change in the HTML module, the pandoc template, the global YAML settings file, or even a renamed Category folder could potentially affect the final HTML result. With the HTML page creation process there are too many factors to keep in mind, so it might not be worth having a cache for that, as it might end up slowing down the process instead of improving it. After all, the HTML resume cards are the bigger part of any page, and these would already be cached by the resources module.

In any case, if later on we need or want to cache some other aspects of the workflow, I think that each module should have its own cache subfolder, and adopt a version number to prevent using caches created with older versions of the code.
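One possible way to handle the per-module subfolder and version number, sketched in Python for illustration only: the `VERSION` stamp file, the `module_cache_dir` name and the choice to wipe an outdated cache are assumptions, not something already decided.

```python
import shutil
from pathlib import Path


def module_cache_dir(root: Path, module: str, version: int) -> Path:
    """Return the module's cache folder, discarding it if the version changed."""
    cache_dir = root / module
    stamp = cache_dir / "VERSION"  # hypothetical stamp file
    expected = str(version)
    if cache_dir.is_dir():
        current = None
        if stamp.is_file():
            current = stamp.read_text(encoding="utf-8").strip()
        if current != expected:
            # Cache was written by a different code version: start over.
            shutil.rmtree(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    stamp.write_text(expected, encoding="utf-8")
    return cache_dir
```

Each module would call this once at startup with its own name and cache version, so bumping the number in the code is enough to invalidate everything that module cached before.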

What do you think of this approach to caching?

SicroAtGit commented 5 years ago

Sounds good to me. I agree with your reasoning.