macrome-js / macrome

The in-tree build system
MIT License

Caches not rebuilt during git operations #27

Open conartist6 opened 3 years ago

conartist6 commented 3 years ago

Wow, it took me a long time to consider this problem :(

If our goal is to avoid a rebuild when a git checkout occurs, our algorithm can't rely on any hidden cache data, because that data would not be rebuilt. Right now the most important use of hidden cache data is knowing how to delete files -- we essentially cache a graph of which files were written by which generators, so that when a source disappears we can delete the files that were written on its behalf. For example, say we have a config which takes *.orig.js and generates *.gen.js. Checking out a commit creates foo.orig.js and the generated foo.gen.js, but no changeset is created for foo.orig.js. In this state, when foo.orig.js is later deleted, foo.gen.js remains because it is not clear how to delete it.
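
To make the failure concrete, here's a rough sketch of that kind of write-graph cache (names like writeGraph are hypothetical, not macrome's actual internals):

```ts
// Hypothetical sketch of the hidden write-graph cache: a map from each
// source path to the set of paths generators wrote on its behalf.
const writeGraph = new Map<string, Set<string>>();

function recordWrite(sourcePath: string, generatedPath: string): void {
  let outputs = writeGraph.get(sourcePath);
  if (!outputs) writeGraph.set(sourcePath, (outputs = new Set()));
  outputs.add(generatedPath);
}

// A source's generated files can only be cleaned up if the graph has an
// entry for it. A checkout that bypasses the watcher leaves foo.orig.js
// absent from the graph, so foo.gen.js is orphaned.
function onSourceDeleted(sourcePath: string, unlink: (path: string) => void): void {
  const outputs = writeGraph.get(sourcePath);
  if (!outputs) return; // the failure mode described above: nothing to delete
  for (const generatedPath of outputs) unlink(generatedPath);
  writeGraph.delete(sourcePath);
}
```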

Our original design for generators was not subject to this problem because it made deletion the responsibility of the generator itself. The generator would simply process the removal of foo.orig.js as a change and would be required to remove foo.gen.js in response. Presumably generator authors would be offered some assistance to make sure they get this right.
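
A minimal sketch of what such a generator might have looked like, assuming a single change-processing hook (the shapes here are illustrative, not the real v1 API):

```ts
// Hypothetical v1-style generator: deletion is the generator's own job.
// The hook receives every change, including removals, and must mirror them.
interface Change {
  op: 'add' | 'update' | 'remove';
  path: string; // e.g. 'foo.orig.js'
}

interface Api {
  write(path: string, content: string): Promise<void>;
  unlink(path: string): Promise<void>;
}

class OrigToGenGenerator {
  async process(change: Change, api: Api): Promise<void> {
    const genPath = change.path.replace(/\.orig\.js$/, '.gen.js');
    if (change.op === 'remove') {
      // The generator is required to clean up its own output.
      await api.unlink(genPath);
    } else {
      await api.write(genPath, `/* generated from ${change.path} */`);
    }
  }
}
```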

conartist6 commented 3 years ago

Hidden cache data is also used in the reduce hook of generators: the cache records the result of each gen.map() call so that it can be an input to gen.reduce(). Thus, for example, index generators would tend to render stale imports in the index after a checkout, e.g. importing files which no longer exist or failing to import new files. This would happen on the first rebuild of the index on the new branch.
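
For illustration, a sketch of how an index generator's reduce hook consumes those cached map results (the hook signatures here are assumptions, not macrome's exact API):

```ts
// Illustrative map/reduce index generator. The host caches the value each
// map() returns; reduce() then sees the cached values for every file.
interface IndexGenerator {
  map(file: { path: string }): { exportPath: string };
  reduce(mapResults: Map<string, { exportPath: string }>): string;
}

const indexGenerator: IndexGenerator = {
  map(file) {
    return { exportPath: file.path.replace(/\.js$/, '') };
  },
  reduce(mapResults) {
    // If the cache of map results was not rebuilt after a checkout, this
    // still reflects the old branch: imports of deleted files linger and
    // newly added files are missing.
    return [...mapResults.values()]
      .map(({ exportPath }) => `export * from './${exportPath}';`)
      .join('\n');
  },
};
```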

conartist6 commented 3 years ago

Here's a quick review of where I've been on this. I started (in iter-tools/generate) with a design along the lines described above; I'll call it v1, though these version numbers aren't reflected anywhere else. The main problem with v1 was that it was a base class into which a lot of assumptions and functionality were baked. This created poor separation of concerns: for example, I want my error handling mechanism to work even if a generator overrides the base logic or was written for a slightly different version of macrome. In the v1 API macrome was responsible for instantiating generators, which allowed it to inject itself into the generator.
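
Roughly, the v1 shape was something like this (a sketch with guessed names; the actual base class differed):

```ts
// Hypothetical v1 shape: a base class into which host behavior is baked.
// Because the host constructs the generator itself, it can inject its own
// services into the instance.
interface MacromeLike {
  write(path: string, content: string): Promise<void>;
  reportError(err: Error): void;
}

abstract class GeneratorBase {
  constructor(protected macrome: MacromeLike) {}

  // Base logic a subclass could override, defeating the host's
  // error-handling guarantees -- the separation-of-concerns problem.
  async emit(path: string, content: string): Promise<void> {
    try {
      await this.macrome.write(path, content);
    } catch (err) {
      this.macrome.reportError(err as Error);
    }
  }
}
```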

v2 of the generator API dropped the class-based structure. Generators are now only required to be objects; if they have a map or a reduce hook, it will be called. Because generators need only be objects, they can also be class instances, which allows the user to write a generator as a class and then instantiate it with options, or even more than once with different sets of options. v2 has the cache problems described in this issue.
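
A sketch of the v2 contract, assuming hypothetical hook shapes:

```ts
// Sketch of the v2 contract: a generator is just an object; map and reduce
// hooks are called only if present.
interface GeneratorV2 {
  map?(file: { path: string }): unknown;
  reduce?(mapResults: Map<string, unknown>): unknown;
}

// A class works too, so the same generator can be instantiated more than
// once with different option sets.
class IndexGeneratorV2 implements GeneratorV2 {
  constructor(private options: { extension: string }) {}

  map(file: { path: string }) {
    return file.path.endsWith(this.options.extension) ? { path: file.path } : null;
  }
}

const generators: GeneratorV2[] = [
  new IndexGeneratorV2({ extension: '.js' }),
  new IndexGeneratorV2({ extension: '.ts' }),
];
```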

v3 is the API that will be created to fix this issue. It must combine the best aspects of v1 and v2.

conartist6 commented 3 years ago

Another note: I must weigh redesigning the API against just rebuilding the caches. It is possible to simply defer VCS changes instead of dropping them. I would only need to read the headers of changed files, and at worst this would be like a full scan, which I do on startup anyway. The existence of generatedfrom headers should make it possible to rebuild the caches.
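
A sketch of what that header scan could look like, assuming a comment-style generatedfrom header (the real header format may differ):

```ts
import { createReadStream } from 'fs';
import { createInterface } from 'readline';

// Sketch of cache rebuilding by scanning headers. Only the first few lines
// of each changed file need to be read, so the worst case is comparable to
// the full scan already done on startup.
async function readGeneratedFrom(path: string): Promise<string | null> {
  const rl = createInterface({ input: createReadStream(path) });
  for await (const line of rl) {
    const match = /@generatedfrom\s+(\S+)/.exec(line);
    if (match) {
      rl.close();
      return match[1]; // the source path this file was generated from
    }
    // Heuristic: stop once we're past the leading comment block.
    if (!line.startsWith('/*') && !line.startsWith(' *')) break;
  }
  rl.close();
  return null; // not a generated file
}
```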

Plus I'm already thinking about situations in which I need to defer VCS changes, e.g. because ephemeral generators need to be able to rebuild their files.

conartist6 commented 3 years ago

Another thought: what happens when a generator's implementation performs some dynamic logic and decides whether or not to generate a particular output file? Let's say on a first run it decides to create foo.optional.js from foo.js, but then foo.js changes in a way that causes the dynamic logic to flip, and the next run of the generator does not output foo.optional.js. At this point we want to understand that foo.optional.js is stale and should be deleted. Having always-up-to-date metadata will ensure that we can create and maintain an actual change graph, and that as we traverse the graph generating output we can also diff against the prior state and use the result to remove stale files.
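
The diff itself is simple set subtraction; a sketch using the foo.optional.js example above:

```ts
// Anything built previously but not rebuilt by the current run is stale.
function staleOutputs(before: Set<string>, after: Set<string>): string[] {
  return [...before].filter((path) => !after.has(path));
}

// Example: foo.optional.js was produced last run but not this one.
const previous = new Set(['foo.gen.js', 'foo.optional.js']);
const current = new Set(['foo.gen.js']);
staleOutputs(previous, current); // => ['foo.optional.js']
```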

conartist6 commented 3 years ago

After thinking about this a while, I'm pretty sure the right thing to do is to make this the responsibility of changesets. Changesets associate one non-generated file with all the generated files that get made on its behalf. What we want to do is keep multiple copies of that state -- one from before some watcher change and one from after running our generators. Then we can diff the two and figure out whether anything built in the filesystem should be removed.
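
A sketch of the two-copies-of-state idea, with hypothetical shapes:

```ts
// Hypothetical changeset keeping two copies of its state: a snapshot from
// before the watcher change and the outputs of the current generator run.
class Changeset {
  before: Set<string> = new Set(); // generated paths on record pre-change
  after: Set<string> = new Set();  // generated paths produced by this run

  beginChange(current: Set<string>): void {
    this.before = new Set(current); // snapshot prior state
    this.after = new Set();
  }

  recordOutput(path: string): void {
    this.after.add(path);
  }

  // Built previously but not rebuilt now: safe to remove.
  stale(): string[] {
    return [...this.before].filter((p) => !this.after.has(p));
  }
}
```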

conartist6 commented 2 years ago

> After thinking about this a while, I'm pretty sure the right thing to do is to make this the responsibility of changesets.

I was most right on the 13th. Changesets were eliminated because to have a graph you need a single map that lets you look up your nodes, and treating the problem as a graph traversal is highly useful for catching corner cases. Previously each changeset had a queue and some state built into it. Now there is a single macrome.state map and a single macrome.queue, which sequences work and caches file state from the graph so that a previous state is always present to diff against.
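
A sketch of the single-map design, with guessed shapes for the state and queue:

```ts
// Guessed shapes: one state map holds every node of the graph, keyed by
// path for lookup, and one queue sequences work against it.
interface FileNode {
  generatedPaths: Set<string>;     // current outputs for this source
  prevGeneratedPaths: Set<string>; // snapshot kept so a diff is always possible
}

class Macrome {
  state = new Map<string, FileNode>(); // path -> node, for graph traversal
  queue: Array<() => Promise<void>> = [];

  enqueue(task: () => Promise<void>): void {
    this.queue.push(task);
  }

  async drain(): Promise<void> {
    // Work is sequenced: each task sees the state left by the previous one.
    for (const task of this.queue.splice(0)) await task();
  }
}
```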