pauleveritt / fdom

Template engine based on tag strings and functional ideas.

Functional dependencies #4

jimbaker opened this issue 1 year ago

jimbaker commented 1 year ago

Consider an example loosely based on Wikipedia:

Page:

Internal links:

External links:

With respect to FDOM, we have two phases:

  1. Compute incrementally with respect to the DAG

    • Pages are computed functionally - they always correspond to a DAG, based on versioned data. So when defining the interpolations in Python, maybe we could have something like @versioned_cache, built from a factory that captures how a given piece of data is versioned, plus the metadata needed to refresh it from its source (see the sketch after this list).
  2. Compute patches

    • Internal links do not form a DAG, because they will often have cycles. However, it is possible to hoist this computation out of a specific page, and instead treat it as operating over a set of pages (assets more generally) that support patching after a dependent page is touched. We use the internal link graph to determine which pages should be patched; but we don't do a traversal (DFS or BFS).
    • External links can be ingested by a spider to determine consistency, triggering similar patches.
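
A rough, hypothetical sketch of the @versioned_cache idea from item 1 (the decorator name, arguments, and helper functions are all made up, not fdom API):

```python
# Hypothetical sketch: a versioned cache built from a factory that knows how a
# value is versioned and how to refresh it from its source. Not fdom API.
import functools

def versioned_cache(version_fn, refresh_fn):
    """Cache a pure interpolation keyed by (args, version of the underlying data)."""
    def decorator(fn):
        cache = {}

        @functools.wraps(fn)
        def wrapper(*args):
            version = version_fn(*args)   # e.g. a content hash, mtime, or revision id
            key = (args, version)
            if key not in cache:
                refresh_fn(*args)         # pull fresh source data if needed
                cache[key] = fn(*args)
            return cache[key]

        return wrapper
    return decorator

# Example wiring (the version/refresh functions here are stubs):
@versioned_cache(version_fn=lambda slug: 0, refresh_fn=lambda slug: None)
def page_body(slug):
    return f"<article>{slug}</article>"
```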

Note that this means we don't support some sort of iterative computation where we continuously flow through these cycles, possibly computing a fixed point like PageRank. That's a good thing - such analysis can certainly be done on the site, and it might even be a useful hoisted function, but it's not done directly by FDOM itself in its own computation.

Note that we can determine if FDOM always forms a DAG on every update by using the transitive closure, as provided by the closure module in SQLite.

jimbaker commented 1 year ago

Note that we can determine if FDOM always forms a DAG on every update by using the transitive closure, as provided by the closure module in SQLite.

Unfortunately this isn't possible, because the closure extension only supports a single parent. Incremental algorithms exist for this problem, but not anything I have seen for SQLite.

There are algorithms for maintaining an incremental topological ordering, including this implementation in Rust: https://docs.rs/incremental-topo/latest/incremental_topo/

Alternatively, we can just detect cycles while processing a topological sort, which is O(n) and easy to do with a recursive CTE in SQLite. This is probably fine for now.
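
For example, something like this (a minimal sketch; it assumes a hypothetical edges(parent, child) table and uses a simple path-tracking walk rather than the linear-time Kahn variant):

```python
# Minimal cycle check over a hypothetical edges(parent, child) table, using a
# recursive CTE in SQLite. Path-tracking is simple rather than linear-time; a
# real build step would likely emit the topological order with Kahn's algorithm.
import sqlite3

def has_cycle(conn: sqlite3.Connection) -> bool:
    row = conn.execute(
        """
        WITH RECURSIVE walk(node, path, cycle) AS (
            SELECT parent, ',' || parent || ',', 0 FROM edges
            UNION ALL
            SELECT e.child,
                   walk.path || e.child || ',',
                   walk.path LIKE '%,' || e.child || ',%'
            FROM edges AS e
            JOIN walk ON e.parent = walk.node
            WHERE walk.cycle = 0   -- stop extending a path once a repeat is seen
        )
        SELECT EXISTS (SELECT 1 FROM walk WHERE cycle)
        """
    ).fetchone()
    return bool(row[0])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (parent TEXT, child TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [("a", "b"), ("b", "c"), ("c", "a")])
print(has_cycle(conn))  # True: a -> b -> c -> a
```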

pauleveritt commented 1 year ago

Fun topic! One small point: I think the problem of cycles is solvable if you link to a target in the resource, not the resource itself.

pauleveritt commented 1 year ago

Next, I wonder if we need to store link/reference information in a separate table. I wrote this up a little in my science fiction.

Imagine docA links to docB with this Markdown: as we see in [this doc](./docB)

Our FDOM parses the Markdown into a VDOM-ish thing: ["as we see in", Link(ref="/folder1/doc2", attr="title", value="Doc2 Title")]

This node in the VDOM has the link information as well as the most recent value. If the title on /folder1/doc2 changes, we just do a SQL query -- that is, a query on a computed column that's indexed -- to find any Link with that ref. If the value is different... etc.

If it can be moved to a query (even a recursive query), we don't have to store relationship state in a separate place.
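
Roughly, that indexed computed column could look like this in SQLite (table and column names are made up, and it assumes a SQLite build with JSON functions and generated columns, i.e. 3.31+):

```python
# Sketch only: store VDOM nodes as JSON in a hypothetical nodes(page, node)
# table, expose the Link ref as an indexed generated column, and query it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (
        page TEXT NOT NULL,
        node TEXT NOT NULL,   -- JSON for the VDOM node
        ref  TEXT GENERATED ALWAYS AS (json_extract(node, '$.ref')) VIRTUAL
    );
    CREATE INDEX nodes_ref ON nodes(ref);
    """
)
conn.execute(
    "INSERT INTO nodes (page, node) VALUES (?, ?)",
    ("/docA",
     '{"type": "Link", "ref": "/folder1/doc2", "attr": "title", "value": "Doc2 Title"}'),
)

# The title on /folder1/doc2 changed: find every page holding a possibly stale Link.
stale = conn.execute(
    "SELECT page, json_extract(node, '$.value') FROM nodes WHERE ref = ?",
    ("/folder1/doc2",),
).fetchall()
print(stale)  # [('/docA', 'Doc2 Title')]
```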

jimbaker commented 1 year ago

Fun topic! One small point: I think the problem of cycles is solvable if you link to a target in the resource, not the resource itself.

Right, this should be true. If I have a pure chain of function calls (so the FDOM model), I cannot build a cyclic graph. (Indeed it's only a DAG because we can and should allow for caching.) So we really should only see those cycles in phase 2 computations (internal/external links), but we have the patch solution for that.

jimbaker commented 1 year ago

Next, I wonder if we need to store link/reference information in a separate table. I wrote this up a little in my science fiction.

Imagine docA links to docB with this Markdown: as we see in [this doc](./docB)

Our FDOM parses the Markdown into a VDOM-ish thing: ["as we see in", Link(ref="/folder1/doc2", attr="title", value="Doc2 Title")]

This node in the VDOM has the link information as well as the most recent value. If the title on /folder1/doc2 changes, we just do a SQL query -- that is, a query on a computed column that's indexed -- to find any Link with that ref. If the value is different... etc.

My implicit thinking here is that the link generation could be some fairly arbitrary function that expands into the template. So we would want to capture that: in this page, we generated all of these links; now let's do something with that. That is easier to do, and it reflects that we have build products (these links, pages, etc.) and what drives them (the VDOM), both stored in the database.

If it can be moved to a query (even a recursive query), we don't have to store relationship state in a separate place.

However, it would be relatively cheap to do so. And it's still an easier query: for this updated page, find everything I need to patch. Or vice versa: for these external links, here is what we need to feed our spider to confirm "no link rot", and, if there is rot, which pages to update.
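
Roughly, something like this (table and column names are made up):

```python
# Sketch of a separate link table, with the two queries described above:
# pages to patch after a target changes, and the external links for the spider.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE links (
        source   TEXT NOT NULL,   -- page that rendered the link
        target   TEXT NOT NULL,   -- internal path or external URL
        external INTEGER NOT NULL DEFAULT 0
    );
    CREATE INDEX links_target ON links(target, external);
    """
)
conn.executemany(
    "INSERT INTO links (source, target, external) VALUES (?, ?, ?)",
    [
        ("/docA", "/folder1/doc2", 0),
        ("/docB", "/folder1/doc2", 0),
        ("/docA", "https://example.com/spec", 1),
    ],
)

# "For this updated page, find everything I need to patch."
to_patch = conn.execute(
    "SELECT DISTINCT source FROM links WHERE target = ? AND external = 0 ORDER BY source",
    ("/folder1/doc2",),
).fetchall()
print(to_patch)  # [('/docA',), ('/docB',)]

# "For these external links, this is what we need to feed our spider."
feed = conn.execute(
    "SELECT target, group_concat(source) FROM links WHERE external = 1 GROUP BY target"
).fetchall()
print(feed)  # [('https://example.com/spec', '/docA')]
```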

I found this discussion on N+1 queries being just fine in SQLite quite interesting, and relevant to what we are discussing: https://www.sqlite.org/np1queryprob.html

pauleveritt commented 1 year ago

Last comment first... very interesting point about N+1 and that article. It's indeed been a worry for me. I've hedged, thinking "if worse comes to worst, keep a pre-computation around like the git index file, or write a virtual table in Rust." But those options... suck.

Nice to hear we might go a LONG way without heroic measures. Obviously, we haven't even thought about a good indexing strategy yet.

pauleveritt commented 1 year ago
  1. "Very importantly, interpolations can have templates (tag strings) nested in them." I have my take on that (related to component boundaries for "fine grained reactivity" and smaller rebuilds. But perhaps you mean something else?
  2. "Incorporate other assets." This is where pre-rendering performance goes to die. Giving a vehicle for better thinking on this might be the second-biggest impact.
  3. For links, the Sphinx feature where you can get the link text from the target's title is (still) really unique and really useful. It's also a performance/consistency killer. 😀 It makes a good use case for this architecture.
  4. "Links...Should maintain internal consistency." I've long been intrigued on this. How close can one get to almost reaching referential integrity and a guarantee? A pre-rendering system with actual transactions and a build model designed around triggers could get most of the way there. (Existing systems rely on logic to go look for problems, rather than baking this into the state model.)
  5. "External links". Another good point and place for optimization. Side note: There's a variation of this with Intersphinx. A system which is "outside", but uses same state design.
  6. "Compute incrementally". If I may, change that to "multiply incrementally." If we're willing to have multiple phases that an artifact goes through, and we keep intermediate results around, the surface area of a patch could be really small.
  7. I wonder if the build phase thing I describe in the science fiction might be at odds with the functional approach, meaning, a better alternative.
  8. "like @versioned_cache". If the interpolations call to a component or service, those things would be wrapped with @component or @decorator. That's where registration, injection, and dependency tracking happen. So it could also be the place where the persistence/caching happens.
  9. "Internal links do not form a DAG". That's exciting about patching.
  10. "External links ... triggering similar patches." That's a fantastic idea. I wonder if the external process (the spider) could simply make external data look like "part of the system", under all the same consistency rules. This would apply to an inter sphinx also.
jimbaker commented 1 year ago
  1. "Very importantly, interpolations can have templates (tag strings) nested in them." I have my take on that (related to component boundaries for "fine grained reactivity" and smaller rebuilds. But perhaps you mean something else?

Yes, such recursive interpolations give us components.

  1. "Incorporate other assets." This is where pre-rendering performance goes to die. Giving a vehicle for better thinking on this might be the second-biggest impact.

As the next generation, we have to solve the problems of the previous generation tool; or make them irrelevant.

  3. For links, the Sphinx feature where you can get the link text from the target's title is (still) really unique and really useful. It's also a performance/consistency killer. 😀 It makes a good use case for this architecture.

I haven't looked at how Wikipedia actually does this (pre-render, or on-the-fly with some sort of client JS), but regardless it's very doable with this approach. I basically know nothing about Sphinx, other than reading the various sites created with it, so you will have to guide me here.

  1. "Links...Should maintain internal consistency." I've long been intrigued on this. How close can one get to almost reaching referential integrity and a guarantee?

I feel like my knowledge is completely dated here. But working on first principles:

  1. blue/green (or possibly multiple versions). Using a load balancer/DNS/whatever switching scheme, a client only sees the blue site, then the green site. rsync or similar exploits the fact that deltas are relatively small here, and only makes incremental changes to the blue, then to the green. Then flip the switch.

  2. I understand that CDNs provide more sophisticated support for this, including variations on A/B, but I need to investigate further what they can do and how to use it. Regardless, if they can be targeted for a consistent build, we can do that, since we know exactly what changed.

A pre-rendering system with actual transactions and a build model designed around triggers could get most of the way there. (Existing systems rely on logic to go look for problems, rather than baking this into the state model.)

It helps that this is an original design goal. This is just a build step that gates other steps. And yes, we have complete reproducibility from source files; and we always work within transactions. Yay, SQLite!
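
For instance, a batch of patches can be applied all-or-nothing (a tiny sketch; the pages table is hypothetical):

```python
# Sketch of a build step gated by a transaction: apply a batch of patches in one
# SQLite transaction so a failed step leaves nothing half-applied.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (path TEXT PRIMARY KEY, html TEXT)")

def apply_patches(conn, patches):
    """Apply (path, new_html) patches all-or-nothing."""
    with conn:  # commits on success, rolls back on any exception
        conn.executemany(
            "INSERT INTO pages(path, html) VALUES(?, ?) "
            "ON CONFLICT(path) DO UPDATE SET html = excluded.html",
            patches,
        )

apply_patches(conn, [("/docA", "<p>patched</p>")])
```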

  1. "External links". Another good point and place for optimization. Side note: There's a variation of this with Intersphinx. A system which is "outside", but uses same state design.

👍

  1. "Compute incrementally". If I may, change that to "multiply incrementally." If we're willing to have multiple phases that an artifact goes through, and we keep intermediate results around, the surface area of a patch could be really small.

Sounds reasonable. We don't have to limit the phases.

  7. I wonder if the build phase thing I describe in the science fiction might be at odds with the functional approach, meaning, a better alternative.

Maybe?

  1. "like @versioned_cache". If the interpolations call to a component or service, those things would be wrapped with @component or @decorator. That's where registration, injection, and dependency tracking happen. So it could also be the place where the persistence/caching happens.

Basically, decorators give us ways to inject more metadata and before/after behavior on functions. Decorators with args, or decorators that are built in some interesting way, give us more of that. They can definitely be woven together - @component and @versioned_cache can and should work together with some good semantics on their composition.
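
For example, the two could simply stack (both decorators here are placeholders, not fdom API; the point is only the composition seam):

```python
# Sketch of composing @component and @versioned_cache on one function.
import functools

registry = {}

def component(fn):
    """Pretend registration/injection hook: remember the callable by name."""
    registry[fn.__name__] = fn
    return fn

def versioned_cache(version_fn):
    """Simplified stand-in for the @versioned_cache sketch earlier in the thread."""
    def decorator(fn):
        cache = {}

        @functools.wraps(fn)
        def wrapper(*args):
            key = (args, version_fn(*args))
            if key not in cache:
                cache[key] = fn(*args)
            return cache[key]

        return wrapper
    return decorator

# Composition: caching wraps the raw function; registration sees the cached
# wrapper, so dependency tracking and persistence share one seam.
@component
@versioned_cache(version_fn=lambda slug: 0)   # a real version_fn would hash the source
def page_title(slug):
    return f"Title of {slug}"

print(registry["page_title"]("/folder1/doc2"))  # Title of /folder1/doc2
```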

  1. "Internal links do not form a DAG". That's exciting about patching.

Agreed. Even applies to anchors.

  1. "External links ... triggering similar patches." That's a fantastic idea. I wonder if the external process (the spider) could simply make external data look like "part of the system", under all the same consistency rules. This would apply to an inter sphinx also.

We capture a state of the external world with respect to the spider. It can grow stale, but at least we updated with respect to that state. Repeat on a regular basis.

Real example: web sites such as the Washington Post (I believe this happened to them) will have external links whose destinations go out of business or, worse, get bought/taken over by nefarious actors. The spider approach minimizes this inconsistency with the previous world.

A trusted third party may aggregate some of this, so changes to outdated/spammy/malicious links can be propagated to customers efficiently. This is like Google Chrome's own registry, but one that potentially knows why those links were there to begin with.