Open jimbaker opened 1 year ago
Note that we can determine if FDOM always forms a DAG on every update by using the transitive closure, as provided by the closure module in SQLite.
Unfortunately this isn't possible, because the closure extension only supports a single parent. Incremental algorithms exist for this problem, but not anything I have seen for SQLite.
There are algorithms for maintaining an incremental topological ordering, including this implementation in Rust: https://docs.rs/incremental-topo/latest/incremental_topo/
Alternatively, we can just detect cycles when processing a topological sort, which is O(n) and easy to do with a recursive CTE in SQLite. That is probably fine for now.
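A minimal sketch of that cycle check, assuming a hypothetical `links(src, dst)` edge table (the table name and schema are illustrative): compute reachable pairs with a recursive CTE and ask whether any node reaches itself.

```python
import sqlite3

# Cycle detection with a recursive CTE, against a hypothetical
# links(src, dst) edge table. UNION (not UNION ALL) deduplicates rows,
# so the recursion terminates even when the graph contains a cycle.
CYCLE_QUERY = """
WITH RECURSIVE reach(start, node) AS (
    SELECT src, dst FROM links
    UNION
    SELECT reach.start, links.dst
    FROM reach JOIN links ON links.src = reach.node
)
SELECT EXISTS (SELECT 1 FROM reach WHERE start = node)
"""

def has_cycle(conn: sqlite3.Connection) -> bool:
    return bool(conn.execute(CYCLE_QUERY).fetchone()[0])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO links VALUES (?, ?)",
                 [("a", "b"), ("b", "c")])
acyclic = has_cycle(conn)   # a -> b -> c: no cycle yet
conn.execute("INSERT INTO links VALUES ('c', 'a')")
cyclic = has_cycle(conn)    # c -> a closes the loop
print(acyclic, cyclic)
```

Strictly speaking the full closure is closer to O(V·E) than O(n) in the worst case, but for typical link graphs it stays cheap.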
Fun topic! One small point: I think the problem of cycles is solvable if you link to a target in the resource, not the resource.
Next, I wonder if we need to store link/reference information in a separate table. I wrote this up a little in my science fiction.
Imagine docA links to docB with this Markdown: `as we see in [this doc](./docB)`
Our FDOM parses the Markdown into a VDOM-ish thing: `["as we see in", Link(ref="/folder1/doc2", attr="title", value="Doc2 Title")]`
This node in the VDOM has the link information as well as the most recent value. If the `title` on `/folder1/doc2` changes, we just do a SQL query -- that is, a query on a computed column that's indexed -- to find any `Link` with that `ref`. If the value is different... etc.
If it can be moved to a query, and a recursive query, we don't have to store relationship state in a separate place.
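As a sketch of that indexed computed column (schema and names here are made up; generated columns need SQLite 3.31+ and the JSON1 functions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical storage for VDOM nodes: the serialized node lives in a JSON
# column, and `ref` is a generated column extracted from it. Indexing the
# generated column makes "find every Link with this ref" one index lookup.
conn.executescript("""
CREATE TABLE nodes (
    doc  TEXT,
    node TEXT,
    ref  TEXT GENERATED ALWAYS AS (json_extract(node, '$.ref'))
);
CREATE INDEX nodes_ref ON nodes(ref);
""")
conn.execute(
    "INSERT INTO nodes (doc, node) VALUES (?, ?)",
    ("docA",
     '{"type": "Link", "ref": "/folder1/doc2", '
     '"attr": "title", "value": "Doc2 Title"}'))

# The title on /folder1/doc2 changed: which docs hold a stale Link?
stale = conn.execute(
    "SELECT doc FROM nodes WHERE ref = ?", ("/folder1/doc2",)).fetchall()
print(stale)
```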
> Fun topic! One small point: I think the problem of cycles is solvable if you link to a target in the resource, not the resource.
Right, this should be true. If I have a pure chain of function calls (so the FDOM model), I cannot build a cyclic graph. (Indeed it's only a DAG because we can and should allow for caching.) So we really should only see those cycles in phase 2 computations (internal/external links), but we have the patch solution for that.
> Next, I wonder if we need to store link/reference information in a separate table. I wrote this up a little in my science fiction.
> Imagine docA links to docB with this Markdown: `as we see in [this doc](./docB)`
>
> Our FDOM parses the Markdown into a VDOM-ish thing: `["as we see in", Link(ref="/folder1/doc2", attr="title", value="Doc2 Title")]`
>
> This node in the VDOM has the link information as well as the most recent value. If the `title` on `/folder1/doc2` changes, we just do a SQL query -- that is, a query on a computed column that's indexed -- to find any `Link` with that `ref`. If the value is different... etc.
My implicit thinking here is that the link generation could be some fairly arbitrary function that expands into the template. So we would want to capture that: in this page, we generated all of these links; now let's do something with that. That's easier to do, and reflects that we have build products (these links, pages, etc.) and what drives them (the VDOM), both stored in the database.
> If it can be moved to a query, and a recursive query, we don't have to store relationship state in a separate place.
However, it would be relatively cheap to do so. And it's still an easier query: for this updated page, find everything I need to patch. Or vice versa: for these external links, this is what we need to feed our spider to confirm "no link rot", and if there is rot, which pages to update.
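For the record-keeping variant, a small sketch (table and column names are assumptions) of a relationship table that serves both directions of the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical relationship table: one row per link, recording the page it
# appears on and the target it points at, with external links flagged.
conn.executescript("""
CREATE TABLE links (
    page     TEXT,     -- page the link appears on
    target   TEXT,     -- internal path or external URL
    external INTEGER   -- 1 if the target is off-site
);
CREATE INDEX links_target ON links(target);
""")
conn.executemany("INSERT INTO links VALUES (?, ?, ?)", [
    ("docA", "/folder1/doc2", 0),
    ("docB", "/folder1/doc2", 0),
    ("docA", "https://example.com/report", 1),
])

# "For this updated page, find everything I need to patch."
to_patch = [row[0] for row in conn.execute(
    "SELECT page FROM links WHERE target = ?", ("/folder1/doc2",))]

# "For these external links, feed the spider to confirm no link rot."
spider_feed = [row[0] for row in conn.execute(
    "SELECT DISTINCT target FROM links WHERE external = 1")]
print(to_patch, spider_feed)
```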
I found this discussion on N+1 queries being just fine in SQLite quite interesting, and relevant to what we are discussing: https://www.sqlite.org/np1queryprob.html
Last comment first... very interesting point about N+1 and that article. It's indeed been a worry for me. I've hedged, thinking "if worst comes to worst, keep a pre-computation around like the git index file, or write a virtual table in Rust." But those options... suck.
Nice to hear we might go a LONG way without heroic measures. Obviously, we haven't even thought about a good indexing strategy yet.
- "Very importantly, interpolations can have templates (tag strings) nested in them." I have my take on that (related to component boundaries for "fine-grained reactivity" and smaller rebuilds). But perhaps you mean something else?
Yes, such recursive interpolations give us components.
- "Incorporate other assets." This is where pre-rendering performance goes to die. Giving a vehicle for better thinking on this might be the second-biggest impact.
As the next generation, we have to solve the problems of the previous generation tool; or make them irrelevant.
- For links, the Sphinx feature where you can get the link text from the target's title is (still) really unique and really useful. It's also a performance/consistency killer. 😀 It makes a good use case for this architecture.
I haven't looked at how Wikipedia actually does this support (pre-render or on the fly with some sort of client JS), but regardless it's very doable with this approach. I basically know nothing about Sphinx, other than reading the various sites created with it, so you will have to guide me here.
- "Links...Should maintain internal consistency." I've long been intrigued by this. How close can one get to actually reaching referential integrity and a guarantee?
I feel like my knowledge is completely dated here. But working from first principles: blue/green (or possibly multiple versions). Using a load balancer/DNS/whatever switching scheme, a client only sees the blue site, then the green site. `rsync` or similar uses the fact that deltas are relatively small here, and only makes incremental changes to the blue, then to the green. Then flip the switch.
I understand that CDNs provide more sophisticated support for this, including variations on A/B, but I need to investigate further what they can do and how to use them. Regardless, if they can be targeted for a consistent build, we can do that, since we know exactly what changed.
A pre-rendering system with actual transactions and a build model designed around triggers could get most of the way there. (Existing systems rely on logic to go look for problems, rather than baking this into the state model.)
It helps that this is an original design goal. This is just a build step that gates other steps. And yes, we have complete reproducibility from source files, and we always work within transactions. Yay, SQLite!
- "External links". Another good point and place for optimization. Side note: there's a variation of this with Intersphinx - a system which is "outside", but uses the same state design.
👍
- "Compute incrementally". If I may, change that to "multiply incrementally." If we're willing to have multiple phases that an artifact goes through, and we keep intermediate results around, the surface area of a patch could be really small.
Sounds reasonable. We don't have to limit the phases.
- I wonder if the build phase thing I describe in the science fiction might be at odds with the functional approach, meaning, a better alternative.
Maybe?
- "like `@versioned_cache`". If the interpolations call to a component or service, those things would be wrapped with `@component` or `@decorator`. That's where registration, injection, and dependency tracking happen. So it could also be the place where the persistence/caching happens.
Basically, decorators give us ways to inject more metadata and before/after behavior on functions. Decorators with args, or decorators that are built in some interesting way, give us more of that. They can definitely be woven together - `@component` and `@versioned_cache` can and should work together with some good semantics on their composition.
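A toy sketch of that composition (all names here are hypothetical; the real registration/injection machinery would do much more):

```python
import functools

# Hypothetical registry that @component feeds; dependency tracking and
# injection would hang off this same hook.
REGISTRY = {}

def component(fn):
    """Register a function so the system can track and inject it."""
    REGISTRY[fn.__name__] = fn
    return fn

def versioned_cache(version):
    """Memoize results keyed by (version, args); bumping the version
    invalidates every stale entry at once."""
    def decorate(fn):
        cache = {}
        @functools.wraps(fn)
        def wrapper(*args):
            key = (version, args)
            if key not in cache:
                cache[key] = fn(*args)
            return cache[key]
        return wrapper
    return decorate

@component
@versioned_cache(version=1)
def title(path):
    # Stand-in for a real lookup of the target doc's title.
    return f"Title of {path}"

print(title("/folder1/doc2"))
```

Because `versioned_cache` runs first and `functools.wraps` preserves the name, `component` registers the cached wrapper under `title` - one plausible semantics for their composition.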
- "Internal links do not form a DAG". That's exciting about patching.
Agreed. Even applies to anchors.
- "External links ... triggering similar patches." That's a fantastic idea. I wonder if the external process (the spider) could simply make external data look like "part of the system", under all the same consistency rules. This would apply to Intersphinx also.
We capture a state of the external world with respect to the spider. It can grow stale, but at least we updated with respect to that state. Repeat on a regular basis.
Real example: web sites such as the Washington Post (I believe this happened to them) will have external links whose targets go out of business, or worse, get bought/taken over by nefarious actors. The spider approach minimizes this inconsistency with the previous state of the world.
A trusted third party may aggregate some of this, so changes to outdated/spammy/malicious links can be propagated to customers efficiently. This is like Google Chrome's own registry, but it potentially knows why those links were there to begin with.
Consider an example loosely based on Wikipedia:
- Page:
- Internal links:
- External links:
With respect to FDOM, we have two phases:
1. Compute incrementally with respect to the DAG, with `@versioned_cache`, which itself is a function from a factory that could generate "here's how I'm versioned" and corresponding metadata, necessary for any refresh from the source.
2. Compute patches.
Note that this means we don't support some sort of iterative computation where we continuously flow through these cycles, possibly computing a fixed point like PageRank. That's a good thing - such analysis can certainly be done on the site, and it might even be a useful hoisted function, but it's not done directly by FDOM itself in its own computation.