qri-io / qri

you're invited to a data party!
https://qri.io
GNU General Public License v3.0
1.11k stars 66 forks source link

restructure dependencies as dscache comes into being #1144

Open dustmop opened 4 years ago

dustmop commented 4 years ago

We currently have the this dependency chain in our codebase, and it has some issues:

+----------+
| cmd      |
+----------+
    |
    | **** This is a problem, these packages should be siblings
    v
+----------+
| api      |
+----------+
    |
    |
    v
+----------+
| lib      |
+----------+
    |
    |
    v
+----------+
| fsi      |
+----------+
    |
    |
    v
+----------+
| base     |
+----------+
    |
    | **** Also a problem, repo should be higher up
    v
+----------+
| repo     |
+----------+
    |
    |
    v
+----------+
| dscache  |
+----------+

As we move away from repo, and towards dscache, this should be changed to this:

+----------+     +----------+
| cmd      |     | api      |
+----------+     +----------+
    |                |
    |     /----------/
    v    v
+----------+
| lib      |------\
+----------+       |
    |              v
    |            +----------+
    |            | dscache  |<...
    |            +----------+    .
    |                |           .
    v                v           . **** Indirect, via callback
+----------+     +----------+    .
| base     |     | fsi      |    .
+----------+     +----------+   .
    |     \                    .
    |      ---------\         .
    v               v        .
+----------+     +----------+
| dsfs     |     | logbook  |
+----------+     +----------+

The key takeaways from this better dependency graph:

The main motivation for this change is dscache development. A lot of changes lead to circular dependencies that necessitate creating subpackages in dscache. For example: repo knows about dscache, but dscache can also be built from a repo (by combining logbook, refs, and fsi). So we had to move this builder into dscache/build/from_repo.go. Also, when developing qri get using dscache, we want to be able to load datasets from dscache or from fsi within dscache's package, but this leads to a circular dependency (see first chart), so the code has to live instead in dscache/loader/loader.go

b5 commented 4 years ago

We should move fsi under dscache so that it is no longer exposed to lib. This will help simplify our story about handling datasets in repo vs fsi in the same fashion.

Disagree on this. I don't think file system integration is purely the domain of a cache. In #1133, we started to talk about critical path packages organized into an import stack and event driven packages that must be siblings with few/no critical path imports. Full notes here. A lib.Instance should have an internal fsi field.

I totally see the impetus to cut down the complexity of all these subsystems interacting, just not sure that we should solve that by stacking imports.

We might be arriving at a place where subsystems need to use events to communicate between each other. The need for callbacks seems to indicate yes, but then we're stuck back at this "how do you know you're finished" problem.

dustmop commented 4 years ago

Ok, I thought some more about this, and I have a recommendation. First, agreed that FSI is not in the domain of a cache, totally correct. However, I think the current relationship with lib and fsi is a problem. There's a lot of code duplication like this:

        ref, err := base.ToDatasetRef(p.Ref, r.repo, p.UseFSI)
        if err != nil {
            return err
        }
        if p.UseFSI {
            if ref.FSIPath == "" {
                return fsi.ErrNoLink
            }
            if ds, err = fsi.ReadDir(ref.FSIPath); err != nil {
                return fmt.Errorf("loading linked dataset: %s", err)
            }
        } else {
            ds, err = dsfs.LoadDataset(ctx, r.repo.Store(), ref.Path)
            if err != nil {
                return fmt.Errorf("loading dataset: %s", err)
            }
        }

this can be seen in Get in lib/datasets.go and RenderReadme in lib/render.go and Validate in lib/datasets.go. This pattern should be extracted and put into a single location that is responsible for all ref resolution.

Relatedly, while looking at the codepath in base/save.go specifically PrepareDataset I noticed it is calling CanonicalizeDataset in order to get the previousPath. This is bad for two reasons:

Ideally, as we move to dscache, the base package wouldn't have a reference to the dscache at all, in order to match the dependency tree I made above.

Here's the proposal:

We spoke offline a few days ago, and I don't think we should switch to asynchronous events for things like writing to logbook and updating dscache. All of those activities are intrinsically synchronous and should stay that way. We should use callbacks or, even better, interfaces wherever we can to handle these dependency problems.