Untangling `app`, `env`, and the domains

AA-Turner commented 3 weeks ago

Is your feature request related to a problem? Please describe.

The project has several central internal types, Sphinx, BuildEnvironment, Domain (& subclasses), and Project.

Of these, only Project is well-isolated from the rest of the classes. Creating a Sphinx object internally creates a BuildEnvironment object, which in turn creates an instance of all 10 builtin domains. All of this happens within Sphinx.__init__() and creates mutual dependencies between app, env, and each domain. This also happens to a lesser extent with the Builder object.

This design makes it hard to effectively seperate concerns. It makes designing a better parallel system harder, and means that testing/mocking parts of the application is much harder than it could be.

I would like to see if there is a realistic way we can uncouple these mutual dependencies, and provide a migration path for downstream extensions.

Describe the solution you'd like

My current idea is to break BuildEnvironment into three parts:

A dataclass with information about the current document (all that is currently the env.tmp_data dict.
A dataclass with information about the build at large
A type that provides utility functions and contains the current-document and global dataclasses.

I believe this would allow us to pass around the data objects and effectivley construct BuildEnvironment accessor/wrappers on the fly where needed. For example, we can pass Domain.__init__ the domaindata dict as part of the global dataclass, instead of the entire env object as at present.

I am very open to other ideas though, as there may be approaches I haven't considered. Keeping backwards compatibility and providing a migration path for downstream users will be key, but I think should be doable with clever use of __getattr__ etc.

Describe alternatives you've considered

Do nothing. The easiest option, but leaves us in a less-than ideal situation.

cc: @sphinx-doc/developers

A

electric-coder commented 3 weeks ago

dataclass

A somewhat convoluted datatype that IMO isn't easy to work with (too many implicit rules). It might be easier to just go with a regular class, even if it's not minimal it should be easier to extend and understand.

and creates mutual dependencies

First step, IMO, start drawing UMLs of the status quo and begin thinking about gradual changes from there. (Since the application is large PlantUML might not cut it, a first step might be agreeing on an adequate modeling tool everyone can work with.)

AA-Turner commented 3 weeks ago

'dataclass' in this instance is just an abstract class that holds data, it doesn't need to be dataclasses.dataclass.

Do we need to be as formal as UML diagrams? Sketching the core interactions (and abstracting the details) is fairly straightforward.

Sphinx.__init__ sets self.env to the result of BuildEnvironment.__init__, passing itself as self
BuildEnvironment.__init__ sets self.app, creating a mutual dependency, and sets self.domains which contains each of the 10 domains, which each call Domain.__init__, passing the environment object as self.
Domain.__init__ sets self.env, creating another mutual dependency.
All of this happens whilst initialising the Sphinx object, effectively making the entire application one big reference cycle.

This means that domains can access config values by doing self.env.app.config.spam or similar, which is very useful—what I want to do is untangle the creation of these objects.

A

P.S. apologies for any mistakes, I've written this on mobile.

timhoffm commented 3 weeks ago

Since these objects are public to extensions, we need to be careful with the refactoring. But that's solvable, likely one has to introduce (pending?) deprecations on undesired attributes so that extensions have the change to migrate away before we change the API.

A related topic is that extensions tend to store state by dynamically adding attributes to the above objects. We should have a cleaner way to handle state of extensions. It may be desirable or necessary to do this as part of the untangling.

@electric-coder I agree that visualizations are often helpful. You are welcome to draw a diagram of the current relevant relationships between the above mentioned classes. We don't need full diagrams of the full application and for now a one-time draw is sufficient to illustrate the current state, therefore, you may use whichever tool you like.

electric-coder commented 3 weeks ago

Do we need to be as formal as UML diagrams?

I'd say getting the class diagrams is always worth it and the extra investment pays off immediately. There are practical problems however: tools that reverse the code to produce an adequate UML are generally proprietary. An alternative is using Pyreverse but producing a good layout is an NP hard problem so if the code has many classes the diagram it's likely to yield might be nearly unreadable. At one point you may want to do some manual layout but not every tool (like PlantUML) supports that. Thus an additional problem is created of maintaining UMLs.

effectively making the entire application one big reference cycle.

Designing an application to be a nice DAG generally has me keeping UMLs. Maybe because I'm getting older I've given up on trying to solve dependencies relying on memory alone.

Do we need to be as formal as UML diagrams?

So, returning to the original question, for me it's not a matter of formality but of cognitive necessity.

chrisjsewell commented 3 weeks ago

Do we need to be as formal as UML diagrams?

yep I'm definitely +1 for diagramming the proposed before/after

It doesn't have to be anything formal, just sketch it out in powerpoint or whatever (I have literally done that before 😉 https://github.com/aiidateam/aiida-core/pull/5330)

chrisjsewell commented 3 weeks ago

A related topic is that extensions tend to store state by dynamically adding attributes to the above objects. We should have a cleaner way to handle state of extensions.

yeh this is definitely somethat that should be done first; at present, unless you are creating a domain, there really is no obvious API for storing data

timhoffm commented 3 weeks ago

It doesn't have to be anything formal, just sketch it out in powerpoint or whatever

drawio is quite convenient for manual drawing.

AA-Turner commented 3 weeks ago

An initial attempt:

Current

sphinx-uml-current

Proposed

sphinx-uml-proposed

UML (current)

```plantuml @startuml skin rose hide class empty members class Sphinx { config: Config env: BuildEnvironment project: Project __init__(*args, **kwds): None } class Project { } class Config { config_values: dict[str, _Opt] __init__(): None } class BuildEnvironment { app: Sphinx config: Config docname: str domaindata: dict[str, dict[str, Any]] domains: Collection[Domain] project: Project temp_data: dict[str, Any] __init__(app: Sphinx): None setup(app: Sphinx): None } class Domain { data: dict[str, Any] env: BuildEnvironment __init__(env: BuildEnvironment): None } Sphinx *-- Config: config Sphinx *-- BuildEnvironment: env Sphinx *-- Project: project BuildEnvironment o-- Sphinx: app BuildEnvironment *-- Domain: domains Domain o-- BuildEnvironment: env @enduml ```

UML (proposed)

```plantuml @startuml skin rose hide class empty members class Sphinx { config: Config env: BuildEnvironment project: Project __init__(*args, **kwds): None } class Project { } class Config { config_values: dict[str, _Opt] __init__(): None } class BuildEnvironment { app: Sphinx config: Config __init__(app: Sphinx): None setup(app: Sphinx): None } class DomainData { domaindata: dict[str, dict[str, Any]] } class EnvData { domains: Collection[Domain] project: Project } class CurrentDocumentData { docname: str temp_data: dict[str, Any] } class Domain { data: dict[str, Any] env: BuildEnvironment __init__(domain_data: DomainData, env_data: EnvData, current_doc: CurrentDocumentData): None } Sphinx *-- Config: config Sphinx *-- BuildEnvironment: env Sphinx *-- Project: project BuildEnvironment o-- Sphinx: " _app" BuildEnvironment *-- DomainData: domain_data BuildEnvironment *-- EnvData: env_data BuildEnvironment *-- CurrentDocumentData: current_doc EnvData *-- Domain: domains @enduml ```

A

timhoffm commented 3 weeks ago

Thanks for drawing the diagrams. Some remarks from a first look:

In current, BuildEnvironment contains references to config and project. There are no relational arrows drawn for these. Applies equally for the kept BuildEnvironment.config in proposed.
What do you want to achieve with proposed? Could you comment on the reasoning? It's difficult to figure this out from the diagram. For example, the DomainData class seems to pull out essentially one variable, which does not make sense to me. Also, more generally, it feels a bit weird to have BuildEnvironment.domaindata at all, in particular when there is also Domain.data, which AFAICS references the same dicts. BuildEnvironment.domaindata has a comment # domain-specific inventories, here to be pickled. If pickling is the only reason, that exists, IMHO it would be better to handle these things and not carry them around in the datastructure persistently. One could either do this via __setstate__ / __getstate__ (if one wants to rely on plain pickling of just the env; or, one could handle domain data explicitly on the outside, because we control the code that does the pickling/unpickling

electric-coder commented 3 weeks ago

Excellent effort @AA-Turner 👍 it's evident now the problem won't be easy to solve.

This signature is obviously problematic:

Domain.__init__(domain_data: DomainData, env_data: EnvData, current_doc: CurrentDocumentData): None

If each Domain instance holds a reference to the instances of the 3 new classes as an attribute that would count as an empty diamond. Ideally since Domain is a leaf in the instance order it should only be called, it shouldn't need to be calling the containing/composing class EnvData or the other classes that compose BuildEnvironment. So while the 3-split of BuildEnvironment is a step in the right direction the proposed solution doesn't yet seem to solve the circular dependency that's the hardest problem.

What seems to have happened is that "for convenience" Domain received BuildEnvironment and then called it internally to simplify the signatures and take advantage of the OO, but even if no env attribute were held by Domain and it only read a BuildEnvironment instance that would still be an association without a diamond - and still, a mutual functional dependency.

I also suppose you want to split the circular reference between the Sphinx and BuildEnvironment classes at a latter stage. It makes sense to start with the leaves before moving up to the root of the dependencies.

But lets say the maintainers managed to split, normalize, and decouple the whole thing... What kind of an interface would then allow backward compatibility (this remits me back to your initial announcement...) to be superimposed on the new implementation devoid of circular-references (an uncommon exercise for sure!)?

picnixz commented 3 weeks ago

Here are my 2 cents (subject to modifications and maybe ideas)

For DomainData, I fail to understand why it would be passed to the Domain constructor. I think you meant data: dict[str, Any] as an initial domain data? and that DomainData is actually a mapping from domain names to their actual domain's data? And that DomainData would expose:

DomainData.store(domain_name: str, key: str, value: Any): add some data for a given domain, storing it under key.
DomainData.fetch(domain_name: str, key: str, default: Any): fetch domain-specific data
Some other methods perhaps.

Since the domains are created in the Environment's constructor, they also have a reference to the incomplete application (at the moment of the call, the application is still in __init__() and builders are not set up yet!). Now, instead of accessing it via the environment, they can access it if we pass around the application to their setup() method instead. So extensions doing self.env.app in Domain.setup() would simply take app from it. And if they need to keep a reference to the application, they can store it when doing Domain.setup() (which should could as the "constructor finalization" when the environment is ready).

Note that we also need to address how serialization of components would work and if we can reconstruct them in the correct order. If we are trying to de-couple dependencies but in the end, we have something even more complex (and hard to follow), I'm not sure we need the change.

However, what was perhaps not said is that all constructors are actually not really constructors because they assume that the inputs are incomplete. When domains are constructed, they are so in the Environment constructor, which itself is called from the application constructor when the latter is not yet finished (builders are still not setup entirely!). So, instead of relying of a 2-step construction with an additional setup() function that acts as a post-initialization, I think we should use "contexts" instead of incomplete objects and the constructors would pick whatever they need from those contexts. The creation of the context can be done separtely as well and on the fly.

In other words, we may try a builder pattern instead of eager constructors (AFAICT, this is what you want: decouple the creation of each components and not have it in the application's constructor, right?)

A related topic is that extensions tend to store state by dynamically adding attributes to the above objects. We should have a cleaner way to handle state of extensions. It may be desirable or necessary to do this as part of the untangling.

I have a private repo with an interface for adding state to extension. Essentially, it's just a huge dictionary where keys are "tokens" for extensions and you can manage your own storage as you want.

Additional data coming from extensions should be stored in the environment so to save them and pickle them.

electric-coder commented 3 weeks ago

In other words, we may try a builder pattern instead of eager constructors

The right solution.

Essentially, it's just a huge dictionary where keys are "tokens"

I was drawing an UML based on the proposal and since support for Python 3.10 is being dropped the correct solution to get rid of the dependencies might be going for ParamSpec in the constructors to avoid keeping circular references to instances yet formalize the interface whilst having it type checked.

AA-Turner commented 3 weeks ago

In current, BuildEnvironment contains references to config and project. There are no relational arrows drawn for these. Applies equally for the kept BuildEnvironment.config in proposed.

I wasn't really sure how many arrows you're meant to include. I left them out because the project attribute on env is extracted from app, rather than passed separately.

What do you want to achieve with proposed? Could you comment on the reasoning? It's difficult to figure this out from the diagram. For example, the DomainData class seems to pull out essentially one variable, which does not make sense to me.

Remove the mutual dependencies between types' constructors.

app should set things up, manage extension interfaces (add_*()) and run the builder.

env currently plays a triple(?) role:

Holding global constants (eg Project, Config, etc). These shouldn't change by the time that they get to env
Holding mutable global data generated during reading documents (eg all reference targets, all objects, domain data, custom data from an extension)
Holding per-document data (eg the document name)
A number of utility methods

By splitting these into three, we can be more specific in what we need/use at each call-site.

Also, more generally, it feels a bit weird to have BuildEnvironment.domaindata at all, in particular when there is also Domain.data, which AFAICS references the same dicts. BuildEnvironment.domaindata has a comment # domain-specific inventories, here to be pickled.

Each Domain.data dict is a value of env.domaindata, keyed to the domain name. I don't know why this design was originally chosen (simplicity?), but I think it's part of the whole data-store problem.

When doing parallel builds, data from each forked environment object gets merged back together, and it is the only(?) supported location for storing data during reading. Domains are nicer to store things in mainly as the .data attribute abstracts the involvement of env.

One end goal of this work is to restructure our parallel support, but I wanted to split the tasks up as we'll end up having a hopelessly large scope.

If pickling is the only reason, that exists, IMHO it would be better to handle these things and not carry them around in the datastructure persistently. One could either do this via __setstate__ / __getstate__ (if one wants to rely on plain pickling of just the env; or, one could handle domain data explicitly on the outside, because we control the code that does the pickling/unpickling

This would be a good idea to experiment with. We would need to be careful re env-merge-info etc and ensure that if we started storing data on a domain with no reference to the environment that we merged it back in after parallel reading.

If each Domain instance holds a reference to the instances of the 3 new classes as an attribute that would count as an empty diamond. Ideally since Domain is a leaf in the instance order it should only be called, it shouldn't need to be calling the containing/composing class EnvData or the other classes that compose BuildEnvironment. So while the 3-split of BuildEnvironment is a step in the right direction the proposed solution doesn't yet seem to solve the circular dependency that's the hardest problem.

You're right, we'd want to split whatever held config and other environment constants from the container of domains itself.

For DomainData, I fail to understand why it would be passed to the Domain constructor. I think you meant data: dict[str, Any] as an initial domain data? and that DomainData is actually a mapping from domain names to their actual domain's data? And that DomainData would expose:

DomainData.store(domain_name: str, key: str, value: Any): add some data for a given domain, storing it under key.

DomainData.fetch(domain_name: str, key: str, default: Any): fetch domain-specific data

Some other methods perhaps.

This seems better. I removed a lot of detail from the diagrams, and perhaps there was other domain-data attributes to go in that type? But yes each domain should just have a dict[str, Any] initial data.

Since the domains are created in the Environment's constructor, they also have a reference to the incomplete application (at the moment of the call, the application is still in __init__() and builders are not set up yet!). Now, instead of accessing it via the environment, they can access it if we pass around the application to their setup() method instead. So extensions doing self.env.app in Domain.setup() would simply take app from it. And if they need to keep a reference to the application, they can store it when doing Domain.setup() (which should could as the "constructor finalization" when the environment is ready).

This is interesting, though I'd prefer not to need to expose app to the domains. What APIs are missing from env that we could add? (Perhaps for later discussion)

two-stage setup() function

Note that this will be needed for as long as we use pickling, as unpickling an object calls __setstate__ and not __init__

A related topic is that extensions tend to store state by dynamically adding attributes to the above objects. We should have a cleaner way to handle state of extensions. It may be desirable or necessary to do this as part of the untangling.

I have a private repo with an interface for adding state to extension. Essentially, it's just a huge dictionary where keys are "tokens" for extensions and you can manage your own storage as you want.

Additional data coming from extensions should be stored in the environment so to save them and pickle them.

Having a better interface such as this would be good. Currently we recommend in documentation to set an attribute directly, which doesn't integrate well with static typing.

A

timhoffm commented 3 weeks ago

What I'm quite unclear about is how extensions fit in here. My understanding is that they are implicitly spread across the whole architecture: They can add config values, they may store information in the BuildEnvironment, they may hook into app events, they add or modify domains. Are they well enough isolated to not pose a problem to planned refactorings?

electric-coder commented 3 weeks ago

I wasn't really sure how many arrows you're meant to include.

It's optional. You can draw an arrow, or declare an attribute/argument with a type, or both - they're equivalent although with different expressiveness (depends on what conciseness/explicitness you want). I did draw the URL with all the arrows and it was quite a mess :P

So here's a hybrid sketch joining the post_init() and "Builder pattern" that @picnixz mentioned (also to not let @AA-Turner be the lone user that contributes a UML).

Sphinx_builder_pattern

PlantUML source code

```PlantUML @startuml skin rose hide class empty members class Sphinx { config: Config env: BuildEnvironment project: Project __init__(*args, **kwargs) } class Project { } class Config { config_values: dict[str, _Opt] __init__(*args, **kwargs) } class BuildEnvironment { __init__(*args, **kwargs) post_init(???): Self setup(app: Sphinx): None } note top of BuildEnvironment: How to solve the\nsetup(app) dependency? note left of BuildEnvironment: A chain of post_init() calls\nsharing ParamSpec? class DomainData { domaindata: dict[str, dict[str, Any]] post_init(???): Self } class EnvData { domains: Collection[Domain] post_init(???): Self } class CurrentDocumentData { docname: str temp_data: dict[str, Any] post_init(???): Self } class _BuilderPattern1 { __init__(\n config: Config, project: Project, env: BuildEnvironment) post_init(???): tuple[Config, Project, BuildEnvironment] } class Domain { data: dict[str, Any] __init__(*args, **kwargs) post_init(???): Self } Sphinx *-- Config: config Sphinx *-- BuildEnvironment: env Sphinx *-- Project: project BuildEnvironment *-- DomainData: domain_data BuildEnvironment *-- EnvData: env_data BuildEnvironment *-- CurrentDocumentData: current_doc EnvData "1" *-- "1...N" Domain: domains _BuilderPattern1 --> Project _BuilderPattern1 --> BuildEnvironment _BuilderPattern1 --> Config Sphinx +-- _BuilderPattern1 @enduml ```

Are they well enough isolated to not pose a problem to planned refactorings?

Dunno about the internals, but in the UML _BuilderPattern1 could also be a kind of top-level class and I suppose extensions also behave like top-level classes that have visibility over the entire API that's exposed.

picnixz commented 3 weeks ago

This is interesting, though I'd prefer not to need to expose app to the domains. What APIs are missing from env that we could add? (Perhaps for later discussion)

Do you mean things you cannot access as env.thing unless you do env.app.thing? I don't really remember but there shouldn't have much. IIRC, at the time the domain is constructed, the things being available are:

config & project
pre-loaded builders (they are still in an incomplete state at the time of the call I think)
extensions after they've called setup() (so they could also be in an incomplete state if they rely on config-inited)
application output streams (maybe that?)
catalogs (maybe this?)
event manager (accessible at the level of env.emit I think? or do we need env.app.emit?)

I don't have the code in front of me so there are probably stuff I'm missing.

Having a better interface such as this would be good

I can try to extract it and make it public if you want. And make it more robust. Actually, temporary data can just be regarded as a namespace that you carry over. It's essentially the same as a dictionary and autodoc could also benefit of similar stuff. I have contextual data that I need to carry over autodoc so what I essentially do is just creating some kind of "smart namespace".

We had a similar discussion in #12198 concerning the typing of the Config object and how to make it properly extensible.

sphinx-doc / sphinx

Untangling `app`, `env`, and the domains #13072