Open AA-Turner opened 1 week ago
dataclass
A somewhat convoluted datatype that IMO isn't easy to work with (too many implicit rules). It might be easier to just go with a regular class, even if it's not minimal it should be easier to extend and understand.
and creates mutual dependencies
First step, IMO, start drawing UMLs of the status quo and begin thinking about gradual changes from there. (Since the application is large PlantUML might not cut it, a first step might be agreeing on an adequate modeling tool everyone can work with.)
'dataclass' in this instance is just an abstract class that holds data, it doesn't need to be dataclasses.dataclass
.
Do we need to be as formal as UML diagrams? Sketching the core interactions (and abstracting the details) is fairly straightforward.
Sphinx.__init__
sets self.env
to the result of BuildEnvironment.__init__
, passing itself as self
BuildEnvironment.__init__
sets self.app
, creating a mutual dependency, and sets self.domains
which contains each of the 10 domains, which each call Domain.__init__
, passing the environment object as self
.Domain.__init__
sets self.env
, creating another mutual dependency.Sphinx
object, effectively making the entire application one big reference cycle.This means that domains can access config values by doing self.env.app.config.spam
or similar, which is very useful—what I want to do is untangle the creation of these objects.
A
P.S. apologies for any mistakes, I've written this on mobile.
Since these objects are public to extensions, we need to be careful with the refactoring. But that's solvable, likely one has to introduce (pending?) deprecations on undesired attributes so that extensions have the change to migrate away before we change the API.
A related topic is that extensions tend to store state by dynamically adding attributes to the above objects. We should have a cleaner way to handle state of extensions. It may be desirable or necessary to do this as part of the untangling.
@electric-coder I agree that visualizations are often helpful. You are welcome to draw a diagram of the current relevant relationships between the above mentioned classes. We don't need full diagrams of the full application and for now a one-time draw is sufficient to illustrate the current state, therefore, you may use whichever tool you like.
Do we need to be as formal as UML diagrams?
I'd say getting the class diagrams is always worth it and the extra investment pays off immediately. There are practical problems however: tools that reverse the code to produce an adequate UML are generally proprietary. An alternative is using Pyreverse but producing a good layout is an NP hard problem so if the code has many classes the diagram it's likely to yield might be nearly unreadable. At one point you may want to do some manual layout but not every tool (like PlantUML) supports that. Thus an additional problem is created of maintaining UMLs.
effectively making the entire application one big reference cycle.
Designing an application to be a nice DAG generally has me keeping UMLs. Maybe because I'm getting older I've given up on trying to solve dependencies relying on memory alone.
Do we need to be as formal as UML diagrams?
So, returning to the original question, for me it's not a matter of formality but of cognitive necessity.
Do we need to be as formal as UML diagrams?
yep I'm definitely +1 for diagramming the proposed before/after
It doesn't have to be anything formal, just sketch it out in powerpoint or whatever (I have literally done that before 😉 https://github.com/aiidateam/aiida-core/pull/5330)
A related topic is that extensions tend to store state by dynamically adding attributes to the above objects. We should have a cleaner way to handle state of extensions.
yeh this is definitely somethat that should be done first; at present, unless you are creating a domain, there really is no obvious API for storing data
It doesn't have to be anything formal, just sketch it out in powerpoint or whatever
drawio is quite convenient for manual drawing.
An initial attempt:
Current
Proposed
A
Thanks for drawing the diagrams. Some remarks from a first look:
BuildEnvironment
contains references to config
and project
. There are no relational arrows drawn for these. Applies equally for the kept BuildEnvironment.config
in proposed.BuildEnvironment.domaindata
at all, in particular when there is also Domain.data
, which AFAICS references the same dicts. BuildEnvironment.domaindata
has a comment # domain-specific inventories, here to be pickled
. If pickling is the only reason, that exists, IMHO it would be better to handle these things and not carry them around in the datastructure persistently. One could either do this via __setstate__
/ __getstate__
(if one wants to rely on plain pickling of just the env; or, one could handle domain data explicitly on the outside, because we control the code that does the pickling/unpicklingExcellent effort @AA-Turner 👍 it's evident now the problem won't be easy to solve.
This signature is obviously problematic:
Domain.__init__(domain_data: DomainData, env_data: EnvData, current_doc: CurrentDocumentData): None
If each Domain
instance holds a reference to the instances of the 3 new classes as an attribute that would count as an empty diamond. Ideally since Domain
is a leaf in the instance order it should only be called, it shouldn't need to be calling the containing/composing class EnvData
or the other classes that compose BuildEnvironment
. So while the 3-split of BuildEnvironment
is a step in the right direction the proposed solution doesn't yet seem to solve the circular dependency that's the hardest problem.
What seems to have happened is that "for convenience" Domain
received BuildEnvironment
and then called it internally to simplify the signatures and take advantage of the OO, but even if no env
attribute were held by Domain
and it only read a BuildEnvironment
instance that would still be an association without a diamond - and still, a mutual functional dependency.
I also suppose you want to split the circular reference between the Sphinx
and BuildEnvironment
classes at a latter stage. It makes sense to start with the leaves before moving up to the root of the dependencies.
But lets say the maintainers managed to split, normalize, and decouple the whole thing... What kind of an interface would then allow backward compatibility (this remits me back to your initial announcement...) to be superimposed on the new implementation devoid of circular-references (an uncommon exercise for sure!)?
Here are my 2 cents (subject to modifications and maybe ideas)
For DomainData
, I fail to understand why it would be passed to the Domain constructor. I think you meant data: dict[str, Any]
as an initial domain data? and that DomainData
is actually a mapping from domain names to their actual domain's data? And that DomainData
would expose:
DomainData.store(domain_name: str, key: str, value: Any)
: add some data for a given domain, storing it under key
.DomainData.fetch(domain_name: str, key: str, default: Any)
: fetch domain-specific dataSince the domains are created in the Environment's constructor, they also have a reference to the incomplete application (at the moment of the call, the application is still in __init__()
and builders are not set up yet!). Now, instead of accessing it via the environment, they can access it if we pass around the application to their setup()
method instead. So extensions doing self.env.app
in Domain.setup()
would simply take app
from it. And if they need to keep a reference to the application, they can store it when doing Domain.setup()
(which should could as the "constructor finalization" when the environment is ready).
Note that we also need to address how serialization of components would work and if we can reconstruct them in the correct order. If we are trying to de-couple dependencies but in the end, we have something even more complex (and hard to follow), I'm not sure we need the change.
However, what was perhaps not said is that all constructors are actually not really constructors because they assume that the inputs are incomplete. When domains are constructed, they are so in the Environment constructor, which itself is called from the application constructor when the latter is not yet finished (builders are still not setup entirely!). So, instead of relying of a 2-step construction with an additional setup() function that acts as a post-initialization, I think we should use "contexts" instead of incomplete objects and the constructors would pick whatever they need from those contexts. The creation of the context can be done separtely as well and on the fly.
In other words, we may try a builder pattern instead of eager constructors (AFAICT, this is what you want: decouple the creation of each components and not have it in the application's constructor, right?)
A related topic is that extensions tend to store state by dynamically adding attributes to the above objects. We should have a cleaner way to handle state of extensions. It may be desirable or necessary to do this as part of the untangling.
I have a private repo with an interface for adding state to extension. Essentially, it's just a huge dictionary where keys are "tokens" for extensions and you can manage your own storage as you want.
Additional data coming from extensions should be stored in the environment so to save them and pickle them.
In other words, we may try a builder pattern instead of eager constructors
The right solution.
Essentially, it's just a huge dictionary where keys are "tokens"
I was drawing an UML based on the proposal and since support for Python 3.10 is being dropped the correct solution to get rid of the dependencies might be going for ParamSpec
in the constructors to avoid keeping circular references to instances yet formalize the interface whilst having it type checked.
In current,
BuildEnvironment
contains references toconfig
andproject
. There are no relational arrows drawn for these. Applies equally for the keptBuildEnvironment.config
in proposed.
I wasn't really sure how many arrows you're meant to include. I left them out because the project
attribute on env
is extracted from app
, rather than passed separately.
What do you want to achieve with proposed? Could you comment on the reasoning? It's difficult to figure this out from the diagram. For example, the DomainData class seems to pull out essentially one variable, which does not make sense to me.
Remove the mutual dependencies between types' constructors.
app
should set things up, manage extension interfaces (add_*()
) and run the builder.
env
currently plays a triple(?) role:
env
By splitting these into three, we can be more specific in what we need/use at each call-site.
Also, more generally, it feels a bit weird to have
BuildEnvironment.domaindata
at all, in particular when there is alsoDomain.data
, which AFAICS references the same dicts.BuildEnvironment.domaindata
has a comment# domain-specific inventories, here to be pickled
.
Each Domain.data
dict is a value of env.domaindata
, keyed to the domain name. I don't know why this design was originally chosen (simplicity?), but I think it's part of the whole data-store problem.
When doing parallel builds, data from each forked environment object gets merged back together, and it is the only(?) supported location for storing data during reading. Domains are nicer to store things in mainly as the .data
attribute abstracts the involvement of env
.
One end goal of this work is to restructure our parallel support, but I wanted to split the tasks up as we'll end up having a hopelessly large scope.
If pickling is the only reason, that exists, IMHO it would be better to handle these things and not carry them around in the datastructure persistently. One could either do this via
__setstate__
/__getstate__
(if one wants to rely on plain pickling of just the env; or, one could handle domain data explicitly on the outside, because we control the code that does the pickling/unpickling
This would be a good idea to experiment with. We would need to be careful re env-merge-info
etc and ensure that if we started storing data on a domain with no reference to the environment that we merged it back in after parallel reading.
If each Domain instance holds a reference to the instances of the 3 new classes as an attribute that would count as an empty diamond. Ideally since Domain is a leaf in the instance order it should only be called, it shouldn't need to be calling the containing/composing class EnvData or the other classes that compose BuildEnvironment. So while the 3-split of BuildEnvironment is a step in the right direction the proposed solution doesn't yet seem to solve the circular dependency that's the hardest problem.
You're right, we'd want to split whatever held config
and other environment constants from the container of domains itself.
For
DomainData
, I fail to understand why it would be passed to the Domain constructor. I think you meantdata: dict[str, Any]
as an initial domain data? and thatDomainData
is actually a mapping from domain names to their actual domain's data? And thatDomainData
would expose:
DomainData.store(domain_name: str, key: str, value: Any)
: add some data for a given domain, storing it underkey
.DomainData.fetch(domain_name: str, key: str, default: Any)
: fetch domain-specific data- Some other methods perhaps.
This seems better. I removed a lot of detail from the diagrams, and perhaps there was other domain-data attributes to go in that type? But yes each domain should just have a dict[str, Any]
initial data.
Since the domains are created in the Environment's constructor, they also have a reference to the incomplete application (at the moment of the call, the application is still in
__init__()
and builders are not set up yet!). Now, instead of accessing it via the environment, they can access it if we pass around the application to theirsetup()
method instead. So extensions doingself.env.app
inDomain.setup()
would simply takeapp
from it. And if they need to keep a reference to the application, they can store it when doingDomain.setup()
(which should could as the "constructor finalization" when the environment is ready).
This is interesting, though I'd prefer not to need to expose app
to the domains. What APIs are missing from env
that we could add? (Perhaps for later discussion)
two-stage
setup()
function
Note that this will be needed for as long as we use pickling, as unpickling an object calls __setstate__
and not __init__
A related topic is that extensions tend to store state by dynamically adding attributes to the above objects. We should have a cleaner way to handle state of extensions. It may be desirable or necessary to do this as part of the untangling.
I have a private repo with an interface for adding state to extension. Essentially, it's just a huge dictionary where keys are "tokens" for extensions and you can manage your own storage as you want.
Additional data coming from extensions should be stored in the environment so to save them and pickle them.
Having a better interface such as this would be good. Currently we recommend in documentation to set an attribute directly, which doesn't integrate well with static typing.
A
What I'm quite unclear about is how extensions fit in here. My understanding is that they are implicitly spread across the whole architecture: They can add config values, they may store information in the BuildEnvironment, they may hook into app events, they add or modify domains. Are they well enough isolated to not pose a problem to planned refactorings?
I wasn't really sure how many arrows you're meant to include.
It's optional. You can draw an arrow, or declare an attribute/argument with a type, or both - they're equivalent although with different expressiveness (depends on what conciseness/explicitness you want). I did draw the URL with all the arrows and it was quite a mess :P
So here's a hybrid sketch joining the post_init()
and "Builder pattern" that @picnixz mentioned (also to not let @AA-Turner be the lone user that contributes a UML).
Are they well enough isolated to not pose a problem to planned refactorings?
Dunno about the internals, but in the UML _BuilderPattern1
could also be a kind of top-level class and I suppose extensions also behave like top-level classes that have visibility over the entire API that's exposed.
This is interesting, though I'd prefer not to need to expose app to the domains. What APIs are missing from env that we could add? (Perhaps for later discussion)
Do you mean things you cannot access as env.thing
unless you do env.app.thing
? I don't really remember but there shouldn't have much. IIRC, at the time the domain is constructed, the things being available are:
setup()
(so they could also be in an incomplete state if they rely on config-inited
)env.emit
I think? or do we need env.app.emit
?)I don't have the code in front of me so there are probably stuff I'm missing.
Having a better interface such as this would be good
I can try to extract it and make it public if you want. And make it more robust. Actually, temporary data can just be regarded as a namespace that you carry over. It's essentially the same as a dictionary and autodoc could also benefit of similar stuff. I have contextual data that I need to carry over autodoc so what I essentially do is just creating some kind of "smart namespace".
We had a similar discussion in #12198 concerning the typing of the Config
object and how to make it properly extensible.
Is your feature request related to a problem? Please describe.
The project has several central internal types,
Sphinx
,BuildEnvironment
,Domain
(& subclasses), andProject
.Of these, only
Project
is well-isolated from the rest of the classes. Creating aSphinx
object internally creates aBuildEnvironment
object, which in turn creates an instance of all 10 builtin domains. All of this happens withinSphinx.__init__()
and creates mutual dependencies betweenapp
,env
, and each domain. This also happens to a lesser extent with theBuilder
object.This design makes it hard to effectively seperate concerns. It makes designing a better parallel system harder, and means that testing/mocking parts of the application is much harder than it could be.
I would like to see if there is a realistic way we can uncouple these mutual dependencies, and provide a migration path for downstream extensions.
Describe the solution you'd like
My current idea is to break
BuildEnvironment
into three parts:env.tmp_data
dict.I believe this would allow us to pass around the data objects and effectivley construct
BuildEnvironment
accessor/wrappers on the fly where needed. For example, we can passDomain.__init__
thedomaindata
dict as part of the global dataclass, instead of the entire env object as at present.I am very open to other ideas though, as there may be approaches I haven't considered. Keeping backwards compatibility and providing a migration path for downstream users will be key, but I think should be doable with clever use of
__getattr__
etc.Describe alternatives you've considered
Do nothing. The easiest option, but leaves us in a less-than ideal situation.
cc: @sphinx-doc/developers
A