Typing `pending_xref` node attrributes (and maybe all `addnodes`)

chrisjsewell commented 7 months ago

Currently, pending_xref includes no information about what attributes it is expected to have: https://github.com/sphinx-doc/sphinx/blob/bfce4f54b0fc010b04f0d17ad8b9d47d8c279e1f/sphinx/addnodes.py#L477-L485

There are definitely some attributes that are expected, i.e. they are accessed with node['name'] and would KeyError except if the attribute is not present, then there are ones that are maybe optional, i.e. they are accessed with node.get('name')

I haven't checked if there are actually any open issue, but there definitely seems like there are discrepancies in the code base, here are some examples:

https://github.com/sphinx-doc/sphinx/blob/bfce4f54b0fc010b04f0d17ad8b9d47d8c279e1f/sphinx/transforms/post_transforms/__init__.py#L208-L210

here the "check" treats refdomain as if it is optional, but then the msg creation treats it as if it is required 😅

https://github.com/sphinx-doc/sphinx/blob/bfce4f54b0fc010b04f0d17ad8b9d47d8c279e1f/sphinx/builders/latex/__init__.py#L373-L375

here refsectname is treated as required, but if you actually search for refsectname in the code base, it is definitely not set in a lot of places:

sphinx/builders/texinfo.py:
  166:             sectname = pendingnode['refsectname']
sphinx/builders/latex/__init__.py:
  375:             sectname = pendingnode['refsectname']
sphinx/domains/std/__init__.py:
  776:             contnode['refsectname'] = sectname

Am I missing something here? How is some of this stuff not excepting?

@danieleades do you have any thoughts on how to type Node subclasses, to have some kind of enforcing of available attributes. Ideally it would be enforced for instantiating the node, and for accessing attributes in an existing node, but I know this would be extremely difficult, if not impossible 😏

Also cc @picnixz may have an opinion?

picnixz commented 7 months ago

Ah yes, the pending_xref node. The issue with the attributes is that there were different people implementing different things at different places and I don't think we ever had a doc telling the exact specs of that node.

here the "check" treats refdomain as if it is optional, but then the msg creation treats it as if it is required 😅

It becomes "required" in the sense that node.get('refdomain', 'std') returns 'std' or node['refdomain']. But since you excluded the 'std' case, you get node['refdomain'] so it's fine.

here refsectname is treated as required, but if you actually search for refsectname in the code base, it is definitely not set in a lot of places:

The standard domain is always present, so the corresponding transformations and post-transformations as well. In particular, when you call env.resolve_references(largetree, ...), you also call the resolve_xref which may either remove your pending_xref nodes or build dangling references nodes that are still instances of pending_xref nodes. You can see that there are not a lot of places where you explicitly create pending_xref nodes after resolving (the resolve_xref calling _resolving_ref in the standard domain is an example) so I think it's the only place which, when after resolving everything, you still have the possibility of having such nodes (with the expected attribute).

Now, we should have added a better comment or be sure that the only remaining pending_xref nodes are created by the standard domain for ref.

picnixz commented 7 months ago

As for your original question: yes and no. Yes for common attributes that are known to exist, but that's not always the case (like, refdomain might not exist in some extensions) and you might not know at construction time the final information that is needed (for instance Domain.process_field_xref might add additional information to the node when called).

danieleades commented 7 months ago

I'm not familiar with this part of the code base, so my comments will be general rather than specific.

Sphinx uses 'stringly-typed' interfaces in a lot of places by using a dict-like interface for many objects. As you observed, often without default arguments or error handling. That doesn't imply it's a bug, more likely the author knows some invariant that mypy doesn't. The problem comes if you refactor and unknowingly violate the invariant... If we can refactor more code to be interrogable by mypy then refactoring can become more "fearless"

Where we're absolutely certain that a property must always exist on a class or instance of a class then I think it makes sense to create a subclass and set the property in the init method (or potentially as class variables where appropriate) and then access these properties directly rather than through the dict-like interface.

chrisjsewell commented 7 months ago

like, refdomain might not exist in some extensions

Well, that is one of the things I'm trying to work out for myst-parser 😅 But I think this has to exist, since it is expected in a number of places:

sphinx/builders/latex/transforms.py:
  563:             if node['refdomain'] == 'math' and node['reftype'] in ('eq', 'numref'):

sphinx/domains/math.py:
  32:         node['refdomain'] = 'math'

sphinx/ext/intersphinx.py:
  751:                        (node['refdomain'], typ, node['reftarget']))

sphinx/transforms/i18n.py:
  308:             case = node["refdomain"], node["reftype"]
  313:                     node["refdomain"],

sphinx/transforms/post_transforms/__init__.py:
   87:                         domain = self.env.domains[node['refdomain']]
  210:                    (node['refdomain'], typ, target))

tests/test_extensions/test_ext_intersphinx.py:
  32:     node['refdomain'] = domain

If we can refactor more code to be interrogable by mypy then refactoring can become more "fearless"

exactly, amen to that 🙏

picnixz commented 7 months ago

I'd be happy to move out to property-based interface but I think the choice is mostly historical and docutils motivated. Alternatively, I could suggest having a mypy plugin for that instead because it could be integrated in the current framework without refactoring anything (or barely).

But I think this has to exist, since it is expected in a number of places:

It is expected but I'm not sure that it is necessarily expected at the beginning of the transformation chain. Transformations and post-transformations have different priorities and I'm pretty sure there are some that dynamically add information because their authors know some invariants (that we do not) to process it in the future.

sphinx-doc / sphinx

Typing `pending_xref` node attrributes (and maybe all `addnodes`) #12208