Cross-reference take most of time when building large package with a probable bug in make_xref

Describe the bug

Building large packages takes a lot of time, most of which is cost on resolving cross-references.

When building a package of 70MB and 3399 files (azure-mgmt-containerservice), it cost 3615 seconds and 2806 seconds are cost on resolving cross reference.

"Fuzzy" searching mode of PythonDomain.find_obj method is iterating all loaded objects in the memory, when the package has a lot of objects and xref to resolve, the build process is very slow.

The searchmode parameter is designed to control whether using Fuzzy mode to resolve cross references, but I think it's not working under most cases (e.g. if I cross-reference the built-in :class: str, Sphinx will use fuzzy mode to match it).

I think the problem is in PyXrefMixin.make_xref, attribute 'refspecific' is set to True, and won't be changed whatever the reftarget string is. Is it the expected behavior? Update the node attributes according to "refspecific" in return value back seems more reasonable after parsing reftarget.

And can we have a switch in config files to only find exact match and don't fuzzy search when building a package?

Screenshot: Full iterating for fuzzy mode:

Seems should update the node attributes "refspecific" after parse_reftarget.

How to Reproduce

Reproduce python script:

from sphinx.application import Sphinx

app = Sphinx(
        srcdir= './rst',
        confdir='./conf',
        outdir= './output',
        buildername='html',
        doctreedir='./doctrees'
    )
app.build()

Command to run cprofile:

 python -m cProfile -o "stats-compare" -s 'cumulative' .\run-sphinx-with-cprofile.py

RSTs are generated by apidoc:

sphinx-apidoc --module-first --no-headings --no-toc --implicit-namespaces -o "" ""

conf.py:

# -*- coding: utf-8 -*-

source_suffix = '.rst'
master_doc = 'index'
project = u''
copyright = u''
author = u''
version = '0.1'
release = '0.1'
language = None
exclude_patterns = ['_build']
pygments_style = 'sphinx'
todo_include_todos = False
html_theme = 'bizstyle'
html_static_path = ['_static']
htmlhelp_basename = 'Example Document'
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.napoleon'
]

intersphinx_mapping = {'python': ('https://docs.python.org/3.6', None)}
# Make Google-style and Numpy-style Example work.
napoleon_use_admonition_for_examples = True

exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
autodoc_mock_imports = ['google']
autoclass_content = 'both'
autodoc_default_options = {
    'inherited-members': True
}

Environment Information

Platform:              win32; (Windows-10-10.0.22621-SP0)
Python version:        3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)])
Python implementation: CPython
Sphinx version:        6.1.3
Docutils version:      0.19
Jinja2 version:        3.0.1
Pygments version:      2.14.0

Sphinx extensions

['sphinx.ext.autodoc']

Additional context

No response

What I understood

Cross-references are resolved by running the ReferencesResolver post-transform which dispatches this request to the domain. The domain resolves references in

https://github.com/sphinx-doc/sphinx/blob/39fa276b360dc36953b3ba62e19d6a978cd5109c/sphinx/domains/python.py#L1367

The "fuzzy" or "non-fuzzy" mode seems to be enabled whenever refspecific is set (independently of its value):

https://github.com/sphinx-doc/sphinx/blob/39fa276b360dc36953b3ba62e19d6a978cd5109c/sphinx/domains/python.py#L1372-L1374

I think that this was decided in order to have the minimal amount of information on the nodes and reduce the space (and time) complexity but this might be a separated bug.

From what I know and what I could find, the flow is as follows:

Upon encountering :type my_param: MyType, a new PyGroupedField is created. Since PyGroupedField inherits from PyXrefMixin and the mixin forces the inliner to be None, a pending_xref node is created and processed (not resolved):

https://github.com/sphinx-doc/sphinx/blob/2c83af0aab7080e0b78d4a16981eed878b2cac4c/sphinx/util/docfields.py#L76-L85
With searchmode=1, Sphinx tries to find the following names with the same object type (i.e., only the object types associated with the reference role that was used are considered):
- mod.cls.name
- mod.name
- cls.name
- all targets ending with .name
In searchmode=0, only the three first names are searched but the reference role is not taken into account. In particular, there might be a lot of invalid results (like a function pointing to a method with the same name) because the reference role is not considered at all.
I don't know whether the calls to find_obj are issued from PyXrefMixin objects or if they are caused by other reference objects. Patching PyXrefMixin might not be sufficient but I will try to do it and measure the timings.

Bug or feature ?

One probable bug is the fact that refspecific=False is never handled (because node.hasattr('refspecific') returns True if refspecific=False). I highly suspect that there are other places where attributes values are never handled and only their presence are acknowledged. By the way, type_to_xref seems to handle the refspecific flag so there might be some distinctions between these two mechanisms.

As for the time spent during the resolution, it is also due the fact that there are many references to resolve (there are 280k calls to find_obj !). Unfortunately there is no way to "cache" the results because everything is context-dependent. We could add a configuration value for "common" types or intersphinx inventory, and say "if you encounter that name, resolve it using that thing instead of this".

Suggestions

One way to reduce the workload is to change the structure of the objects that were found. Currently, this is a huge dictionary mapping fully qualified names to their corresponding object entry, but we could perhaps change the representation (e.g., using a sorted dictonary in order to locate the chunks more precisely).

In addition, I want to isolate the caller causing most of the "long" calls. That way, it would be easier to optimize some parts of the code.

After some investigation, here are my findings (feel free to argue):

PyXrefMixin always set refspecific=True and does not care about how the reference was presented. Consider the following example:


.. py:function:: foo(x)

   docstring 

   :param x: SOME_TYPE
   :type x: SOME_TYPE

For me, the value of refspecific should depend on the configuration value python_use_unqualified_type_names and whether . is used as a prefix of SOME_TYPE. The presence of the . dictates the value of refspecific.

Now, I observed the following weird stuff in PyXrefMixin.make_xref. First, it is not possible to have SOME_TYPE equals to .Bar or .foo.Bar since otherwise the reftarget value is .foo.Bar. This is quite problematic and I would expect that the RHS of :type ...: to behave as if it were using :class:`SOME_TYPE`, as explained by:

https://github.com/sphinx-doc/sphinx/blob/39fa276b360dc36953b3ba62e19d6a978cd5109c/sphinx/domains/python.py#L357-L359

The reason why I cannot use :type x: .Bar or :type x: .foo.Bar lies in:

https://github.com/sphinx-doc/sphinx/blob/39fa276b360dc36953b3ba62e19d6a978cd5109c/sphinx/domains/python.py#L381-L395

Case `reftitle != reftarget`

In this case, the pending_xref node will consist of a single text. This happens when prefixing the reference by ~ or typing. (e.g., ~foo.Bar or typing.Dict). In other words, the target being resolved is everything after ~.

I don't know whether this is the proper usage of ~, but when I use ~, I use a fully qualified name afterwards. So, I don't see why we need to search with refspecific=True. Also, the way parse_reftarget is implemented forbids the combination of ~ with . It is possible to use :class:`~.foo.Bar` for instance, but when using it in :type x: ~.foo.Bar, we end up with a target .foo.Bar instead of foo.Bar.

Case `reftitle == reftarget` and `python_use_unqualified_type_names=True`

In this case, the pending_xref node has two children and reftarget == reftitle. These two children are both required but their content should be pruned from any leading . if any. Again, one cannot use a type starting with a dot since it, e.g.:

<pending_xref py:class="True" py:module="main" refdomain="py" refexplicit="False" refspecific="True" reftarget=".foo.Bar" reftype="class">
  <pending_xref_condition condition="resolved">
    <literal_emphasis>
      Bar
  <pending_xref_condition condition="*">
    <literal_emphasis>
      .foo.Bar

Ideally, the reference target should be foo.Bar and the second resolution name should be foo.Bar and not .foo.Bar.

The optimizations I had in mind are quite simple. Everything is due to the fact that refspecific=True is assumed. This probably helps in many situations since that way the user does not need to put . explicitly. Instead, I suggest the following:

Add a configuration value for the default value of refspecific in PyXrefMixin. For compatibility, it will be set to True. It can be False, in which case the user needs to put . or a fully qualified name (but explicit links are not supported, namely I cannot write :type x: title <target>, but this is a separate issue).

If the value is None, we could internally decide the refspecific value. For instance, if the RHS of :type x: already contains dots (but does not start with one), then we could assume that it is sufficiently precise and won't check for all targets ending by that name. We should discuss about this approach as well.
Currently, the matches are constructed as follows:

https://github.com/sphinx-doc/sphinx/blob/39fa276b360dc36953b3ba62e19d6a978cd5109c/sphinx/domains/python.py#L1346-L1348

Instead of indexing the objects by their name and iterating over the whole set of objects, I suggest sorting the dictionary in reverse alphabetical order so that the suffix can be matched on a range rather than iterating over the full collection of objects.

This approach however requires to freeze the dictionary or to implement a structure similar to use a list-like object for "insert & sort" operations but that can be used as a dictionary (for compatibility reasons). We could use https://grantjenks.com/docs/sortedcollections/ or a custom implementation.

Important note

I was writing my reply when I wanted to check the list of object types being tested. And actually I think there is an important bug in sphinx.ext.intersphinx. Consider the following file:

.. py:function:: foo(x, y)

   docstring

   :param x: x
   :type x: typing.Any
   :param y: y
   :type y: typing.Any

https://github.com/sphinx-doc/sphinx/blob/39fa276b360dc36953b3ba62e19d6a978cd5109c/sphinx/domains/python.py#L1328-L1331

we have objtypes = self.objtypes_for_role(type). But I was surprised to see the following output when sphinx.ext.intersphinx is loaded:

# when processing x
['function', 'data', 'class', 'exception', 'method', 'classmethod', 'staticmethod', 'attribute', 'property', 'module']

# when processing y
['function', 'data', 'class', 'exception', 'method', 'classmethod', 'staticmethod', 'attribute', 'property', 'module', 'method']

In particular, there is some side-effect due to the dispatcher roles ! the 'method' is duplicated ! and actually, it grows indefinitely. I confirmed that the bug only occurs if sphinx.ext.intersphinx is loaded:

https://github.com/sphinx-doc/sphinx/blob/5a6b2b16b6b588cdd4dab1b43b7873153da10baa/sphinx/ext/intersphinx.py#L365-L370

The if statements never check that the object types already exist. This is critical for performances. For small projects, we don't really care but for very large projects, it might be a real issue (especially since there is an undesirable side-effect on the object types of the domain itself).

Thanks @picnixz ! Agree with you. I think another place to fix is not to assign searchmode=1 when node has attribute of refspecific but the value is false. e.g. currently when node['refspecific'] == false, searchmode = 1 if node.hasattr('refspecific') else 0 will still assign 1 to searchmode

https://github.com/sphinx-doc/sphinx/blob/39fa276b360dc36953b3ba62e19d6a978cd5109c/sphinx/domains/python.py#L1367-L1376

I think another place to fix is not to assign searchmode=1 when node has attribute of refspecific but the value is false.

I resisted against that for this PR. I am not entirely sure that the Sphinx implementation actually assumes this or not. This is a trade-off between usability and time-space complexity (time because has() is faster than get() and space because it may save a refspecific=False). Now, if we allow a user specifying refspecific explicitly, we should indeed refactor this (possibly adding a DeprecationWarning along the way because it is quite important for extensions?).

By the way, docutils or Sphinx have the convention (at least for directives) to use a directive presence as an enabled flag. If we follow that convention (but don't allow custom default refspecific), this is not really a bug, but this should be documented somewhere (I don't know if it is already the case).

OK, a config to turn-off default refspecific would be enough for unblocking me. It seems for the long term, the solution is to only assign refspecific when necessary.

from https://github.com/sphinx-doc/sphinx/issues/11282#issuecomment-1511900129

And actually I think there is an important bug in sphinx.ext.intersphinx. [...] The 'method' is duplicated ! and actually, it grows indefinitely. [...] The if statements never check that the object types already exist.

@picnixz Would it be a quick and simple fix to make objtypes a set instead of a list? That would trivially solve the duplication issue. And it should speed up find_obj a bit because the in objtypes becomes O(1) instead of O(n). - Since objtypes will not be too large in general, I checked that even for 1-3 elements in for sets is slightly faster than for lists. So, AFAICS there's nothing to loose in terms of performance.

The duplication issue has been solved separately (#11337) but the reason why we are using a list instead of a set is to retain the discovery order (needed for reproducible builds). IIRC, the objtypes are initially ordered in priority order (and extensions could make changes to that). In intersphinx, we are iterating over that list so we need a structure that is order preserving and that has a O(1) lookup (amortized O(1) is also fine I think).

We need a structure that is order preserving and that has a O(1) lookup (amortized O(1) is also fine I think).

If we don't want dependencies like sortedcollections, a simple dict with None values would fulfill that. Though I suspect the lookup is not the real bottleck (would need timing), and then it's questionable wether one should go for the slightly awkward misuse of using a dict as a sorted set.

sphinx-doc / sphinx