Incorporation of TensorStore C++ autosummary extension

jbms commented 2 years ago

I have a TensorStore C++ autosummary also mostly implemented, although not yet released in the TensorStore repository. I would like to potentially incorporate it into this theme, but wanted to find out if that might be useful to anyone else. (If not, I'll just maintain it separately within TensorStore.)

It works similarly to the TensorStore Python autosummary extension, in that every entity is documented on a separate page.

At the top level, entities are organized into groups (like doxygen groups) and e.g. a .. cpp-apigen:: group directive is used to insert the summary of the members of a group.

2bndy5 commented 2 years ago

in that every entity is documented on a separate page.

I have been browsing the tensorstore docs, and I found it confusing when I click on a cross-reference and it loads a separate page with 1 memeber on it.

I'm very interested in this C++ autosummary idea because I haven't found a good alternative to breathe that cuts Doxygen out of the workflow.

jbms commented 2 years ago

I agree that for members with just a short description, a separate page doesn't really make sense. I figured that over time the documentation would be expanded and almost everything would have a longer description and examples and such, and therefore justify a separate page.

For example, the documentation for this member is quite long and would not work well without a separate page: https://google.github.io/tensorstore/python/api/tensorstore.TensorStore.write.html

Note: The numpy docs put every function on a separate page.

But I was also thinking about adding a rule where if the documentation is short enough, not to generate a separate page.

In principle there could also be an option to do something other than separate pages for each entity --- would just have to figure out what that "something else" would be.

With the C++ autosummary, there is a first pass that uses libclang to generate a JSON description of the API (this could happen either separately from the documentation build, or as the first step of it), and then the actual autosummary extension reads this json file.

2bndy5 commented 2 years ago

Your example does make sense for using a separate page. To be clear, I was confused at the structure of the TOC. It looks like each TOC entry is specific to a class or module? I didn't expect that clicking a item in the secondary TOC would lead me to a subpage. That approach kind of exemplifies #58

2bndy5 commented 2 years ago

I've been looking into the libclang python wrapper. There is a types-clang stub lib for it also. I think the stubs are maintained by the LLVM team, but not all objects are typed properly (namely the File object). We may also have to provide conf.py options that expose CLI args to libclang (like -p for a compilation database).

The Cursor.raw_comment doesn't involve parsing, so, I'm wondering what syntax is expected & if the syntax could be configurable. Naturally, we'd have to strip the raw_comment of the actual C++ comment syntax (eg. ///, /** .. */, //!, /*! .. */, //<) and split it into lines. I think it would be easiest to expect RST syntax by default, but it would be extremely handy to support markdown and/or a doxygen flavor (which I think is a combination of markdown and javadoc syntax - still researching this bit).

jbms commented 2 years ago

The current version that I have implemented only allows /// commands and expects rST syntax, but additionally translates some doxygen commands like \param and \returns. Supporting other syntax would be possible, though.

2bndy5 commented 2 years ago

I have written a test script that strips out all comment like syntax:

def strip_comment(cmt: str = None) -> Optional[str]:
    """
    :param cmt: the `Cursor.raw_comment` property.
    :returns: A list of the lines without the surrounding C++ comment syntax.
    """
    if cmt is None:
        return None
    # the first line is never indented (in clang-parsed form)
    # so dedent all subsequent lines only
    first_line_end = cmt.find("\n")
    cmt = cmt[:first_line_end] + dedent(cmt[first_line_end:])
    # split into a list of lines & account for CRLF and LF line endings
    body = [line.rstrip("\r") for line in cmt.splitlines()]

    # strip all the comment syntax out
    if body[0].startswith("//"):
        body = [line.lstrip("/").lstrip("!").lstrip("<").strip() for line in body]
    elif body[0].startswith("/*"):
        body[0] = body[0].lstrip("/").lstrip("*").lstrip("!")
        multi_lined_asterisk = True  # should be ok for single-line comments
        if len(body) > 1:
            line_has_asterisk = [line.startswith("*") for line in body[1:]]
            multi_lined_asterisk = line_has_asterisk.count(True) == len(body) - 1
        body = [
            (line.lstrip("*").lstrip("<").strip() if multi_lined_asterisk else line.strip())
            for line in body
        ]
        body[-1] = body[-1].rstrip("*/").rstrip()
    return body

Still testing it out though...

2bndy5 commented 2 years ago

I found a similar implementation for autodoc-ing C code using libclang... It monkey patches the libclang python binding to expose methods that allow using clang to parse the comments and return them as xml (see monkey patch src here).

So, for a function's docstring that looks like


/*!
 *  \brief A member function.
 *  \param c a character.
 *  \param n an integer.
 *  \exception std::out_of_range parameter is out of range.
 *  \return a character pointer.
 */
const char* Test6::member(char c, int n) const

the patched clang binding's parsed comment can be interpreted with lxml, which results in

<Function isInstanceMethod="1" file="path\\to\\tests\\doxygen\\func.hpp" line="23" column="20">
  <Name>member</Name>
  <USR>c:@S@Test6@F@member#C#I#1</USR>
  <Declaration>const char *Test6::member(char c, int n) const</Declaration>
  <Abstract>
    <Para> A member function.  </Para>
  </Abstract>
  <Parameters>
    <Parameter>
      <Name>c</Name>
      <Index>0</Index>
      <Direction isExplicit="0">in</Direction>
      <Discussion>
        <Para> a character.  </Para>
      </Discussion>
    </Parameter>
    <Parameter>
      <Name>n</Name>
      <Index>1</Index>
      <Direction isExplicit="0">in</Direction>
      <Discussion>
        <Para> an integer.  </Para>
      </Discussion>
    </Parameter>
  </Parameters>
  <Exceptions>
    <Para> std::out_of_range parameter is out of range.  </Para>
  </Exceptions>
  <ResultDiscussion>
    <Para> a character pointer.</Para>
  </ResultDiscussion>
</Function>

This will provide a better way to parse Doxygen style comments rather than parsing the raw text manually. 👍🏼

jbms commented 2 years ago

That's interesting, I didn't know about the xml comment output from clang.

I've been working on putting together an initial/draft version of the cpp apigen based on what I implemented for tensorstore. At this point it is almost ready.

2bndy5 commented 2 years ago

I'm working in a separate repo (local only - nothing's been pushed). I'm trying to write a replacement for breathe, but there are things that doxygen does and clang doesn't; namely

parsing markdown files/text - maybe I can use the MyST parser API for this
automatic cross referencing (without any explicit syntax) - this will require a comprehensive search and replace according to parsed members (from translated units/modules). I'm not sure if this is a rabbit hole I want to explore just yet.

I eagerly await your implementation.

2bndy5 commented 2 years ago

I started exploring the XML approach because the doxygen commands can have nested paragraphs (using @par and @parblock ... @endparblock) within a command that is designed to only accept a single paragraph. Furthermore, Doxygen paragraphs end on either a blank line or a line that starts with new command that triggers a new paragraph/list/etc. It is my hope that the clang comment parser can alleviate this complexity.

2bndy5 commented 2 years ago

Doxygen's grouping is probably one of the worst implemented features it has. I quick tested this doctring

/**
 *  @ingroup group2
 *  @brief class C3 in group 2
 */
class C3 {};

and got the following parsed xml

<Class file="path\to\tests\doxygen\group.cpp" line="35" column="7">
  <Name>C3</Name>
  <USR>c:@S@C3</USR>
  <Declaration>class C3 {}</Declaration>
  <Abstract>
    <Para> class C3 in group 2</Para>
  </Abstract>
  <Discussion>
    <Verbatim xml:space="preserve" kind="verbatim"> group2</Verbatim>
  </Discussion>
</Class>

So, we'd still have to parse Verbatim nodes to extend/fix Doxygen's grouping commands. Notice that libclang does not include a cursor just for estranged comment blocks like

/**
 *  @defgroup group2 The Second Group
 *  This is the second group
 */

It also may not consolidate consecutive comment blocks like Doxygen does.

2bndy5 commented 2 years ago

I don't intend to support @page (and the like) commands because that is specific to Doxygen's ill-conceived static site generator (🤮).

jbms commented 2 years ago

To be clear --- is that XML output from clang or from doxygen?

One of the nice things about the way I'm currently parsing doc comments is that line number information is preserved, so that any Sphinx warnings and errors are shown with a location in the original source file.

In general, I think it may be sufficient to just support the most common doxygen commands, e.g. so that most doc comments don't have to be changed.

jbms commented 2 years ago

What is it that you don't like about doxygen's grouping? The grouping feature I've implemented for the C++ and Python apigen extensions is certainly inspired by doxygen, though there are some differences.

2bndy5 commented 2 years ago

To be clear --- is that XML output from clang or from doxygen?

I'm not running Doxygen at all. Its all clang output (using the monkey patch I referenced) then parsed with lxml.etree.fromstring(). The XML output you see is from print(lxml.tostring(<root _Element>, pretty_print=True).decode())

What is it that you don't like about doxygen's grouping?

It divides up the summary into "modules" (not to be confused with compilation units). Anything within groups or subgroups won't show in its original namespace's document summary. This makes navigating Doxygen-generated docs that extensively use the grouping (for C++ libs) very confusing.

2bndy5 commented 2 years ago

I see you expose filtering options in the config like dis/allow_symbols/namespaces/macros. Could this be simplified by exposing these as directive options (like autodoc's :members:)? The regex approach seems geared toward developer's ease instead of user's ease.

The ignore_template_parameters and hide_initializers seems like they would be more specific to certain object types (like classes/functions/attributes).

jbms commented 2 years ago

The way it currently works is that in the builder-inited event, which happens before sphinx actually reads any of the rst documents (or even determines the list of rst files) we parse the code, extract the API description as json, and generate the separate pages for documenting each entity. Then the directives that are parsed when sphinx processes the rst files just insert group summaries.

By the time the rst directives are parsed, it is too late to generate more files, at least with the current sphinx build process.

The built-in autosummary extension does something hacky to work around this limitation --- it first does its own "parsing" of all the source files to locate any autosummary directives, but this isn't a real parsing of the rst, just some regular expressions. That means it doesn't work for another directive to generate an autosummary directive.

In any case for tensorstore the current configuration method seemed to work pretty well, in order to exclude internal/private apis, but basically include everything else.

2bndy5 commented 2 years ago

I didn't see any undoc'd members or access specifier filtering. I'd like to add that as well, but sometimes the Cursor.access_specifier.name rightfully shows as "INVALID" (for decls directly in global/namespace scope) or "NONE" (for certain definitions)...

In research, I decided to keep a list of parsed USR and avoid serializing entities (to JSON) with duplicate USR ids (this sped up the serialization a bit). Clang seems to copy the first found docstring for decls or defs, which made having duplicate USRs redundant to me.

jbms commented 2 years ago

Currently there is no access specifier filtering, but entities without a doc comment are skipped.

There can be more than one declaration but only one definition within a single translation unit. The information from all occurrences is merged (doc comment, default template arguments).

jbms commented 2 years ago

The idea is that you arrange to include all the headers you want to document in a single translation unit (i.e. unity build), that way every header is parsed only once.

2bndy5 commented 2 years ago

I often see undoc'd member's names as if they're self-explanatory (and could be useful if cross-refing an undoc'd member).

2bndy5 commented 2 years ago

The idea is that you arrange to include all the headers you want to document in a single translation unit (i.e. unity build), that way every header is parsed only once.

Some APIs have more than 1 entrypoint though. I agree with this conception, sadly it isn't a perfect world.

2bndy5 commented 2 years ago

I wish clang would expose a way to filter AST by relativity to indexed unit.

jbms commented 2 years ago

The idea is that you arrange to include all the headers you want to document in a single translation unit (i.e. unity build), that way every header is parsed only once.

Some APIs have more than 1 entrypoint though. I agree with this conception, sadly it isn't a perfect world.

What do you mean by entrypoint? As long as they can be used together, you can just create a new header that #includes all of the headers you need.

jbms commented 2 years ago

I wish clang would expose a way to filter AST by relativity to indexed unit.

What do you have in mind as far as relativity?

2bndy5 commented 2 years ago

If it isn't directly in the translation unit's Cursor.location.file, then ignore it. That would make our lives a bit simpler, and I imagine users could better understand what needs to be specified as source files (instead tracing their own src's includes).

jbms commented 2 years ago

That could be done with the allow_paths option but is basically the opposite of what I intended as far as the unity build approach, unless your library defines everything in a single header.

jbms commented 2 years ago

The issue with parsing every public header of a library separately is that there is likely to be huge overlap in the transitive dependencies of those headers, which means a lot of headers will have to be parsed multiple times, making it much slower.

2bndy5 commented 2 years ago

What do you mean by entrypoint?

Some projects have a lib and an executable as separate building entrypoints.

As long as they can be used together, you can just create a new header that #includes all of the headers you need.

This sounds like a good idea for passing in the unsaved_files arg to clang.cindex.Index.parse(). Maybe we could ask the user for all lib sources and assemble a header file like this on the fly. Thus, the need for only 1 translation unit is ensured.

2bndy5 commented 2 years ago

Um, how do you want to collab on this? I ask because I know you like to force push after amending git history. Should I PR any proposals into the cpp-apigen branch?

I'm going to patch in support for all doc comment prefixes and a corresponding function to strip the comment syntax (resulting in just the comments' text as List[str]).

jbms commented 2 years ago

How about I'll stop pushing (force or fast-forward) to this branch and then we both can make changes by opening additional PRs against this branch?

2bndy5 commented 2 years ago

Sounds fair.

2bndy5 commented 2 years ago

I'm having trouble building on windows: It keeps failing to find any declarations. I've narrowed it down to the allow_paths option, and I suspect it has something to do with the path separator being required at the end of the pattern. I fixed this by changing https://github.com/jbms/sphinx-immaterial/blob/73d9284583dd0ab47b89320d56f2a5b523c4c78d/sphinx_immaterial/apidoc/cpp/api_parser.py#L319

to

            path = path.replace("\\", "/")

This took me a while to track down.

2bndy5 commented 2 years ago

Apparently getting the comments from tokens isn't great. We're only getting single line comments this way (those that start with //). I can't parse comment blocks from tokens (like those starting with /*) because they aren't showing up as a result of get_tokens().

I actually think this may be related to how you're traversing the file (looks like its going line-by-line - I think). In my research, I was able to use the Cursor.raw_comment attr, but that doesn't seem to work for some things (like stray comments not directly adjacent to any entities - eg @file and @copyright comments). I haven't played with macros yet, but the sphinx-c-autodoc ext's src seems to indicate that they don't have a raw_comment associated with them.

jbms commented 2 years ago

I'm having trouble building on windows: It keeps failing to find any declarations. I've narrowed it down to the allow_paths option, and I suspect it has something to do with the path separator being required at the end of the pattern. I fixed this by changing https://github.com/jbms/sphinx-immaterial/blob/73d9284583dd0ab47b89320d56f2a5b523c4c78d/sphinx_immaterial/apidoc/cpp/api_parser.py#L319

to
            path = path.replace("\\", "/")
This took me a while to track down.

Oh yes, thanks for catching that.

jbms commented 2 years ago

Apparently getting the comments from tokens isn't great. We're only getting single line comments this way (those that start with //). I can't parse comment blocks from tokens (like those starting with /*) because they aren't showing up as a result of get_tokens().

I actually think this may be related to how you're traversing the file (looks like its going line-by-line - I think). In my research, I was able to use the Cursor.raw_comment attr, but that doesn't seem to work for some things (like stray comments not directly adjacent to any entities - eg @file and @copyright comments). I haven't played with macros yet, but the sphinx-c-autodoc ext's src seems to indicate that they don't have a raw_comment associated with them.

I haven't checked but I think multiline comments will also appear as tokens. It is true that currently I am querying tokens line by line in order to find the comments that occur before a cursor. Maybe the issue is that if you query with a start location that is within a multiline comment then it doesn't get returned --- perhaps the start location has to be at or before the start of the multiline comment.

It is also indeed possible to just iterate through all the tokens, and then associate them with entities. That would allow finding @file comments. Currently I ignore them since I preferred to organize the documentation around groups rather than files, but maybe it is useful to also support file-based organization.

Parsing the comments manually from the source files # would also not be too hard --- can be done with regular expressions I think.

2bndy5 commented 2 years ago

I was also thinking that the apigen ext should be its own sub-pkg. The api_parse.py file is already over 2000 lines...

2bndy5 commented 2 years ago

I'm ok with ignoring the file-level comments since it doesn't seem applicable from the autodoc perspective (really is more specific to doxygen). I think some stray comments can also be ignored like those that define groups in doxygen using @{/@} cmds.

I still think it would be easier to try filling in the raw_comment attr if it is empty. This way the trailing docstrings are captured as well.

   int var; ///< some doc info about `var`

Instead of traversing line by line, we could traverse between cursor locations.

jbms commented 2 years ago

I was also thinking that the apigen ext should be its own sub-pkg. The api_parse.py file is already over 2000 lines...

Seems reasonable

jbms / sphinx-immaterial

Incorporation of TensorStore C++ autosummary extension #92