libsdl-org / SDL

Simple Directmedia Layer
https://libsdl.org
zlib License
9.99k stars 1.84k forks source link

Improvements to and fixes of wikiheaders.pl and sdlwiki SDL API documentation generation. #9236

Closed blaisemccourtney closed 1 month ago

blaisemccourtney commented 8 months ago

There are some improvements that could be made to the automated generation of the sdlwiki SDL API documentation, some of which I am listing below to enable developers keep track of them. I considered fixing some of them myself, but the script responsible https://github.com/libsdl-org/SDL/blob/main/build-scripts/wikiheaders.pl for the generation would probably have to be refactored before I would venture a try at working with it. 1700+ lines of Perl with heavy regex usage and long code blocks is a bit daunting if you are not familiar with Perl.

The above issues may be relevant for both SDL 2 at https://wiki.libsdl.org/SDL2/FrontPage and SDL 3 at https://wiki.libsdl.org/SDL3/FrontPage.

Related:

icculus commented 8 months ago

For what it's worth, that thing was going to be daunting in any other language than Perl, too.

That being said, this was a really good example of "I can fix this with regex" that got wildly out of control.

If there's ever time, I would probably rewrite this as such:

blaisemccourtney commented 8 months ago

In theory, an alternative solution might be to switch back to Doxygen or some other general documentation generation tool. Though I would guess that you have good reasons for switching away from Doxygen and not picking a similar tool in the first place.

icculus commented 8 months ago

The documentation is bridged bidirectionally between the wiki and the headers, so Doxygen probably isn't a solution here.

blaisemccourtney commented 8 months ago

Bidirectional bridging is an interesting functionality to have. That editing a wiki page automatically somehowupdates the corresponding header in the source code is not a common feature to have in projects, in my experience. I would imagine that it makes it easier to get contributors to update the source code documentation. I do not know what the trade-offs are like in this case. Is bidirectional bridging and editing wiki->updates source code headers something that you have been very happy about, or have been needed?

slouken commented 8 months ago

Yes, we've been very happy that people can contribute via GitHub and automatically update the headers for public release.

slouken commented 8 months ago

I know that wikiheaders.pl is daunting, and I would suggest that you create a separate python script that autogenerates the category pages (and other documentation, potentially) and we can bring that into the process as well.

Thanks!

slouken commented 8 months ago

Actually, it looks like category generation is already partially implemented in https://github.com/libsdl-org/SDL/pull/7528. @icculus, can you take a look? It would be great to get this in for the 3.0 prerelease.

blaisemccourtney commented 8 months ago

Yes, we've been very happy that people can contribute via GitHub and automatically update the headers for public release.

Well, people can always contribute via Github and to the headers for public release without a bidirectional bridge, it is just a bit easier with a bidirectional bridge.

Actually, it looks like category generation is already partially implemented in https://github.com/libsdl-org/SDL/pull/7528. @icculus, can you take a look? It would be great to get this in for the 3.0 prerelease.

Is that PR for by-name listing, by-category listing, or both? I cannot really tell from the code.

blaisemccourtney commented 8 months ago

Edit: Fixed wrong file for SDL 3.

I believe that these are the version histories of syncing for the SDL 2 and SDL 3 pages of by-name listing, respectively:

Like the live pages, both of those files are missing SDL_Surface even though the wiki page for SDL_Surface exists for both SDL 2 and SDL 3.

The SDL 2 by-name listing was last updated 3 months ago.

The SDL 3 by-name listing was last updated 16 hours ago.

blaisemccourtney commented 8 months ago

Edit: Please disregard a lot of the below, I completely forgot that the project is purely C and does not have C++.

@slouken I think I understand what is happening now (after I learned a minuscule amount of Perl to prod and prick at things), and basically, the reason why SDL_Surface is not included in the by-name category is that the mediawiki/markdown page for SDL_Surface is never generated automatically.

No matter which system for the category pages are used, the category pages appear to rely on there being accurate information in the mediawiki/markdown pages in the footer. When going from the source to the wiki, the process might be as follows: For functions, this footer and other information is parsed from the header files (!!, C++ is not generally easy to parse!), and then categorized somehow. The information is also drawn from existing wiki pages. This is then written to the wiki pages. However, for types that are not functions, nothing is parsed. It was never implemented. So the information for non-function types, like SDL_Surface, nothing is gotten from the header files, and instead everything is just taken from the existing mediawiki file for SDL_Surface, and then written back again. This fits with the mediawiki file for SDL_Surface not having had any changes for 2 years: https://github.com/libsdl-org/sdlwiki/blob/main/SDL2/SDL_Surface.mediawiki . Since the categories in that file does not include CategoryAPI, the SDL_Surface wiki page is never included in the by-name listing.

This also means that, since non-function types are never parsed from the headers, the information in the headers and in the wiki can get arbitrarily out of sync. I have only investigated the direction from header-source to wiki. I do not know what the state of the script is for the other direction.

It would be solved if more non-function types are parsed (and documentation for the other direction is written?) . But the script is already long enough, and C++ is not at all among the easier languages to parse, even if you use proper parser techniques instead of regex (ab)use. I cannot in good conscience suggest anyone to add more parsing of C++ code using Perl regexes, especially to a Perl script that is already 1700+ lines long.

Some of the wiki pages will stay out of sync with the header code unless manually updated, namely the wiki pages for the non-function types, if my understanding of the scripts is correct.

I would personally have off-loaded the burden of parsing C++ for the sake of retrieving documentation, and the other documentation generation tasks, to Doxygen or Sphinx or some other documentation generator. Or maybe at least search for a Perl library or a library in a different language for parsing C++ comments.

I have not investigated the other direction, namely from wiki pages to header documentation.

icculus commented 8 months ago

Parsing C (let alone C++!) was never a goal; it looks for lines with a very specific format to identify function signatures, and would likely do the same for other things like structs, no matter what changes we make.

blaisemccourtney commented 8 months ago

My bad, I completely forgot that the project is purely C. And C should, despite #include, be much easier to parse than C++. Getting specific data structures out of the format is arguably still a kind of parsing, though, even if it is limited and does not parse all structures, only looking for some structures like type signatures and documentation comments. Using regexes for this purpose can be on the fragile side, but you have already mentioned that previously.

blaisemccourtney commented 8 months ago

@icculus I have searched on the net for Perl and Python libraries to parse C header files (or C programs in general), including documentation, but what I found either had not been maintained for years, or dropped comments from the generated data structure. I then looked at Doxygen, and saw that it can generate XML output.

I have also looked at the parsing/extraction code in 'wikiheaders.pl', and it looks fragile, which is to be expected given the complexity of parsing even limited C code, as well as the approach of using regexes as you yourself mentioned previously. I also looked at the formatting in the header files.

Typical formats in the header files

For the header files in 'include/SDL3', the code and the comments appear to follow some patterns.

For functions, it can for instance look like this:

/**
 * Get the name of the current camera driver.
 *
 * The returned string points to internal static memory and thus never becomes
 * invalid, even if you quit the camera subsystem and initialize a new driver
 * (although such a case would return a different static string from another
 * call to this function, of course). As such, you should not modify or free
 * the returned string.
 *
 * \returns the name of the current camera driver or NULL if no driver has
 *          been initialized.
 *
 * \threadsafety It is safe to call this function from any thread.
 *
 * \since This function is available since SDL 3.0.0.
 */
extern DECLSPEC const char *SDLCALL SDL_GetCurrentCameraDriver(void);

Which for the non-comment block is relatively straightforward, and for the comment block is a bit complicated but doable.

For a struct, it can for instance look like this:

/**
 *  SDL_CameraSpec structure
 *
 * \sa SDL_GetCameraDeviceSupportedFormats
 * \sa SDL_GetCameraFormat
 *
 */
typedef struct SDL_CameraSpec
{
    SDL_PixelFormatEnum format; /**< Frame format */
    int width;                  /**< Frame width */
    int height;                 /**< Frame height */
    int interval_numerator;     /**< Frame rate numerator ((dom / num) == fps, (num / dom) == duration) */
    int interval_denominator;   /**< Frame rate demoninator ((dom / num) == fps, (num / dom) == duration) */
} SDL_CameraSpec;

or

typedef struct SDL_GamepadBinding
{
    SDL_GamepadBindingType input_type;
    union
    {
        int button;

        struct
        {
            int axis;
            int axis_min;
            int axis_max;
        } axis;

        struct
        {
            int hat;
            int hat_mask;
        } hat;

    } input;

    SDL_GamepadBindingType output_type;
    union
    {
        SDL_GamepadButton button;

        struct
        {
            SDL_GamepadAxis axis;
            int axis_min;
            int axis_max;
        } axis;

    } output;

} SDL_GamepadBinding;

Which would not be quite as straightforward to parse, at least using Perl regexes.

Plan and proposal

My understanding of 'wikiheaders.pl' is that non-function type code and documentation is neither parsed from the header files nor written back to the header files.

Given the difficulty of parsing with regexes, and that parsing even some of C for the sake of handling documentation is not trivial, the most clear path if the current system is kept is to use Doxygen or some other tools to parse the header files, and then generate XML output containing the documentation in a structured form. And then use a Perl or Python script to parse the XML and generate the corresponding wiki pages. For the sake of not disturbing the currently working bidirectional documentation system where people can edit the wiki and have the comments automatically inserted and committed to the header source files, this usage of Doxygen could be limited to non-function types. That way, if run periodically, non-function types like SDL_Surface would have an up to date wiki page generated for it automatically, and function types would function the same as before.

@icculus Do you have any thoughts or comments on this? And if you think the above might be a decent plan, would it be OK for me to take a stab at it and making a PR? My specific plan would be in the code to:

The attempt might fail, maybe the XML output from Doxygen is not suitable, maybe something else, but I am willing to take a stab at it. I would also look at having this be generated for 'include/SDL2', the only difference between handling SDL2 and SDL3 should be which parameters are passed into the Python script.

blaisemccourtney commented 8 months ago

The main problem I see with my proposal above is that user edits to non-function types to the wiki would be ignored or overwritten. I do not know what the best solution to that would be.

icculus commented 8 months ago

I'm curious to see if this works out, but you should be aware this sounds like a lot of work and there's no guarantee this is going to be accepted, since there are a lot of unknowns about this right now.

blaisemccourtney commented 8 months ago

@icculus You might be right about that. Do you have any specific unknowns in mind or that you know of, or is it more that this is uncharted territory and unanticipated issues may crop up? I am also uncertain myself about how a PR might pan out.

Do you have any other thoughts on this proposal? What about the issue in https://github.com/libsdl-org/SDL/issues/9236#issuecomment-1999539878 ?

blaisemccourtney commented 8 months ago

This article warns against using the SDL wiki, in "Mistake 4: Using the SDL wiki for API documentation" : https://nullprogram.com/blog/2023/01/08/ .

slouken commented 8 months ago

This predates the wiki bridge for SDL3, and while there’s still lots of room for improvement, the wiki should be much better now.

slouken commented 8 months ago

The author’s main point is that the headers are the authoritative documentation, which is completely true.

blaisemccourtney commented 8 months ago

@slouken Most or all of the sub-issues that I have listed in this issue also currently hold for the SDL3 wiki. And if wiki pages are never generated for non-function types, including for the SDL3 wiki, only generated for function types, then it would take a lot of continued work and effort, with a lot of room for errors, to maintain and keep both the SDL2 and SDL3 wikis up to date. Will the SDL3 wiki really ever get to a good state if wiki pages are not generated for non-function types?

And I can understand that the headers being authoritative is fine, but what I am used to for documentation in other libraries is that the API documentation is always up to date relative to the source documentation by being generated from the source header files or corresponding source files themselves, for instance by using tools like Doxygen. I understand that it is more difficult to create quality API documentation for languages like C or C++, and that some dislike Doxygen, and I could adapt if told clearly that the SDL2/SDL3 wiki pages are not in a good state. But for newcomers, they are currently presented indirectly as the main way of viewing API documentation, there is a clear box on the website with the header "Documentation", with a link to the SDL wiki. And the main wiki page for both SDL2 and SDL3 says "This wiki is your portal to documentation and other resources for SDL 2.0." (it actually currently says "SDL 2.0" also for the SDL3 wiki, it has not yet been updated: https://wiki.libsdl.org/SDL3/FrontPage ).

It is perhaps not surprising that the SDL2 and SDL3 wikis are not in a good state if the main developers of SDL do not use them, eating your own dog food and all that. And I am not asking you to begin using the SDL wikis. But it might be better for newcomers to present the wikis differently, like instead of presenting them as "this wiki is your portal to documentation", then present the wikis as a "best effort", and instead clearly mention and link to the header source files.

slouken commented 8 months ago

All of your points are valid, I was just clarifying the author's position. I appreciate your digging into this. I think in the long run, we want the wiki documentation to be good quality and a good introduction to the API.

blaisemccourtney commented 8 months ago

@slouken That is reasonable. I am curious if you have any plans or ideas currently on how to get the wiki documentation into a good state, whether it is for SDL2 or SDL3. Without automatic generation of API documentation for the non-function types, my guess is that it will be tough getting the wiki to a good state overall, even in the long run. My proposal in a comment a few comments up would introduce automatic generation of API documentation for non-function types (there is already automatic generation of API documentation for function types currently) if successful, but it would also not have the bidirectional updates of the source (header to wiki & wiki to header), and having users edit a wiki page for a non-function type and then have those changes be thrown away silently during automatic generation is not optimal. I have some ideas on how to deal with that.

One idea would be that each wiki page for a non-function type is "split" into two sections in each page, one from automatic generation, and one manual and limited to the wiki (maybe put into the lower part of the page), and then users can edit the lower part and the upper part comes from automatic generation; this idea would not have bidirectional updating. You could call this idea "two-section wiki API pages".

Another idea would be more limited, just disable editing of non-function type wiki pages, to prevent users from wasting their time and effort, but having some wiki pages (specifically the API wiki pages for non-function types) not being editable would be less nice.

Both of those ideas suffer from the issue that it is inconsistent with the bidirectional updating of the function type wiki pages, which my proposal would not touch.

For both of those ideas, I think you could generate a link to the header file, and maybe even starting line number, where the header documentation is - but I think you could do that already now, honestly. Maybe a PR expanding the wikiheaders.pl script to also track the source header path (and maybe line numbers), and then generate a link in the corresponding wiki pages to the header source file (maybe with line number) in Github. But I am digressing from the topic at hand.

The bidirectional update functionality is neat, but it is probably complicated to implement for non-function types.

icculus commented 7 months ago

The wiki bridge now includes #defines, structs, unions, and enums, so we can have a much better reach with documentation now. Going through the contents of the wiki by hand is probably the next (extremely tedious) step, since there's a lot of cleaning that needs to be done.

The next tedious step after that is going through all the headers and making sure everything reasonable is documented.

icculus commented 5 months ago

The "Missing Links" bullet point is resolved now, for SDL2 and SDL3 (although SDL2 is likely to have a lot of missing pages when you click through).

Now function params and return values are listed on the page outside of the preformatted C code syntax section, generating wikilinks to SDL_* type names.

I'm pretty pleased with how this turned out, and doing this extremely basic parsing of the function signatures lets wikiheaders.pl validate \param and \returns lines, which brought up some errors to clean up.

icculus commented 5 months ago

The Work In Progress disclaimer was manually cleaned out of SDL2 and SDL3 a while ago, too.

Missing pages is almost complete for SDL3, and being tracked in a different bug.

I do not expect we will go back and write a lot of new documentation for SDL2, even for things that could be pasted from SDL3 (but I will merge PRs from others that want to do this work). We are making sure wikiheaders continues to work with SDL2, so while the SDL2 pages are mostly pretty, consistent, functional and automated now, it's going to come down to a lot of documentation just simply not existing until someone writes the SDL2 version of it.

icculus commented 1 month ago

Okay, I think we're totally settled out for SDL3 now, and SDL2's janitorial work is done, but we're likely not going to spend time improving SDL2's docs at this point (but will readily accept PRs that do), so I'm closing this. Thanks!