Semi-autogenerated docs

mike0sv commented 1 year ago

There are a number of inconsistencies between the mlem docs and the actual mlem code. Sometimes it's because of new features that we forgot to add docs for, and sometimes it's fixes in the docs that are not reflected in the mlem code. To make everything as consistent as possible, I suggest auto-generating everything we can. Of course, a big chunk of the docs will remain hand-crafted. I am talking about parts of the reference pages for API, CLI and the upcoming Objects.

Ideal process:

For a specific part of the docs (CLI, API, etc.), a specification is generated from the mlem codebase in the form of a JSON file. It contains all docs-related content from the code (docstrings, help messages, etc.) and is generated from the latest mlem version in CI.
In .md files, a special "generate" expression is used (kind of like this).
In CI (or maybe even in real time), those expressions are replaced with the actual docs generated from the spec.
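
To make the first step concrete, here is a minimal sketch of how such a spec could be collected from docstrings and dumped to JSON. The `build_spec` helper, the spec layout and the `save` stand-in function are all assumptions made for illustration; this is not the actual mlem code or spec format.

```python
import inspect
import json


def build_spec(functions):
    """Collect the signature and docstring summary of each callable."""
    spec = {}
    for func in functions:
        doc = inspect.getdoc(func) or ""
        spec[func.__name__] = {
            "signature": str(inspect.signature(func)),
            # Only the first docstring line goes into the spec; longer
            # descriptions and examples stay hand-written in the docs.
            "summary": doc.splitlines()[0] if doc else "",
        }
    return spec


# Hypothetical stand-in for a real API function (not actual mlem code).
def save(obj, path: str, project: str = None):
    """Save an object to the given path.

    Longer explanation that would stay out of the spec.
    """


spec = build_spec([save])
spec_json = json.dumps(spec, indent=2, sort_keys=True)
print(spec_json)
```

Keeping only the one-line summary in the spec matches the idea that the "simple part" is shared between code and docs, while everything else remains hand-crafted.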

For now:

Same as above, but manually (not in CI).
The .md files stay the same.
A special script finds the parts of .md files that should be autogenerated and replaces their contents with content generated from the spec.
The script runs locally, and the final .md files are committed.
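
As a rough sketch of what such a script might do, the snippet below replaces the contents of marked regions in a Markdown page with text rendered from a spec. The HTML-comment marker syntax, the spec layout and the `example-cmd` command with its options are all hypothetical and only serve the illustration.

```python
import re

# Hypothetical marker syntax, invented for this sketch:
#   <!-- generate:KEY --> ... <!-- /generate -->
BLOCK_RE = re.compile(
    r"(<!-- generate:(?P<key>[\w.-]+) -->\n)(?P<body>.*?)(<!-- /generate -->)",
    re.DOTALL,
)


def render_options(options):
    """Render a mapping of option name to help text as Markdown bullets."""
    return "".join(f"- `{name}`: {text}\n" for name, text in options.items())


def regenerate(markdown, spec):
    """Replace the body of every marked region with content from the spec."""

    def substitute(match):
        key = match.group("key")
        if key not in spec:
            # Unknown key: leave the region untouched.
            return match.group(0)
        # Keep both markers, swap only what is between them.
        return match.group(1) + render_options(spec[key]) + match.group(4)

    return BLOCK_RE.sub(substitute, markdown)


page = (
    "# example-cmd\n\n"
    "Hand-written intro.\n\n"
    "## Options\n\n"
    "<!-- generate:example-cmd -->\n"
    "- `--old-flag`: Removed from the code long ago.\n"
    "<!-- /generate -->\n\n"
    "## Examples\n\n"
    "Hand-written examples.\n"
)
spec = {
    "example-cmd": {
        "--output": "Where to write the result.",
        "--help": "Show this message and exit.",
    }
}

updated = regenerate(page, spec)
print(updated)
```

Everything outside the markers is left alone, so the hand-written intro and examples survive regeneration, and running the script twice gives the same file.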

I will start with the CLI for https://github.com/iterative/mlem/pull/363 and create a PR shortly with examples.

shcheklein commented 1 year ago

This has been discussed a few times before: https://github.com/iterative/dvc.org/issues/2770

My take - this approach either creates pretty bad docs (if you could run mlem something --help and get the same result, you don't need docs at all) or requires significant maintenance (proper docstrings, which are actually hard to write so that they satisfy all the requirements we have for docs, e.g. admons, code blocks, etc.).

shcheklein commented 1 year ago

Even for the Python API (which makes more sense to me to generate), afair the team decided to keep things very simple in the code and put longer descriptions / examples in the docs.

mike0sv commented 1 year ago

Yes, the idea is to force the simple part to be the same in docs and in code. Longer descriptions and examples will be only in the docs, with all those fancy md things. We already have tests in code that require all classes, options and fields to have docstrings. Just doing the PR showed that 1) we had a couple of commands left out of the docs because we forgot to add them, 2) we had a couple of CLI options in the docs that we had deleted from the code, and 3) the docs team fixed a lot of wording/spelling/punctuation in the docs that was not backported to the code (I did the backporting manually in https://github.com/iterative/mlem/pull/363). And you can see that besides those discrepancies, the PR didn't actually change anything else, like formatting. So it should be the best of both worlds: handcrafted docs with automation that checks whether they are up to date.

mike0sv commented 1 year ago

From https://github.com/iterative/dvc.org/issues/2770#issuecomment-910482556 : my intention is exactly what @casperdcl wrote in the end: generate subset of (2) from (1)

shcheklein commented 1 year ago

1) had couple of commands left out of the docs because we forgot to add them 2) had a couple of cli options in docs that we deleted from code

This can be solved by introducing a check. There is no need to generate docs or to keep source code as the source for docs. I'm not sure how valuable everything else is. Tbh, from my experience it's still quite rare that we would even benefit from checks like those.

docs team fixed a lot of wording/spelling/punctuation in docs that was not backported to code (I did the backporting manually in https://github.com/iterative/mlem/pull/363)

This is minor. Usually the major work is writing a proper description of those options. The point here is: if they are the same as --help and auto-generated, you don't need them at all. In DVC they are far from being the same.

generate subset of (2) from (1)

I'm not sure it's possible tbh, unless I'm missing something. Usually 2 looks quite different from 1.

casperdcl commented 1 year ago

I think it could be helpful to have a basic CI check that e.g. cml <command> --help lists the same options as show up in the bullet points in https://cml.dev/doc/ref/<command>#options for example...
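
A check like this can be sketched without touching the descriptions at all: collect the long options mentioned in the `--help` output and in the bullet points of the docs page, then compare the two sets. The help text and docs snippet below are made up for the example (they are not real cml or mlem output), and the regular expressions are a simplification that assumes options appear in code spans inside bullet points.

```python
import re

# Long options such as --output or --dry-run, not preceded by a word char.
OPTION_RE = re.compile(r"(?<![\w-])--[a-z][\w-]*")


def options_in_help(help_text):
    """Long options mentioned anywhere in the --help output."""
    return set(OPTION_RE.findall(help_text))


def options_in_docs(markdown):
    """Long options that appear in code spans inside bullet points."""
    found = set()
    for line in markdown.splitlines():
        if line.lstrip().startswith("- "):
            for span in re.findall(r"`([^`]+)`", line):
                found.update(OPTION_RE.findall(span))
    return found


# Made-up help output and docs section, for illustration only.
help_text = """Usage: tool run [OPTIONS]

Options:
  -o, --output PATH  Where to write results.
  --dry-run          Do not change anything.
  --help             Show this message and exit.
"""

docs = """## Options

- `-o <path>`, `--output <path>` - where to write results.
- `--verbose` - print more details.
- `--help` - show the help message.
"""

missing_from_docs = options_in_help(help_text) - options_in_docs(docs)
stale_in_docs = options_in_docs(docs) - options_in_help(help_text)
print("missing from docs:", sorted(missing_from_docs))
print("stale in docs:", sorted(stale_in_docs))
```

In CI, a non-empty result in either set would fail the job; the wording of the descriptions is never compared.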

mike0sv commented 1 year ago

The check you are talking about is almost the same as what I propose. There are two ways to implement this check: parse the existing options section, find which options are there and compare them with what --help has (extracting them from the typer (click) API is even easier), or generate this section from code and compare it with the existing text. The second approach lets us reuse the same code and avoids rewriting all of this manually. If you are not happy with what was generated, you can always fix the text in the docstrings or the formatting in the generator code.
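
The second approach (generate the section, then compare it with the existing text) could look roughly like the sketch below. It assumes the option names and help strings have already been extracted from the CLI into a plain dict; that extraction step, the bullet format and the example options are all assumptions for the sake of illustration.

```python
import difflib


def render_options(options):
    """Render option name -> help text as sorted Markdown bullet lines."""
    return [f"- `{name}`: {text}" for name, text in sorted(options.items())]


def check_section(existing_lines, options):
    """Return a unified diff between the docs section and the generated one."""
    expected = render_options(options)
    return list(
        difflib.unified_diff(
            existing_lines, expected, fromfile="docs", tofile="generated", lineterm=""
        )
    )


# Hypothetical options, as if already extracted from the CLI definition.
options = {
    "--dry-run": "Do not change anything.",
    "--output": "Where to write results.",
}

# Existing docs section that is missing one option.
existing = ["- `--output`: Where to write results."]

diff = check_section(existing, options)
print("\n".join(diff))
```

An empty diff means the section is up to date. A non-empty diff can either fail the check or be applied directly, which is why the same generator code covers both checking and updating.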

mike0sv commented 1 year ago

I'm not sure it's possible tbh, unless I'm missing something.

Mmm, probably you are. I'm not talking about something like mlem cmd --help > cli-reference/cmd.md. Please take a look at https://github.com/iterative/mlem.ai/pull/172

shcheklein commented 1 year ago

@mike0sv how do you envision the workflow for this, though? It should be a check that you run regularly anyway, and then either you generate the boilerplate automatically as a PR or you fix it manually. If you don't automate this, then who is responsible for running it?

Anyways, my point is that from my experience this takes time to automate, takes time to maintain, etc, etc and in case of DVC was not solving much. Most of the work goes into writing meaningful option descriptions (neither --help nor docstrings give them).

Mmm probably you are. I'm not talking about something like mlem cmd --help > cli-reference/cmd.md

I understand that it doesn't generate the whole md file, it generates some parts of it, right? (Not sure whether it keeps options that already exist.) And that's exactly what I was talking about - it drives bad docs, to my mind.

For every PR in code that changes API / CLI we should be creating a proper PR with docs update. It should have examples, proper description (--help doesn't give it). This process guarantees that we have meaningful docs. Automation can help to check for discrepancies (e.g. run by cron) or bootstrap it the first time (similar to #172).

omesser commented 1 year ago

@shcheklein I understand the concern about docstring contents restricting the doc site content 🙏 we discussed this offline as well. But it doesn't have to be this way imo. So it's very possible to achieve some automation here without any "new" workflow that would reduce quality. The current alternative is that things become obsolete or are just plain dropped and forgotten, so I think this is undoubtedly worse 😄

And that's exactly what I was talking about- It drives bad docs to my mind.

I think it doesn't have to. The generated docs are definitely better than nothing, even if they are just a skeleton for more examples / fleshed out content which requires time and attention. So it's not against that, but automating the repetitive content at least

This is the way I see it at least. So I do suggest we give this a try: dvc docs are more stable and 99% goes to handcrafted content, but mlem is in a different stage and things are more dynamic, so this can potentially help guard us from drift between the docsite and the tool.

For this to be effective I also think we want to automate this somehow: run it in a cronjob and generate a suggestion PR every week or so. It would be a good reminder, and even if the PR is not mergeable and we need a man-in-the-loop, it can provide the skeleton for the changes.

shcheklein commented 1 year ago

TL;DR: I'm fine to automate and try (but keep in mind we are spending time on this :) ).

The generated docs are definitely better than nothing, even if they are just a skeleton for more examples / fleshed out content which requires time and attention. So it's not against that, but automating the repetitive content at least

Yes. But this is pretty much about bootstrapping? After the project is more or less stable, I found it's hard to justify this level of automation (I mean making more and more sophisticated scripts to merge / embed, etc.). Everything can be done, but it has its own cost, while 99% of the time in docs goes into writing content. Manually creating a PR that just copy-pastes things when you change a command is not painful at all unless you change something every day (I doubt that will be happening).

A bit of reflection on my approach / my thoughts.

Personal perception. I'm quite annoyed when I come to docs and the only thing I see is a copy-paste of some existing content (which I already got in an IDE or in the CLI). My feeling is exactly this: "folks automated and forgot about this since it's good enough". My feeling usually is that the creators don't try to make my life easier.
It creates a false feeling of completeness / existence of docs, and there will be less incentive to allocate time to improving them. I hope an alternative can be really lightweight (e.g. you do one option per week, one small document per week, etc.) and it can get us very far.
Writing is an essential and super important skill for every engineer.

casperdcl commented 1 year ago

I agree the automation might make more sense only for unique tools like CML, where most people don't download it to run --help locally. But even CML's online command ref goes a bit further than the CLI output... it has better markdown formatting, hyperlinks & URLs to more info, etc.

I'd only automate checking that all subcommands and --options exist in the command ref, but not checking the descriptions/wording.

ryanjdillon commented 1 year ago

As a potential new user, I find the Python API docs on mlem.ai difficult to work with, as they are not up to date, and could benefit from further typehinting.

For example: The docs in the code have corrected typos, which make them more intelligible, and only by looking there could I find that fs is defined by fsspec and see what filesystems are supported.

mlem.api.save on mlem.ai
mlem.api.save in code

While I understand this requires some additional dev work, it may be worth the prioritization. In my case, I am evaluating using mlem/dvc/gto for a model registry, after which I'd like to evaluate Iterative Studio, but I need to get through the docs first ;)

iterative / mlem.ai

Semi-autogenerated docs #171