Open mike0sv opened 1 year ago
It has been discussed a few times before: https://github.com/iterative/dvc.org/issues/2770
My take - this approach creates pretty bad docs (if you could run `mlem something --help` and get the same result, you don't need docs at all) or requires significant maintenance (proper docstrings that are actually hard to write so that they satisfy all the requirements we have for docs, e.g. admonitions, code blocks, etc.).
Even for the Python API (which makes more sense to me to generate), AFAIR the team decided to keep very simple descriptions in the code and longer descriptions / examples in the docs.
Yes, the idea is to force the simple part to be the same in docs and in code. Longer descriptions and examples will be only in docs, with all those fancy md things. We already have tests in code that check all classes, options and fields have docstrings. Just doing the PR showed that we 1) had a couple of commands left out of the docs because we forgot to add them, 2) had a couple of CLI options in the docs that we had deleted from the code, and 3) the docs team had fixed a lot of wording/spelling/punctuation in the docs that was not backported to the code (I did the backporting manually in https://github.com/iterative/mlem/pull/363). And you can see that besides those discrepancies, the PR didn't actually change anything else, like formatting. So it should be the best of both worlds: handcrafted docs with automation that checks whether they are up to date.
From https://github.com/iterative/dvc.org/issues/2770#issuecomment-910482556 : my intention is exactly what @casperdcl wrote in the end: generate subset of (2) from (1)
1) had a couple of commands left out of the docs because we forgot to add them 2) had a couple of CLI options in the docs that we had deleted from the code
This can be solved by introducing a check. No need to generate docs or keep the source code as the source for docs. I'm not sure how valuable everything else is. Tbh, from my experience it's still quite rare that we would even benefit from checks like those.
docs team fixed a lot of wording/spelling/punctuation in docs that was not backported to code (I did the backporting manually in https://github.com/iterative/mlem/pull/363)
This is minor. Usually the major work is writing proper descriptions of those options. The point here is: if they are the same as `--help` and auto-generated, you don't need them at all. In DVC they are far from being the same.
generate subset of (2) from (1)
I'm not sure it's possible tbh, unless I'm missing something. Usually 2 looks quite different from 1.
I think it could be helpful to have a basic CI check that e.g. `cml <command> --help` lists the same options as show up in the bullet points in https://cml.dev/doc/ref/<command>#options, for example...
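A minimal sketch of such a check, using only the Python standard library. The help text, the docs snippet and the bullet format (option name in the first code span of each bullet) are made-up assumptions for illustration; in CI the two strings would come from running the real command with `--help` and from reading the markdown file of the command reference.

```python
import re

# Hypothetical inputs, hard-coded so the sketch is self-contained.
HELP_TEXT = """\
Usage: tool publish [OPTIONS]

Options:
  -t, --target TEXT  Where to publish.
  --dry-run          Do not upload anything.
  --help             Show this message and exit.
"""

DOC_PAGE = """\
## Options

- `--target=<name>` - where to publish the report.
- `--verbose` - print more details.
- `--help` - show help.
"""

# A long option such as --dry-run, not preceded by a word character or dash.
LONG_OPTION = re.compile(r"(?<![\w-])--[a-z][a-z0-9-]*")


def options_in_help(help_text):
    """Collect long option names from option lines of a --help output."""
    found = set()
    for line in help_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("-"):
            # Keep only the declaration column, drop the description.
            declaration = stripped.split("  ")[0]
            found.update(LONG_OPTION.findall(declaration))
    return found


def options_in_docs(markdown):
    """Collect long option names from the first code span of each bullet."""
    found = set()
    for line in markdown.splitlines():
        match = re.match(r"[-*]\s+`([^`]*)`", line.strip())
        if match:
            found.update(LONG_OPTION.findall(match.group(1)))
    return found


def compare(help_text, markdown):
    """Return (options missing from docs, options missing from the CLI)."""
    in_help = options_in_help(help_text)
    in_docs = options_in_docs(markdown)
    return sorted(in_help - in_docs), sorted(in_docs - in_help)


missing_from_docs, missing_from_cli = compare(HELP_TEXT, DOC_PAGE)
print("documented nowhere:", missing_from_docs)
print("documented but gone from CLI:", missing_from_cli)
```

This only compares option names, not descriptions, so the wording in the docs stays fully handcrafted.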
The check you are talking about is almost the same as what I propose. There are 2 ways to implement this check: parse the existing options section, find what options are there and compare with what `--help` has (extracting them from the typer (click) API is even easier), or generate this section from code and compare it with the existing text. The second approach allows reusing the same code and avoids rewriting all of this manually. If you are not happy with what was generated, you can always fix the text in the docstrings or the formatting in the generator code.
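A rough sketch of the second way (generate the section, then compare it with the existing text). The option list is hard-coded here as a stand-in; with click it could presumably be collected from the command's `params` (each `click.Option` carries `opts` and `help`), but that part is left out to keep the example dependency-free. The bullet format and all names below are assumptions, not the actual generator.

```python
import difflib

# Hypothetical option metadata: (declaration, help text).
# In a real setup this would be read from the CLI framework's command object.
OPTIONS = [
    ("--target", "Where to publish."),
    ("--dry-run", "Do not upload anything."),
]

# Hypothetical existing docs section, missing one option.
EXISTING = """\
## Options

- `--target` - Where to publish.
"""


def render_options(options):
    """Render the options section as a markdown bullet list."""
    lines = ["## Options", ""]
    for name, help_text in options:
        lines.append(f"- `{name}` - {help_text}")
    return "\n".join(lines) + "\n"


def diff_with_docs(generated, existing):
    """Return unified diff lines; an empty list means docs are up to date."""
    return list(
        difflib.unified_diff(
            existing.splitlines(),
            generated.splitlines(),
            fromfile="docs",
            tofile="generated",
            lineterm="",
        )
    )


diff = diff_with_docs(render_options(OPTIONS), EXISTING)
print("\n".join(diff))
```

A CI job could fail when the diff is non-empty, and the same `render_options` output could be used to update the page.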
I'm not sure it's possible tbh, unless I'm missing something.
Mmm probably you are. I'm not talking about something like mlem cmd --help > cli-reference/cmd.md
Please take a look at https://github.com/iterative/mlem.ai/pull/172
@mike0sv how do you envision the workflow for this though? It should be a check that you run regularly anyway, and then either you generate boilerplate automatically as a PR or fix it manually. If you don't automate this, then who is responsible for running it?
Anyways, my point is that, from my experience, this takes time to automate, takes time to maintain, etc., and in the case of DVC was not solving much. Most of the work goes into writing meaningful option descriptions (neither `--help` nor docstrings give them).
Mmm probably you are. I'm not talking about something like mlem cmd --help > cli-reference/cmd.md
I understand that it doesn't generate the whole md file, it generates some parts of it, right? (Not sure whether it keeps options that already exist.) And that's exactly what I was talking about: it drives bad docs, to my mind.
For every PR in code that changes the API / CLI we should be creating a proper PR with a docs update. It should have examples and a proper description (`--help` doesn't give it). This process guarantees that we have meaningful docs. Automation can help to check for discrepancies (e.g. run by cron) or to bootstrap it the first time (similar to #172).
@shcheklein I understand the concern about docstring contents restricting the doc site content 🙏 we discussed this offline as well. But it doesn't have to be this way imo. So it's very possible to achieve some automation here without any "new" workflow that would reduce quality. The current alternative is that things become obsolete or are just plain dropped and forgotten, so I think this is undoubtedly worse 😄
And that's exactly what I was talking about- It drives bad docs to my mind.
I think it doesn't have to. The generated docs are definitely better than nothing, even if they are just a skeleton for more examples / fleshed out content which requires time and attention. So it's not against that, but automating the repetitive content at least
This is the way I see it, at least. So I do suggest we give this a try. DVC docs are more stable and 99% goes to handcrafted content, but mlem is in a different stage and things are more dynamic, so this can potentially help guard us from drift between the docsite and the tool.
For this to be effective I also think we want to automate this somehow: run it in a cronjob and generate a suggestion PR every week or so. It would be a good reminder, and even if it's not mergeable and we need a man-in-the-loop, it can provide the skeleton for the changes.
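One way a periodic job could refresh only the repetitive part of a page while leaving handcrafted content alone is to keep the generated block between marker comments. This is just a sketch under that assumption: the marker strings, the page and the function name are all hypothetical and not something the mlem docs are known to use.

```python
import re

# Hypothetical marker comments delimiting the auto-generated block.
BEGIN = "<!-- generated:options:begin -->"
END = "<!-- generated:options:end -->"

# Hypothetical docs page mixing handcrafted and generated content.
PAGE = f"""\
# publish

Handcrafted description with context.

{BEGIN}
- `--target` - Where to publish.
{END}

## Examples

Handcrafted examples stay untouched.
"""

NEW_BLOCK = (
    "- `--target` - Where to publish.\n"
    "- `--dry-run` - Do not upload anything."
)


def replace_generated_block(page, new_block):
    """Swap the text between the markers, keeping everything else as is."""
    pattern = re.compile(
        re.escape(BEGIN) + r".*?" + re.escape(END), re.DOTALL
    )
    if not pattern.search(page):
        raise ValueError("page has no generated block markers")
    replacement = f"{BEGIN}\n{new_block.strip()}\n{END}"
    # A function replacement avoids backslash interpretation in the text.
    return pattern.sub(lambda _match: replacement, page, count=1)


updated = replace_generated_block(PAGE, NEW_BLOCK)
needs_pr = updated != PAGE  # a cron job would open a suggestion PR if True
print("suggestion PR needed:", needs_pr)
```

Running the function again on the updated page changes nothing, so the job only produces a PR when the code and the docs have actually drifted.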
TL;DR: I'm fine to automate and try (but keep in mind we are spending time on this :) ).
The generated docs are definitely better than nothing, even if they are just a skeleton for more examples / fleshed out content which requires time and attention. So it's not against that, but automating the repetitive content at least
Yes. But this is pretty much about bootstrapping? After the project is more or less stable, I've found it's hard to justify this level of automation (I mean making more and more sophisticated scripts to merge / embed, etc.). Everything can be done, but it has its own cost, while 99% of the time in docs goes into writing content. Manually creating a PR that just copy-pastes things when you change a command is not painful at all unless you change something every day (I doubt that will be happening).
A bit of reflection on my approach / my thoughts.
I agree the automation might make more sense only for unique tools like CML, where most people don't download it to run `--help` locally. But even CML's online command ref goes a bit further than the CLI output... it has better markdown formatting, hyperlinks & URLs to more info, etc.
I'd only automate checking that all subcommands and `--options` exist in the command ref, but not checking the descriptions/wording.
As a potential new user, I find the Python API docs on mlem.ai difficult to work with, as they are not up to date, and could benefit from further typehinting.
For example:
The docs in the code have corrected typos, which make them more intelligible, and only by looking there could I find that `fs` is defined by `fsspec` and see what filesystems are supported.
(Compare mlem.api.save on mlem.ai with mlem.api.save in code.)
While I understand this requires some additional dev work, it may be worth the prioritization. In my case, I am evaluating using mlem/dvc/gto for a model registry, after which I'd like to evaluate Iterative Studio, but I need to get through the docs first ;)
There are a number of inconsistencies between the mlem docs and the actual mlem code. Sometimes it's because of new features that we forgot to add docs for, sometimes it's fixes in docs that are not reflected in the mlem code. To make everything as consistent as possible, I suggest auto-generating everything we can. Of course, a big chunk of the docs will remain handcrafted. I am talking about parts of the reference pages for API, CLI and the upcoming Objects.
Ideal process:
`specification` is generated from the mlem codebase in the form of a JSON file. It contains all docs-related stuff from the code (docstrings, help messages, etc.). It's generated from the latest mlem version in CI.
For now:
I will start with CLI for https://github.com/iterative/mlem/pull/363 and create a PR shortly with examples
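To illustrate the specification step from the ideal process, here is a small sketch that collects docstrings and signatures into JSON with the standard `inspect` and `json` modules. The two functions are hypothetical stand-ins, not the real mlem API, and the JSON layout is an assumption; the actual spec format is not described in this thread.

```python
import inspect
import json


# Hypothetical stand-ins for public API functions of a codebase.
def save_example(obj, path, sample_data=None):
    """Save an object to the given path."""


def load_example(path):
    """Load an object from the given path."""


def build_spec(functions):
    """Collect docs-related metadata for each function into a plain dict."""
    spec = {}
    for func in functions:
        spec[func.__name__] = {
            "doc": inspect.getdoc(func) or "",
            "signature": str(inspect.signature(func)),
        }
    return spec


# Sorted keys keep the file stable, so diffs only show real changes.
spec_json = json.dumps(
    build_spec([save_example, load_example]), indent=2, sort_keys=True
)
print(spec_json)
```

The docs side could then read this file to render or verify the short descriptions, while longer explanations and examples stay handwritten.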