How To Track Data Schema Violations?

ytausch commented 8 months ago

With PR #2239, the bot will receive a Pydantic model for its internal data schema, which is validated periodically in a CI job against the entire conda-forge dependency graph.

The future path forward is surely to use the Pydantic model in the bot's production code and not only for documentation purposes. The parts of the model that are directly inferred from a feedstock (e.g. conda-forge.yml) should be validated by conda-smithy such that cf-scripts never receives an invalid feedstock to update.

This means cf-scripts could, in the future, use a strict Pydantic schema to work with and just fail fast if anything violates it.

However, we are not there yet since migrating the existing codebase to Pydantic is some non-trivial amount of work. So we need an intermediate solution that should satisfy two goals:

It should be transparent whether the documentation-only Pydantic model is conformant with the actual schema the bot uses implicitly. Also, there should be some kind of nudge to keep it up to date. Since you, @beckermr, already mentioned your timely constraints, this should not be too aggressive (no bot failing, no automatic GitHub issue opening when a package violates the schema since it can also be the upstream feedstock that has an invalid conda-forge.yml file).
You should still be able to track which packages violate the bot schema, allowing us to do two things: First, a package violating the schema could indicate some problem with it and help with debugging. Second, all or most packages violating the schema is a good indicator that the schema was changed and should be updated.

The current idea is to add a new section ("Bot Schema Validation") to the conda-forge status page that classifies each package into 3 categories (names subject to change): good, bad and known bad. Packages are good if their cf-graph files are good (implying that the important fields of meta.yaml and conda-forge.yml are also good), known bad if the bot schema is violated because of invalid data in the feedstock (and not an outdated bot model), and bad if it violates the bot schema (which should be investigated). The known bad list is already part of #2239.

This would allow us to track individual packages as well as the currentness of the Pydantic model.

I am aware of https://github.com/conda-forge/conda-forge.github.io/pull/2090 being about to merge. Of course, any additions to the status page will build upon that.

What do you think about this?

beckermr commented 8 months ago

My only comment is that the bot having a bad schema is not necessarily a user issue, so that putting it on the status page is confusing.

ytausch commented 8 months ago

Maybe we can add a note to the status page explaining exactly this? I think it is still justifiable to put it on the status page since something happening outside the documented behavior is in some sense a degradation. Still open to your concerns, of course.

beckermr commented 8 months ago

Yes I think that'd be ok. To make this happen, we'll need to update the status_report.py file to write the schema violations to a json blob in cf-graph-countyfair. They do not need to be pushed into mongo. Then we can update the status page to pull them and display them.

ytausch commented 4 months ago

@beckermr You seemed to have introduced a PR status check for the model with https://github.com/regro/cf-scripts/pull/2702. My intent was originally to run the check only periodically because a failed model validation is most likely not something that is wrong with a PR. Is there any deeper reason why you did the change?

If not, I propose to only have the periodic check again.

beckermr commented 4 months ago

It is cheap to run and turns out to be useful for debugging. I'd prefer to keep it. The check is not required.

ytausch commented 4 months ago

But can't you debug with the periodic check just as good enough? I know the check is not required but it makes merging PRs with red crosses the norm. Ignoring checks (although they are not required) is not something that one should get used to.

regro / cf-scripts

How To Track Data Schema Violations? #2244