Open hannesm opened 1 year ago
I would really recommend separating the idea of a package ecosystem health check from package availability in opam-repository. Dropping (versions of) packages from the repo will lead to a decline in trust. But dropping them from a separate opam ecosystem health check will be completely understandable because there are only limited resources at the end of the day.
Scala does this: the package registry ( https://index.scala-lang.org/ ) is separate from the community build ( https://github.com/VirtusLab/community-build3 )
This leaves the legitimate concern of expressing to users that a package has been checked as healthy or not, trying to minimize frustration with low-quality packages. We could implement a 'blue check' next to packages (+versions) which are in the proposed opam community build and passing the health check, and perhaps conversely separate icons for packages which are not in the community build or which are not passing health checks.
Based on this the community build could be super simple–maybe just the latest version of each package on the latest compiler version. Or expand to more versions if desired.
I'm not sure I understand. What do you mean by "package ecosystem health check"?
I mean:
opam-repo-ci and opam-health-check can detect broken packages that we would want to move to the archive repository first
So my suggestion is that, instead of moving any packages/versions into an archive, just remove them from the set of packages that are checked. I.e., have a limited set of checked packages–maybe the most-downloaded ones? And give them a nice 'blue check' logo in the OCaml.org package index. This shows that the opam health check has been run on them. For the packages outside this set, remove the guarantee that the health check has been run.
The goal of this discussion is not just to reduce CI load (it is actually a really small part of it), but to find a solution to the infinite growth of the opam-repository. You can check the stated interested parties in the first meeting notes:
https://github.com/ocaml/opam-repository/issues/23789#issuecomment-1909693700
Removing them from the set of packages that are checked would not reduce the load of extra work for users of the opam tool (performance issues + packages are still visible), publishers (more time downloading the repository) and opam-repository maintainers (getting lost in an infinite repository).
@kit-ty-kate, if I remember correctly, a few parties agreed with @yawaramin in that first meeting -- it was mentioned that removing packages from the repository is not the only way to scale our ecosystem -- a lot of package ecosystems live very well with a few orders of magnitude more packages than opam (I remember @dra27 saying that we should aim for our tools to support as many packages as npm instead of trying to reduce the total number of packages in opam, and I agreed with him).
While I'm not opposed to splitting the repositories into various parts to ease some of the maintenance burden in the short term, I really would also like us to think about what will happen next and what needs to be done to ease the lives of opam-repository maintainers (do we need more maintainers? if yes, what is blocking people from contributing here?) and to reduce the size/cost/power consumption of our CI cluster.
Edit: Also, it would be helpful to be super clear and state that "growing the number of available packages in opam" is an essential goal of the opam-repository team -- as this is a proxy metric of a healthy ecosystem.
> While I'm not opposed to splitting the repositories into various parts to ease some of the maintenance burden in the short term, I really would also like us to think about what will happen next and what needs to be done to ease the lives of opam-repository maintainers
Exactly my thought process here. If we are going to chop up the opam-repository into two or more slices, we should recognize that this is a temporary solution and in the long run we need a more practical solution which does not require more chopping up in the future.
It seems that the main issue is that as the repository scales up, more data is being moved around and stored, which is making opam clients and other processes more inefficient. I think we need to find a repository format that does not scale essentially linearly with the number of packages and that is efficient to query, download, and install packages even as it continues to scale up.
Trees, anyone?
One thing I've wanted from the very beginning of opam was the ability to refer to multiple repositories from within the metadata. @samoht ran out of time to build it into opam 1.0, but we're only feeling the pain a decade down the road ;-) It would be really nice if we could clone a lightweight "latest version of opam repository" but then bring in other overlay repositories.
Ideally if we break up opam-repository into multiple versions, it should be possible to have a reasonable way to build combined versions without having to remember all the various repository URLs. Perhaps all that is needed is an opam-repo CLI tool that can combine/split up an opam repo into multiple repositories, and rewrite metadata appropriately so that they are either standalone or overlays. Most of the opam CIs have had this ability to distinguish standalone opam repos from overlay ones designed to be merged with an existing repo.
@raphael-proust to answer your questions above:
> I don't think I understand the merge-queue process very well. I get the general idea, but I fail to see how it makes life easier for opam repository maintainers. Specifically, I don't understand:
>
> > they can safely send reasonable-looking PRs to the merge queue without blocking on complete reverse dependency tests
>
> Why don't we need to block on rev-deps tests? Are we just sending everything to the mergeq and then writing fixes to it afterwards based on the results of the mergeq CI? Is it meant to be simpler because we can write a batch fix for multiple submissions or is there another reason that I missed?
That's exactly the reason why it's easier. The opam-repo is one big datastructure, and when maintainers apply fixes, they are applied to a batch of packages. Therefore the CI should be building the one datastructure that we intend to merge, and not individual changes. From the maintainer perspective, there will just be a "queue of incoming packages" and then we look at breakages and push the right metadata to make that work into the mergeq.
Right now it's very laborious to spot some revdeps breakage, then open a separate PR to fix those, then look for more breakage uncovered, repeat a few times, and it takes 5-6 PRs to merge a package, and a huge amount of unnecessary CI work. My goal with the mergeq is to reduce the amount of CI time needed by an order of magnitude, by removing all the duplicated work due to the lack of synchronisation around branches.
> Is this automation for the previous bullet point (i.e., automation for fixing the revdeps failures that happen in the mergeq because we merge without checking them)? Who's maintaining it?
This will be the opam-repo maintainers, just as it happens today. Except instead of pushing revdeps failures in separate PRs and branches, you should focus the effort on the merge queue. When it passes, it becomes the main branch, and a new set of PRs can be chucked in the queue.
What does "pass" mean? That's why at the very beginning of the doc (https://github.com/avsm/opam-repo-roadmap-thoughts) before anything about merge queues, I put down a set of candidate metrics for how to measure the opam-repo health. I'm looking forward to reading your draft resolution to see how they line up -- I no doubt missed a bunch of ideas that are coming to light from your meetings.
(unfortunately, I've got a standing conflict at 2pm UK time with our faculty meetings at the university, so I can never make the community meetings you are organising during the academic term. I may be able to attend a synchronous meeting at some other time)
Dear all,
thanks for continuing the discussion. I attach here the notes of the last meeting. In particular I would like to stress the following point from the conclusions:
> it would be nice for more people to be present at the next meeting to discuss the push-backs discussed on the ticket and arrive at a more shared and comfortable consensus. For this purpose here is a framadate link to find the best time and date for the next meeting: https://framadate.org/qD2Pb57B7h6xJ8U4 Please fill this poll as soon as possible so we can set the date for the next meeting.
Public meeting on 2024-02-21.
Were present during this meeting: Marcello, Hannes, Shiwei, Kate, rjbou
the draft is not yet fixed but was discussed with and agreed by the people present at the meeting
it would be nice for more people to be present at the next meeting to discuss the push-backs discussed on the ticket and arrive at a more shared and comfortable consensus. For this purpose here is a framadate link to find the best time and date for the next meeting: https://framadate.org/qD2Pb57B7h6xJ8U4 Please fill this poll as soon as possible so we can set the date for the next meeting.
- `ocaml/opam-repository-archive` github repository
- `repo` file
- `"dep" {... & <= "<latest version of dep available in opam-repository at the time of the PR or lower>"}` except for "ocaml"
- (`patch --version | grep gnu`) or git (which is the best option and there is already an open PR for this that would be nice to have backported to opam 2.1)
- `before-opam-repository-archive` tag on opam-repository before any of the removal
- `ocaml` packages < 4.05 (oldest version still in use in the wild according to https://repology.org/project/ocaml/versions) to the opam-repository-archive
- `opam admin check --installability`, make sure the output is correct and move all the packages it lists to opam-repository-archive
- `x-maintenance-intent: "latest"`: the maintainer will only maintain the latest version
- `x-maintenance-intent: "1-major"`: the maintainer will only maintain the latest X(.Y)?(.Z)? version and (X-1)(.Y)?(.Z)?
- `x-maintenance-intent: "1-minor"`: the maintainer will only maintain the latest X.Y(.Z)? version and X.(Y-1)(.Z)?
- `x-maintenance-intent: "all-major"`: the maintainer will maintain every X for each X(.Y)?(.Z)?
- `x-maintenance-intent: "all-minor"`: the maintainer will maintain every Y for each X.Y(.Z)? (where X is the latest)
- `x-maintenance-intent: "all"`: the maintainer will maintain every single version
- `x-maintenance-intent` field and lists every version that is not maintained anymore
- `x-maintenance-intent` will be considered for removal when the time comes, the others will stay

For this purpose here is a framadate link to find the best time and date for the next meeting: https://framadate.org/qD2Pb57B7h6xJ8U4
Reminder for everyone who wants to come to the next meeting to please fill the form as soon as you know when you are available, so that we can plan when the meeting is going to be.
Based on the above poll, it looks like most of the main interested parties so far are available on:
so we propose to have the next public meeting at that time and date on https://meet.jit.si/opam-repo-meeting
Present: Kate, Anil, Marcello, reynir, Thomas, rjbou, Shon, Hannes, Shiwei, Ryan
Yawar was marked present but was in fact the victim of a bug, see https://discuss.ocaml.org/t/discussions-on-the-future-of-the-opam-repository/13898/11. If it happens again in the future for anyone else, please ping us on Slack or Discord. We'll also try to keep an eye on the Discuss page for people who have neither.
Clarifications about the draft:
- `opam admin check --installability` be removed directly?
Communications will need to be modified to include some quality metrics (installability, etc.) where it has historically focused on quantity ("we have that many packages and it's growing")
In addition to the repository health (size, scalability, etc.) there are also CI cost concerns in terms of energy use
Should retired packages have their documentation built for ocaml.org?
Actionable: talk to the CI team responsible for the docs-ci in ocaml.org and see what they say.
If we want to be able to publish users' packages in a faster, more npm-like manner, one of the values for the fields that specify the maintenance intent could be "don't touch my package ever, don't change metadata, nothing"; in this case the package could be moved to the archive as soon as it is broken.
When a package is moved to the archive it would be good to have a comment/commit-message/x-field/… that tells people why the package has been moved to the archive
- `ocaml/opam-repository-archive` github repository
- `repo` file
- (`opam repository add …`)
- `x-` fields in the files
- `"dep" {... & <= "<latest version of dep available in opam-repository at the time of the PR or lower>"}` except for "ocaml"
- `x-reason-for-archival` is set
- `x-opam-repository-commit-hash-at-time-of-archival` is set
- `opam admin check --installability`
- (`rm -r ~/.opam/repo/default/ && opam update default`)
- `before-opam-repository-archive` tag on opam-repository before any of the removal
- `ocaml` packages < 4.05 (oldest version still in use in the wild according to https://repology.org/project/ocaml/versions) to the opam-repository-archive
- `opam admin check --installability`, make sure the output is correct and move all the packages it lists to opam-repository-archive
- `x-maintenance-intent` field:
  - `(latest)`, `(any)` and `(none)`
  - `["(latest)"]`: the maintainer will only maintain the latest version
  - `["(latest)" "(latest-1)"]`: the maintainer will only maintain the latest `X.Y.Z` version and `(X-1).Y.Z`
  - `["(latest)" "(latest).(latest-1)"]`: the maintainer will only maintain the latest `X.Y.Z` version and `X.(Y-1).Z`
  - `["(any).(latest)"]`: the maintainer will maintain every major version X for each X.Y.Z
  - `["(latest).(any).(latest)"]`: the maintainer will maintain every Y for each X.Y.Z (where X is the latest)
  - `["(any)"]`: the maintainer will maintain every single version
  - `["(none)"]`: the maintainer will not maintain any version
- `x-maintenance-intent` will be considered for removal when the time comes, the others will stay
- `x-maintenance-intent` field and lists every version that is not maintained anymore
- `ocaml` packages < 4.08 (oldest version of OCaml in use for maintained distributions according to https://repology.org/project/ocaml/versions) to the opam-repository-archive
- `opam admin check --installability`, make sure the output is correct and move all the packages it lists to opam-repository-archive
The next meeting will be on the same day / time again, the week after the next one. If that doesn't fit with your calendar this time, please don't hesitate to tell us.
I haven't been able to attend the meetings due to time zone and life constraints but the plan seems great. Thanks for the hard work.
The meeting is starting now, is anyone else coming?
Hello, Thanks for this discussion!
Sorry I am late. I don't have much to contribute on the cleanup plan, thank you for the work there!
I'd like to comment on the future of opam and its growth, and bring up an idea for us to consider. It's in the realm of creating more flexible workflows for users, in a way that scales up. I'm not sure if it's been mentioned before, but I find it interesting and potentially beneficial for the future growth of opam.
What I was thinking of was to create a new repository for opam, which would list custom opam repositories that various users have (examples here, here, etc).
This "meta" repository would be a registry of links to other repos, so in essence a map from a name to a git URL. This is an "opam-repos-registry" rather than an "opam-packages-registry".
I am imagining that the process of adding a new entry to this repo could be similar to that of the existing main opam repository. Users would add an entry with a link to their repo, reserving a part of a new global namespace composed of the repo name (GitHub username, company name, etc.) and the name of individual packages defined there.
The barrier for inclusion could include a linting step for the added configuration files, with some basic verification that the URL given is indeed pointing to a valid opam-repository. We could certainly draw inspiration from the current process in the main opam-repository for deciding what other considerations are important before merging PRs.
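To make the name-to-URL map and its linting step a little more concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the naming rule, the accepted URL schemes, the `repo/package` form of a qualified name, and the example URLs are all hypothetical. It only checks the shape of an entry; verifying that the URL really hosts a valid opam repository would need a network fetch and is left out.

```python
import re
from urllib.parse import urlparse

# Hypothetical naming rule for registry entries (lowercase, digits, '-' and '_').
NAME_RE = re.compile(r"^[a-z0-9][a-z0-9_-]*$")
# Hypothetical set of URL schemes accepted for git remotes.
ALLOWED_SCHEMES = {"https", "git", "ssh"}


def lint_registry(registry):
    """Return a list of human-readable problems; an empty list means the map looks sane."""
    problems = []
    for name, url in registry.items():
        if not NAME_RE.match(name):
            problems.append(f"{name!r}: invalid repository name")
        parsed = urlparse(url)
        if parsed.scheme not in ALLOWED_SCHEMES or not parsed.netloc:
            problems.append(f"{name!r}: does not look like a git URL: {url!r}")
    return problems


def qualified_name(repo, package):
    """Build a globally unique reference from a registry name and a package name."""
    return f"{repo}/{package}"


# Hypothetical registry content, for illustration only.
registry = {
    "acme": "https://github.com/acme/opam-repository.git",
    "Bad Name": "https://example.org/repo.git",
    "local": "/tmp/repo",
}
for problem in lint_registry(registry):
    print(problem)
```

A real linting step would presumably also clone each URL and check for a `repo` file and a `packages/` directory, in the same spirit as the checks run on the main opam-repository.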
By design, the opam maintainers would defer to the maintainer of each repo for defining the policy on maintenance, lifetime, and quality of packages listed in their custom repo.
This is not a fully fleshed-out proposal, it's an idea that we could implement now. It could provide useful data from the community and potentially be part of a larger, incremental solution. If the repo gains traction, we could consider modifying some tools to gradually expose its data. For example, the packages there could be included in tools like `sherlodoc`, etc.
In a world where you'll soon be able to specify package dependencies directly with `dune` without requiring packages to be listed centrally, this could also offer a way for packages living outside the main repo to be discovered. Additionally, the unique repo names defined in the repos-registry would provide a canonical reference for each package.
I'm open to discussing this further if some find this idea appealing. I believe it has several beneficial properties, including the fact that it doesn't require immediate changes to existing tools (opam handles custom opam-repositories beautifully), and it relates to the topics discussed here.
And just to be clear, despite the timing of this post, this isn't an April Fools' joke. Unless the joke is on me and this idea has already been discussed or such a meta repo already exists and I am just not aware of it!! :smile:
Dear all, here are the notes from the last meeting. Sorry for the lengthy delay.
Present: Kate, Marcello, rjbou, Ryan
- `git apply` instead of `gpatch` has been merged in opam 2.2 and a backport PR is already available for the upcoming opam 2.1.6. Once the new cygwin is released in ~1 week, we will get some testing via opam 2.2 and if everything works fine the patch for opam 2.1 will be merged and released. This point has been moved to Phase 0.
- `cohttp` will maintain also version `2.X` for the time being
- `x-external-maintainer`
- `x-external-maintainer` is carried over from one package to the next and possibly a warning to ensure it is present if specific versions are specified

- Write and agree on a new policy for opam-repository
- [ ] Reach out to the infra team to make sure they are ok with the proposed document
- Write a discuss post to announce the new policy
- make sure everyone is using a version of opam that does not break when files are deleted, e.g. the upcoming opam 2.1.6 or 2.2.0, and give an alternative for people using older opam (e.g., `rm -r ~/.opam/repo/default/ && opam update default`)
- `ocaml/opam-repository-archive` github repository
- `repo` file
- (`opam repository add …`)
- `x-` fields in the files
- `"dep" {... & <= "<latest version of dep available in opam-repository at the time of the PR or lower>"}` except for "ocaml"
- `x-reason-for-archival` is set
- `x-opam-repository-commit-hash-at-time-of-archival` is set
- `opam admin check --installability`
- `before-opam-repository-archive` tag on opam-repository before any of the removal
- `ocaml` packages < 4.05 (oldest version still in use in the wild according to https://repology.org/project/ocaml/versions) to the opam-repository-archive
- `opam admin check --installability`, make sure the output is correct and move all the packages it lists to opam-repository-archive
- `x-maintenance-intent` field:
  - `(latest)`, `(any)` and `(none)`, or specific version numbers
  - `["(latest)"]`: the maintainer will only maintain the latest version
  - `["(latest)" "(latest-1)"]`: the maintainer will only maintain the latest `X.Y.Z` version and `(X-1).Y.Z`
  - `["(latest)" "(latest).(latest-1)"]`: the maintainer will only maintain the latest `X.Y.Z` version and `X.(Y-1).Z`
  - `["(any).(latest)"]`: the maintainer will maintain every major version X for each X.Y.Z
  - `["(latest).(any).(latest)"]`: the maintainer will maintain every Y for each X.Y.Z (where X is the latest)
  - `["(any)"]`: the maintainer will maintain every single version
  - `["(none)"]`: the maintainer will not maintain any version
  - `["1.3"]`: the maintainer will maintain the latest version of "1.3.Z"
  - `["2.(latest)"]`: the maintainer will maintain the latest minor version specifically of version "2" of the package
- `x-maintenance-intent` will be considered for removal when the time comes, the others will stay
- `x-maintenance-intent` field and lists every version that is not maintained anymore
- `ocaml` packages < 4.08 (oldest version of OCaml in use for maintained distributions according to https://repology.org/project/ocaml/versions) to the opam-repository-archive
- `opam admin check --installability`, make sure the output is correct and move all the packages it lists to opam-repository-archive
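The `x-maintenance-intent` patterns in the notes above can be read as a small matching language, and the tool that "lists every version that is not maintained anymore" could then be sketched as follows. This is only one possible interpretation, not the agreed semantics or an existing tool. The assumptions are: versions are purely numeric `X.Y.Z`; `(latest-N)` counts back N distinct existing values of that component; a pattern shorter than the version keeps only the newest version of each matched group; and a trailing `(any)` keeps every version of the group. The function names are made up for the example.

```python
def parse_version(text):
    """Turn "2.1.0" into (2, 1, 0). Assumes purely numeric X.Y.Z versions."""
    return tuple(int(part) for part in text.split("."))


def pick(values, component):
    """Select, among the sorted values of one version component, those matched."""
    if component == "(any)":
        return values
    if component == "(latest)":
        return values[-1:]
    if component.startswith("(latest-") and component.endswith(")"):
        offset = int(component[len("(latest-"):-1])
        # Assumption: "(latest-N)" counts back N distinct existing values.
        return [values[-1 - offset]] if offset < len(values) else []
    wanted = int(component)  # a literal number such as "1" or "3"
    return [wanted] if wanted in values else []


def select(versions, pattern, depth=0):
    """Return the versions (as tuples) matched by one dot-separated pattern."""
    if not pattern:
        # Pattern exhausted: keep only the newest version of this group.
        return [max(versions)] if versions else []
    versions = [v for v in versions if len(v) > depth]
    if not versions:
        return []
    component, rest = pattern[0], pattern[1:]
    if component == "(any)" and not rest:
        # Assumption: a trailing "(any)" keeps every version of the group.
        return sorted(versions)
    values = sorted({v[depth] for v in versions})
    matched = []
    for value in pick(values, component):
        group = [v for v in versions if v[depth] == value]
        matched.extend(select(group, rest, depth + 1))
    return matched


def unmaintained(all_versions, intent):
    """List the versions not covered by any pattern of an intent list."""
    parsed = [parse_version(v) for v in all_versions]
    maintained = set()
    for pattern in intent:
        if pattern != "(none)":
            maintained.update(select(parsed, pattern.split(".")))
    return [v for v in all_versions if parse_version(v) not in maintained]


versions = ["1.2.0", "1.3.0", "1.3.1", "2.0.0", "2.1.0", "2.1.1"]
print(unmaintained(versions, ["(latest)", "(latest-1)"]))
# prints ['1.2.0', '1.3.0', '2.0.0', '2.1.0']
```

Under these assumptions, `["(latest)" "(latest-1)"]` keeps `2.1.1` and `1.3.1`, which matches the "latest `X.Y.Z` version and `(X-1).Y.Z`" wording of the notes. Real opam versions are not always numeric triples, so an actual tool would need opam's own version ordering.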
Dear everyone, thanks for being involved and writing down the very nice plan. I now wonder what the timeline is. And is there anything I can do to move this forward?
This is just an update to let you know that things are moving slowly but moving. You can see below the updated plan, we hope to be able to start and release a roadmap "soon".
The `opam-repository` is steadily growing, using a substantial amount of space and inodes. Yet a lot of packages have become stale or uninstallable. We could slim it down sensibly and reduce power waste in the CI by starting to prune the uninstallable and unmaintained packages.

- Write and agree on the new policy for opam-repository
- [x] make sure everyone is using a version of opam >= 2.1.6 or 2.2.0
- `ocaml/opam-repository-archive` github repository
- `repo` file
- (`opam repository add …`)
- `x-` fields in the files
- `opam admin check --installability`
- `before-opam-repository-archive` tag on opam-repository before any of the removal
- `"dep" {... & <= "<latest version of dep available in opam-repository at the time of the PR or lower>"}` except for "ocaml" (needed for phase 2)
- `x-reason-for-archival` is set (needed for phase 3)
- `x-opam-repository-commit-hash-at-time-of-archival` is set (needed for phase 3)
- `opam admin check --installability`, make sure the output is correct and move all the packages it lists to opam-repository-archive #opam
- `ocaml` packages < 4.08 (oldest version still in use in the wild according to https://repology.org/project/ocaml/versions) to the opam-repository-archive
- `opam-repository`: the supported OCaml version is >= 4.08 #ci

The above steps will be repeated each time we go on with this point with a more recent version of the compiler as bound.

- `x-maintenance-intent` will be considered for removal when the time comes, the others will stay
- `x-maintenance-intent` field:
  - `(latest)`, `(any)` and `(none)`, or specific version numbers
  - `["(latest)"]`: the maintainer will only maintain the latest version
  - `["(latest)" "(latest-1)"]`: the maintainer will only maintain the latest `X.Y.Z` version and `(X-1).Y.Z`
  - `["(latest)" "(latest).(latest-1)"]`: the maintainer will only maintain the latest `X.Y.Z` version and `X.(Y-1).Z`
  - `["(any).(latest)"]`: the maintainer will maintain every major version X for each X.Y.Z
  - `["(latest).(any).(latest)"]`: the maintainer will maintain every Y for each X.Y.Z (where X is the latest)
  - `["(any)"]`: the maintainer will maintain every single version
  - `["(none)"]`: the maintainer will not maintain any version
  - `["1.3"]`: the maintainer will maintain the latest version of "1.3.Z"
  - `["2.(latest)"]`: the maintainer will maintain the latest minor version specifically of version "2" of the package
- `opam-repository` package, takes the latest version of each package that has the `x-maintenance-intent` field and lists all versions that are not maintained anymore #tools #ci
- `opam-repository` and adds it to the `opam-repository-archive`
- `opam-repository` package, takes the latest version of each package that has the `x-maintenance-intent` field and lists all versions that are not maintained anymore #tools #ci

Hello, all :wave:
We have drafted a policy that seeks to explicate and formalize the recurring practices and stable criteria described in the plan. You can read the draft here: https://github.com/ocaml/opam-repository/wiki/Package-Archiving:-Policy
Review and critique by any interested parties would be appreciated. Comments may be left on that document via HackMD, or made on this issue.
Please let me know if you have trouble accessing the document!
:sloth:
Thanks for writing this up. The only thing I stumbled upon is the mention of "OCaml platform" out of the blue. Just to be clear, in my perspective this is all about the opam-repository and scaling issues, and would be glad if we keep the topic & perspective clear.
Thanks for the review. I removed that phrase.
I've moved the policy and plan to
Thanks a lot!!
I've observed over the years that there's the sentiment that "no package shall be removed from opam-repository" (I still don't quite understand the reasoning behind it -- maybe it is `lock` files that fail to register the git commit of the opam-repository? Maybe it is for some platforms that require the opam-repository for removing opam packages?)

So, I'd like to raise the question why this sentiment exists. Answers (or pointers to arguments) are highly welcome.
Why am I asking this? Well, several years back Louis worked on "removing packages that aren't ever being installed" (with the reasoning that if foo in version 6.0.0 and in version 6.0.1 (with the very same dependencies) are available, 6.0.1 is always chosen, which makes sense since it may very well be a bugfix).
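As a rough illustration of that reasoning, the sketch below flags a version as never chosen when a newer version of the same package declares exactly the same dependencies. It is a simplification made up for this thread, not Louis's actual implementation nor what opam does: dependencies are compared as plain sets of strings, versions are assumed to be numeric, and the `shadowed_versions` name and sample data are hypothetical.

```python
from collections import defaultdict


def parse_version(text):
    """Turn "6.0.1" into (6, 0, 1). Assumes purely numeric versions."""
    return tuple(int(part) for part in text.split("."))


def shadowed_versions(packages):
    """Return (name, version) pairs that a newer version with identical deps replaces.

    `packages` is an iterable of (name, version, deps) where deps is a set of
    dependency strings, compared verbatim.
    """
    groups = defaultdict(list)
    for name, version, deps in packages:
        groups[(name, frozenset(deps))].append(version)
    shadowed = []
    for (name, _deps), versions in groups.items():
        newest = max(versions, key=parse_version)
        shadowed.extend((name, v) for v in versions if v != newest)
    return sorted(shadowed)


# Hypothetical package data, for illustration only.
packages = [
    ("foo", "5.0.0", {"ocaml"}),
    ("foo", "6.0.0", {"ocaml", "dune"}),
    ("foo", "6.0.1", {"dune", "ocaml"}),
]
print(shadowed_versions(packages))
# prints [('foo', '6.0.0')]
```

Here `foo.5.0.0` is kept because its dependencies differ, so a solver could still need it.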
Now, I also observe that old package releases rarely, really rarely, get their minor version bumped -- i.e. the "long term support" of opam packages does not really exist (happy if you prove me wrong): bug fixes are incorporated with new features and API changes, and so new (major) versions are released.
Taking a step back, it is pretty clear that collecting more and more packages will lead to a larger amount of work for the solver (which needs to parse all the packages and then find a solution). This means that the solver needs to improve speed-wise roughly every year. This is rather challenging, and in the end leads to opam not being very usable on smaller computers (Raspberry Pi, or even tinier computers...).
Also with carbon footprint in mind, I fail to see why opam-repository may not delete packages. In the end, it is a git repository -- if you wish to install a compiler in an arcane version, it is fine to roll back your opam-repository git commit to an arcane commit that fits nicely. The amount of work for the opam-repo CI could as well be minimized by removing old opam packages (since a revdep run will be much smaller).
Please bear with me if you've already answered this, and have written up the design rationale (and maybe the answer to how opam-repository will scale). Comments and feedback are welcome. Once I understand the reasoning, it'll be much easier for me to figure out how to move forward. Thanks a lot. //cc @AltGr @avsm @kit-ty-kate @mseri @samoht