david-a-wheeler closed this issue 3 years ago
More discussion here: https://news.ycombinator.com/item?id=25003387
I read the links you gave and a few other articles. GitHub Actions looks like a good choice. It's a good thing our current .yml file is not too complex.
By the way: it is a nice thing that there are several proof verifiers available (this gives more confidence that all proofs are correct), but do we need to run them all each time? It looks like a waste of energy (metamath.c and mmj2 should be sufficient).
> I read the links you gave and a few other articles. GitHub Actions looks like a good choice. It's a good thing our current .yml file is not too complex.
Thanks. I was thinking about posting to the mailing list, since not everyone notices new issues.
> By the way: it is a nice thing that there are several proof verifiers available (this gives more confidence that all proofs are correct), but do we need to run them all each time? It looks like a waste of energy (metamath.c and mmj2 should be sufficient).
They all need to approve a change if any proof changed. That way, we'll know immediately if there is a problem.
Technically we don't need to run them if we are certain that there could be no impact, but that can be challenging to verify. We could probably detect a few easy cases; that would also speed up verification time, so it might be worth doing.
We could run them in sequence & stop when one fails; that would reduce power but dramatically increase the time to verify.
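For reference, the "run in sequence & stop at the first failure" variant would be roughly a single workflow step like the sketch below. This is only a sketch: it assumes the tools are already built, and the mmj2 invocation and RunParms.txt path are illustrative assumptions, not our actual commands.

```yaml
# Hypothetical sketch: run the verifiers one after another in a single job;
# `set -e` aborts the step at the first verifier that fails.
jobs:
  verify-sequential:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run verifiers in sequence, stop at first failure
        run: |
          set -e
          # metamath.exe: proof verification plus markup/style checks
          ./metamath 'read set.mm' 'verify proof *' 'verify markup *' 'exit'
          # mmj2: definition checking (jar and RunParms.txt paths are assumptions)
          java -jar mmj2.jar RunParms.txt
```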
Modern CPUs don't use that much power. Metamath.exe and the Rust one in particular only take seconds. Those kinds of optimizations are unlikely to be helpful to anyone.
> They all need to approve a change if any proof changed. That way, we'll know immediately if there is a problem.
It only takes one correct verifier to pass the input file for the file to be verified. If we wanted to optimize for this, we would run only the Rust verifier (which is the most efficient) and be done. But we should also run metamath.exe because it does style checks, and mmj2 because it does definition checking, and maybe at that point the Rust verifier is already redundant and can be removed from the pipeline.
This is the set.mm repository, not a test suite for the verifiers. The goal here is to find problems in set.mm, not to check that the verifiers work, so it's okay to assume that the verifiers work, run the good ones only, and report CI success. If we want to test the verifiers there is another repo for that, https://github.com/david-a-wheeler/metamath-test .
A while ago I proposed starting a community project to write a verifier that would satisfy all the constraints of the CI system in terms of capabilities: checking all the typesetting stuff, definition checking, and full verification, as efficiently as possible. But if it just ends up as one more verifier on the CI list, then it's a step backwards, so it is important to settle this first. Perhaps this should be brought up on the mailing list for discussion.
> It only takes one correct verifier to pass the input file for the file to be verified.
No, that's only true if we assume that verifiers never have errors. That's almost certainly a false claim. One of the key advantages of Metamath over other systems is that there are many verifiers available.
> This is the set.mm repository, not a test suite for the verifiers. The goal here is to find problems in set.mm, not to check that the verifiers work
The point of running multiple verifiers has nothing to do with determining if the verifiers work. To do that, we'd have to have a large set of different databases to check different cases.
The point is to have high confidence that the results are proven correct. Using multiple different verifiers, implemented by different people in different languages, provides much higher confidence - it's like a multi-person peer review.
We have been saying specifically for years that we do this, as a Metamath differentiator. It would be terrible to drop that substantial advantage. It'd be ridiculous to drop it for a penny of power.
That's what I was thinking of: running metamath.c and mmj2 provides enough confidence in the correctness of the proofs. A mass verification using more verifiers could be done yearly, for instance, or only when the develop branch is merged into the master branch.
Not running all the verifiers every time is not dropping the advantage of being able to use them from time to time.
That being said, the gain in energy (and carbon footprint) is indeed tiny, so if @david-a-wheeler feels strongly, let's keep it that way. The most important thing is to keep the .yml file simple (e.g., in theory we could check which proofs have been modified by a pull request and only verify those, but this would overcomplicate the .yml).
@digama0: do you have more details (like a link) about this CI-compatible verifier project?
I do feel strongly about using multiple verifiers. It's very important to detect problems "as early as possible". Running multiple verifiers means that we immediately know of any problem. A once-a-year check means that, on average, subtle problems won't be detected for 6 months. Reducing power use is good, but this is too tiny to matter.
We can make it simpler. The CI pipeline was created "as we went"; if we're going to move it anyway, it should be easy to make it simpler.
> @digama0: do you have more details (like a link) about this CI-compatible verifier project?
It was a private message sent to Norm & David, to gauge interest. David was somewhat against it, for the same reasons mentioned here. The mailing list conversation I would link to doesn't exist yet, but I'm inclined to start it now.
> I do feel strongly about using multiple verifiers. It's very important to detect problems "as early as possible". Running multiple verifiers means that we immediately know of any problem. A once-a-year check means that, on average, subtle problems won't be detected for 6 months. Reducing power use is good, but this is too tiny to matter.
The "threat model" here is confused. What are we checking? Are we trying to find out if set.mm has an error, or are we trying to find out if a verifier has a bug? I'm assuming we want the first one for the set.mm repo, and for that, as I said, the CI should verify the statement "assuming the verifier(s) are correct, set.mm is valid". The test suite repo can operate at a much lower frequency than set.mm (which has lots of side branches and works in progress too), and is intended to test the claim "the verifiers are correct" by applying them to different databases. Together we obtain the "set.mm is valid" claim of interest.
Personally, I find the energy question less interesting than the speed. When I sell metamath to others, the number one benefit is not the absolute guarantee of truth, but rather the fact that it's 5 orders of magnitude faster than the competition. The lean mathlib library takes up to 2.5 hours to complete a CI run. I would really like metamath CI to show off how fast it can be, but that's not going to happen as long as crappy verifiers stay in the pipeline.
(I care about the absolute guarantee of truth, of course, but a deep dive into that question is what led me to MM0, so... yeah.)
I don't feel confused by the threat model. We're trying to determine if set.mm has an error, as early as possible, even though some verifiers may have errors. We gain very high confidence that the results are correct by running multiple verifiers.
We're not specifically trying to determine if a verifier has a bug. Doing that would require the use of a large test suite, and while that'd be great, that isn't the point at issue here.
The "crappiest" verifier takes something like 1/100th the time. Nothing takes anywhere near an hour, never mind 2.5 hours. There really is no contest.
Wanting "even faster than now" is reasonable enough, but there are other ways to do that. Travis CI is financially struggling, which is causing CI startup delays; I think it'll be faster switching to GitHub. There are some other startup delays that we haven't bothered to remove, but we could. We could further split up the work of the Python checker; that's the slow one & it's trivially parallelizable.
> even though some verifiers may have errors
What errors? If a verifier has an error I don't want it in CI. All the verifiers we currently run have extremely low probability of error because they are simple and have been tested for a long time. The chance of an error in the verifier is many orders of magnitude smaller than the chance that the latest addition to set.mm has a problem in it, because that's what the CI is supposed to be checking. Certainly the probability is lopsided enough that we can offload the full verifier suite to something with a lower frequency, like a nightly.
The problem with your approach is that it makes it impossible to improve the CI. If we were to "parallelize the python checker" it would just be a new verifier, we'd add it to CI in addition to everything else, and CI would just get slower. But hey, now we've got more confidence in the correctness of something about which I already had no doubt, so there's that.
If we want to determine if set.mm has an error as soon as possible, we should run only the best and fastest verifier, which does all the checks that we want CI to report. That's the verifier I want to build, but it defeats the purpose if it's not replacing the crappy verifiers.
> (I care about the absolute guarantee of truth, of course, but a deep dive into that question is what led me to MM0, so... yeah.)
Oh, dear. I thought you were a philosophy student? We are talking about the consistency of some simplistic mathematical model of "truth", right? I think there should at least be quotes around "absolute" :smile:
Oh, I've got several pages of quotation marks in the MM0 paper. :smile: Honestly it makes a lot more sense to talk about threat models and quantifying and reducing errors here than striving for an absolute guarantee of truth, which doesn't consider what the purpose of the attempt is. CI is just one small part of the overall correctness claim, it doesn't have to do everything itself, which is why I think cranking the correctness dial up to 11 on CI is unnecessary. It's not in a position to deliver that in any case.
> @digama0: do you have more details (like a link) about this CI-compatible verifier project?

> It was a private message sent to Norm & David, to gauge interest. David was somewhat against it, for the same reasons mentioned here. The mailing list conversation I would link to doesn't exist yet, but I'm inclined to start it now.
If that's not too much work, I'd like to see it in the mailing list too.
Some of the questions in this issue were already touched upon in Issue #1202, and some comments are still worth reading.
Given the complexity of checking MM, my target time for CI is around 10 seconds, not counting startup and install time. Right now I think it's around 3 minutes, which is much worse than it has to be. I'm not comparing lean to MM 1 for 1, that would be silly. But it's not delivering the mentioned 5 orders of magnitude at all. Lean does way more than MM, and it pays appropriately for it. If MM is only slightly faster for doing way less, that's just bad software engineering.
> What errors? If a verifier has an error I don't want it in CI.
No such verifiers in any language exist. Metamath.exe isn't. Coq isn't. Lean isn't. ACL2 isn't. Even if I prove it, I don't know if its verifier is correct, nor do I know if its spec is correct in the sense of "what I meant, not what I said".
> All the verifiers we currently run have extremely low probability of error because they are simple and have been tested for a long time. The chance of an error in the verifier is many orders of magnitude smaller than the chance that the latest addition to set.mm has a problem in it, because that's what the CI is supposed to be checking.
I agree with the relative probability. But the probability of error across multiple verifiers is dramatically smaller than 1 verifier.
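To make that concrete: under the (idealized) assumption that verifier failures are independent, the chance that every verifier wrongly accepts a bad proof is the product of the individual error probabilities, which falls off sharply as verifiers are added; correlated failure modes (such as a flaw in the Metamath spec itself) put a floor under this, as noted later in this thread. A rough sketch of the model:

```latex
% Idealized model, assuming independent failures across n verifiers
% with individual error probabilities p_1, ..., p_n:
P(\text{all } n \text{ verifiers wrongly accept a bad proof}) \;=\; \prod_{i=1}^{n} p_i
% e.g. two verifiers with p_1 = p_2 = 10^{-3} give 10^{-6};
% correlated errors (a shared spec mistake) bound this from below.
```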
> Certainly the probability is lopsided enough that we can offload the full verifier suite to something with a lower frequency, like a nightly.
That's an alternative that I could see. That will complicate the CI process, which I thought we wanted to keep simple.
> The problem with your approach is that it makes it impossible to improve the CI. If we were to "parallelize the python checker" it would just be a new verifier, we'd add it to CI in addition to everything else, and CI would just get slower.
There's practically no value in reducing the verification speed below 60 seconds within the CI process. There's value in reducing verification speed to 1 second for an individual user, before the proposed change hits the CI process, but that's different.
> But hey, now we've got more confidence in the correctness of something about which I already had no doubt, so there's that.
I didn't doubt 2+2=4 either. :-).
If "it's probably correct" is good enough, we can just ask the humans to review the proof. The reason to use mechanically-verified proofs is to move the verification to "beyond even a shadow of reasonable doubt" stage. Multiple verifiers are, in my mind, a necessary part of that.
> If we want to determine if set.mm has an error as soon as possible, we should run only the best and fastest verifier, which does all the checks that we want CI to report. That's the verifier I want to build, but it defeats the purpose if it's not replacing the crappy verifiers.
That's a good thing to do before it hits the CI process. But it's not a race. Correctness is more important than speed.
A reasonable alternative might be to make it much easier to see what CI processes have passed & what's still left to go. The Rust one would complete quickly & report that, the Python one would take a little longer. I doubt many people are really concerned about 1 second vs. 30 seconds in the CI pipeline. Most software development CI pipelines take much longer.
If "it's probably correct" is good enough, we can just ask the humans to review the proof. The reason to use mechanically-verified proofs is to move the verification to "beyond even a shadow of reasonable doubt" stage.
Hmm... That's an interesting question by itself. I see it slightly differently: humans are subjective and computers are more formal in some way. So, if I suspect that a proof might contain some subjective reasoning, I'd trust that computers have no point of view of their own to bring into their calculations. And this means that I wouldn't trust e.g. AI on that matter.
> > What errors? If a verifier has an error I don't want it in CI.

> No such verifiers in any language exist. Metamath.exe isn't. Coq isn't. Lean isn't. ACL2 isn't. Even if I prove it, I don't know if its verifier is correct, nor do I know if its spec is correct in the sense of "what I meant, not what I said".
That doesn't contradict what I said. If the verifier has known bugs, it shouldn't be in CI. None of our verifiers have known bugs, and many are unchanged from when they were first written despite years of testing. Neither set.mm, Travis, nor the verifiers are in a position to seriously entertain the question "what if the verifiers are wrong", so I'd rather we assume the verifiers are correct for the current discussion, which is about CI of set.mm.
> > All the verifiers we currently run have extremely low probability of error because they are simple and have been tested for a long time. The chance of an error in the verifier is many orders of magnitude smaller than the chance that the latest addition to set.mm has a problem in it, because that's what the CI is supposed to be checking.

> I agree with the relative probability. But the probability of error across multiple verifiers is dramatically smaller than 1 verifier.
The relative probability is already more than enough with 1 verifier. I'm pretty sure that additional verifiers are not delivering positive utility if you compare the reduction of error (which is lower bounded by the possibility of correlated errors because the metamath spec is wrong, btw) to the time, energy, and monetary cost of running an inefficient verifier.
> > Certainly the probability is lopsided enough that we can offload the full verifier suite to something with a lower frequency, like a nightly.

> That's an alternative that I could see. That will complicate the CI process, which I thought we wanted to keep simple.
The per commit CI should be pretty simple: download this program, and run it. I don't think that simple CI should necessarily be a goal; in particular github actions could automate the process of building the web pages as well, and I would like to eventually get away from the current held-together-with-scripts-and-duct-tape situation.
> > The problem with your approach is that it makes it impossible to improve the CI. If we were to "parallelize the python checker" it would just be a new verifier, we'd add it to CI in addition to everything else, and CI would just get slower.

> There's practically no value in reducing the verification speed below 60 seconds within the CI process. There's value in reducing verification speed to 1 second for an individual user, before the proposed change hits the CI process, but that's different.
Time is money when it comes to CI. There is a demonstrable difference between funding 10 seconds of AWS and 3 minutes.
> > But hey, now we've got more confidence in the correctness of something about which I already had no doubt, so there's that.

> I didn't doubt 2+2=4 either. :-).
The worry isn't about 2+2=4, it's bad style, a long line, or a last minute change that is not rewrapped or simply doesn't check. Lots of people use CI to test changes that they themselves did not test beforehand. I can even make edits in github directly (oof, never tried that with set.mm) without running any sort of compiler if I'm confident enough. These are the things that have non-negligible possibility of error. Errors in the formal system, ZFC being inconsistent, and even unknown errors in any of the verifiers are at least 4 orders of magnitude smaller than that. (Accidental/improper usage of axioms is another non-negligible source of error which will not be caught by CI.)
If "it's probably correct" is good enough, we can just ask the humans to review the proof. The reason to use mechanically-verified proofs is to move the verification to "beyond even a shadow of reasonable doubt" stage. Multiple verifiers are, in my mind, a necessary part of that.
This is a bad argument. You are conflating two very different situations, mediated by the use of "probably". The 4-orders-of-magnitude estimate above is not completely arbitrary; it's an estimate of my prior on the possibility of errors of the two kinds. An error in style or proof correctness that will be caught by CI is probably somewhere around 10-50%, while an error in the verifier behind the CI is somewhere around 0.001%. An error in a human-reviewed proof depends on how many humans are doing the review, but for a peer-reviewed math publication I would say it's pretty close to 95% if we want completely error-free and 1-5% if we want "mathematically roughly correct". Furthermore, this approach has far lower bandwidth - it takes a lot of time for humans to review a proof, and certainly not an entire corpus every couple of hours.
The magnitudes of the probabilities here are important! This is what I mean by "determining the threat model".
> That's a good thing to do before it hits the CI process. But it's not a race. Correctness is more important than speed.
Correctness is an unattainable goal. It's like reaching the speed of light - you can always pour some more energy in and tack a few more 9's onto your speed as a fraction of c, but unless you have a proper framework for determining where to make the tradeoff, you will just burn all your money and fail anyway. I assume that more correctness = more utility, but more time, energy, money = less utility, and at some point you will say this isn't worth it anymore. The argument you are giving leaves no space for tradeoff, and even if you did care about correctness above all else, it would be more worthwhile to build a provably correct verifier than to run lots of buggy verifiers.
> If the verifier has known bugs, it shouldn't be in CI.
Agreed. That's not the issue though.
> None of our verifiers have known bugs, and many are unchanged from when they were first written despite years of testing. Neither set.mm, Travis, nor the verifiers are in a position to seriously entertain the question "what if the verifiers are wrong", so I'd rather we assume the verifiers are correct for the current discussion, which is about CI of set.mm.
I don't accept your assumption that these programs are defect-free. In my experience, non-trivial programs almost always have residual defects, even after testing.
I am specifically talking about CI of set.mm. As you well know, "No known bugs" and "There are no bugs" are dramatically different statements.
...
> The relative probability is already more than enough with 1 verifier.
NO. Here we'll just have to disagree. I believe having only 1 verifier, even if it's fully proven, is inadequate. The spec might be wrong, a gamma ray might have tweaked a bit during storage or distribution or compilation or execution, and so on.
In fact, I think we should be adding more verifiers to the set.mm CI process. A key Metamath differentiator is that it's not that hard to build a verifier; we should be taking advantage of that strength.
> I'm pretty sure that additional verifiers are not delivering positive utility if you compare the reduction of error (which is lower bounded by the possibility of correlated errors because the metamath spec is wrong, btw) to the time, energy, and monetary cost of running an inefficient verifier.
I disagree:
> The per commit CI should be pretty simple: download this program, and run it. I don't think that simple CI should necessarily be a goal; in particular github actions could automate the process of building the web pages as well, and I would like to eventually get away from the current held-together-with-scripts-and-duct-tape situation.
I completely agree with needing to automate the web page build. But I believe that currently takes hours, so your 1-second timeout won't happen anyway. The web automation process could be quick, but no one's prioritized that.
> The problem with your approach is that it makes it impossible to improve the CI. If we were to "parallelize the python checker" it would just be a new verifier, we'd add it to CI in addition to everything else, and CI would just get slower.
"Improve the CI" to me means "add more verifiers". Speed is irrelevant below a reasonable threshold.
Even if speed mattered, CI uses parallel processes, and only gets slower if we add a new CI parallel process that takes longer than all existing processes. We're unlikely to add anything slower than Python, which is a notoriously slow language.
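A rough model of why parallel jobs behave this way (a sketch; install and startup overhead is lumped into a single term):

```latex
% Wall-clock time of a CI run with parallel jobs of durations T_1, ..., T_n:
T_{\text{CI}} \;\approx\; T_{\text{startup}} + \max_{i} T_i
\qquad \text{rather than} \qquad T_{\text{startup}} + \sum_{i} T_i
% so adding a job no slower than the current slowest one
% barely changes the wall-clock time.
```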
> Time is money when it comes to CI. There is a demonstrable difference between funding 10 seconds of AWS and 3 minutes.
Not true. Why would we pay? We don't pay now.
> ... (Accidental/improper usage of axioms is another non-negligible source of error which will not be caught by CI.)
So a CI process running in 1 second is irrelevant, because we'll still want humans to review the change.
Um, that's my point.
> ... The argument you are giving leaves no space for tradeoff...
I definitely care about tradeoffs, but the tradeoffs you're arguing for don't make sense to me. You're arguing that we should lose our correctness strength (from using multiple verifiers) for what appear to be irrelevant properties. Cost & energy use in particular aren't issues; trading away strength of the correctness verification for these non-issues makes no sense.
If you think it's really important to reduce the CI time, we could reduce it while retaining multiple verifiers. We haven't seriously tried to increase the speed of the CI process as a whole from where it is. It's embarrassingly parallel, so simply splitting the work across more CPUs can do a lot. I think the recent problems are due to Travis' financial troubles.
> > None of our verifiers have known bugs, and many are unchanged from when they were first written despite years of testing. Neither set.mm, Travis, nor the verifiers are in a position to seriously entertain the question "what if the verifiers are wrong", so I'd rather we assume the verifiers are correct for the current discussion, which is about CI of set.mm.

> I don't accept your assumption that these programs are defect-free. In my experience, non-trivial programs almost always have residual defects, even after testing.

> I am specifically talking about CI of set.mm. As you well know, "No known bugs" and "There are no bugs" are dramatically different statements.
I don't believe they are defect free. But that's not the point. If they are not defect free (or more precisely, if a defect causes a bad CI success), CI is not the place to be discovering and debugging this. If there are bugs we don't know about making all our tests pass for years, throwing more verifiers at it is only going to help once, when the new verifier is introduced, if it is to help at all; the marginal utility of the full suite on every commit is very small because all the verifiers are doing basically the same thing.
> Cost: We pay $0. I expect in the future we'll pay $0. If it costs a little, I'd probably be willing to pay it & that'd be my problem not yours (though I don't expect to need to do it). What's this cost you are worrying about?
We pay $0 but someone has to pay, and that someone is now facing financial difficulties. There's no free lunch, and I would rather that we not waste resources.
> NO. Here we'll just have to disagree. I believe having only 1 verifier, even if it's fully proven, is inadequate. The spec might be wrong, a gamma ray might have tweaked a bit during storage or distribution or compilation or execution, and so on.
Gamma rays? Did you miss the part where I mentioned probabilities? The probability of this event is far smaller than the events I've mentioned already (I'm not sure how it compares to ZFC falling over but they seem about the same level of remote). If the spec is wrong, then multiple verifiers don't help.
> > ... The argument you are giving leaves no space for tradeoff...

> I definitely care about tradeoffs, but the tradeoffs you're arguing for don't make sense to me. You're arguing that we should lose our correctness strength (from using multiple verifiers) for what appear to be irrelevant properties. Cost & energy use in particular aren't issues; trading away strength of the correctness verification for these non-issues makes no sense.
Just to be clear, I'm not suggesting to remove the verifiers completely, only to reduce the full verifier suite to a nightly process. I think we can afford to run all extant verifiers in a day, except for the few that aren't actually practical on set.mm and run out of memory or something. Currently we're only running 3 or 4 verifiers, which seems rather half-assed from a maximizing correctness angle. Surely you can afford to wait a day for a high quality correctness guarantee.
Basically, I want us to define what we are optimizing, and come up with a system that maximizes these things. It doesn't have to be all one thing, and realistically it doesn't make sense for it to be. I think having a quick turnaround on every commit and an unrestrained approach (because we can afford it) on nightlies is the right tradeoff here.
> > The relative probability is already more than enough with 1 verifier.

> NO. Here we'll just have to disagree. I believe having only 1 verifier, even if it's fully proven, is inadequate. The spec might be wrong, a gamma ray might have tweaked a bit during storage or distribution or compilation or execution, and so on.
Wait, I just noticed that this response ignores the "relative" part. You seem to be saying that the probability isn't zero, and, well, duh. But how does the probability of a failure of the metamath spec, or even a failure in a single verifier implementation, compare with the chances of a bad proof in set.mm being caught by the thing that is built to catch bad proofs in set.mm? Just check the CI logs: I would bet it is at least 10%, and often quite high when a user is pushing repeatedly to CI to do their compilation for them. These are utterly incomparable.
> Just to be clear, I'm not suggesting to remove the verifiers completely, only to reduce the full verifier suite to a nightly process. I think we can afford to run all extant verifiers in a day, except for the few that aren't actually practical on set.mm and run out of memory or something. ... Surely you can afford to wait a day for a high quality correctness guarantee.
Ah! I didn't realize that was your focus.
In principle I can see that. I think it's very important that the verifiers be run sometime, but nightly would probably be fine.
I don't know if the CI processes can easily be set up to do that though. Generally they just all get triggered by a push & run to completion. GitHub actions can be set to run on specific times ("nightly"), but then they'll be run whether there are changes or not. Maybe the nightly can first detect if the last commit is old, and quietly exit if it is, but that'd take more research.
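One plausible shape for that (a sketch only; the cron time, the 24-hour window, and the wrapper script name are assumptions) is a scheduled workflow whose first step checks whether anything was committed recently:

```yaml
# Hypothetical nightly run of the full verifier suite that quietly skips
# when there have been no recent commits.
on:
  schedule:
    - cron: '0 3 * * *'        # nightly at 03:00 UTC (arbitrary choice)
jobs:
  full-verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - id: recent
        name: Check for commits in the last 24 hours
        run: |
          if [ -n "$(git log -1 --since='24 hours ago' --oneline)" ]; then
            echo "run=true" >> "$GITHUB_OUTPUT"
          else
            echo "run=false" >> "$GITHUB_OUTPUT"
          fi
      - name: Run the full verifier suite
        if: steps.recent.outputs.run == 'true'
        run: ./scripts/run-all-verifiers.sh   # hypothetical wrapper script
```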
No matter what, it's clear that the Travis company is having problems. My first thought is to switch to GitHub Actions - they're also free for OSS.
It's hard to do everything at once. I think we ought to try to convert our existing Travis jobs to GitHub Actions as-is. We can then switch some jobs to a schedule, once we know the jobs work at all.
Github Actions can support all our activities AFAIK: CI on every commit, PR/merge builds, a nightly web page release and full verifier suite test. (Mathlib uses just such a setup, after we similarly switched from Travis.) I don't think there is any rush to get any of these done except perhaps the initial migration to CI on every commit as we are doing now, because we are currently seeing significant delays with Travis.
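For the initial as-is migration, the per-commit workflow can start very small, roughly like the sketch below; the repository URL, source path, and build command for metamath.exe are assumptions here, and the real .yml should simply mirror whatever the current Travis config runs:

```yaml
# Minimal per-commit sketch (assumptions marked inline); extend with the
# other verifiers by adding more jobs, mirroring the existing Travis setup.
name: verify
on: [push, pull_request]
jobs:
  metamath-exe:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build metamath.exe
        run: |
          # repository URL and source layout are assumptions
          git clone --depth 1 https://github.com/metamath/metamath-exe.git
          gcc metamath-exe/src/*.c -O2 -o metamath
      - name: Verify set.mm
        run: ./metamath 'read set.mm' 'verify proof *' 'verify markup *' 'exit'
```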
There was an earlier discussion about doing CI pipeline checks here: https://github.com/metamath/set.mm/issues/1851
I thought it had some starting code... but I was mistaken :-(. Looks like I'll actually have to do some work to make this transition happen. Yikes 😄!
The key documents from GitHub about GitHub actions appear to include:
https://docs.github.com/en/free-pro-team@latest/actions
https://docs.github.com/en/free-pro-team@latest/actions/quickstart
https://docs.github.com/en/free-pro-team@latest/actions/learn-github-actions/introduction-to-github-actions
Some external sources:
https://gabrieltanner.org/blog/an-introduction-to-github-actions
https://levelup.gitconnected.com/how-to-write-github-actions-30b54ddf6f52
It's reading time for me. I'm documenting this so others will have doc pointers if they want them.
We may need to consider switching from Travis. Alternatives include GitHub actions.
Details:
https://www.jeffgeerling.com/blog/2020/travis-cis-new-pricing-plan-threw-wrench-my-open-source-works
https://blog.travis-ci.com/2020-11-02-travis-ci-new-billing