wlammen opened this issue 2 years ago
I've been wondering how to approach this issue as well. Things will settle down eventually, once the code is a bit closer to locally optimal, but I expect to be putting in a large number of changes in the near future because the objective function has changed discontinuously and so we are currently far from that optimal point.
I don't want to be too forceful here in pushing my changes, but I think we should aim for higher throughput at the expense of review quality in the near term, at least until we have the big refactors mostly done, CI and testing set up to ensure no obvious breakage, a new version released, and some testing and bug reports from users in case of regressions.
Metamath has a lot of edge case behaviors, and I think we will need a big test suite to ensure compatibility with all of it (or at least the part of it we consider important). I think that good testing will be more useful in the long run than careful review (because it's a C project and eyeballs are only so good). Once we start to approach the optimum I think we can start to raise our documentation, testing, and review standards.
Just one answer in 10 days...
For comparison, here is a code review checklist: https://github.com/mgreiler/code-review-checklist. Rest assured that I was far from using it exhaustively in my review attempts. They were a compromise between my limited knowledge of the topic at hand, my limited technical skills, my limited time, the perceived eagerness of developers to land perhaps long-awaited changes, and general rules governing software development.
Of course, the overall goals of the metamath-exe project determine the amount of scrutiny used in review. Like security, review raises costs but may not yield any short-term advantages in return. Review is about quality; the advantages are often seen only in the mid or long term.
IMO people have voiced their expectation of having a rock-solid, reliable tool on hand. This sets the quality bar very high. My personal goal is a piece of software that is accessible to a wider audience, which also puts a limit on the complexity and diversity of the software.
My hope was to extract a draft review checklist from the answers here. So far that has not materialized.
I personally don't think we need to systematize this much. Reviewing just means looking over the code as best you can, pointing out odd things or design decisions, and then moving on. I don't think we actually have enough manpower for an extensive review process, at least not without progress on the codebase grinding to a halt.
> IMO people have voiced their expectation of having a rock-solid, reliable tool on hand. This sets the quality bar very high. My personal goal is a piece of software that is accessible to a wider audience, which also puts a limit on the complexity and diversity of the software.
metamath.exe is (currently) a ball of hacks. Just because it handles a formal system doesn't make it less hackish. Big changes need to happen before it is even possible to review for robustness. We simply aren't ready yet for that level of review.
It seems like there isn't going to be much oversight in this repo, because of lack of interest. So I'm contemplating just merging my own PRs at this point: if no one wants to comment then I'll just continue to the next phase.
> metamath.exe is (currently) a ball of hacks. ... Big changes need to happen before it is even possible to review for robustness.
The transition to a more professional style can IMO be done systematically, as a steady (and measurable) process. And if done properly, the risk of breaking current semantics, even in corner cases, can be reduced. Such a process begins (in my experience) with an evaluation of the current state. Based on that, you later define goals (functionality, style, architecture, future directions and so on) and methods (bottom-up or reverse, contribution checks, tests, tools, etc.) to achieve them. See my note in issue #40.
Starting the evaluation phase was the intent of my PR #41. Besides laying the groundwork for a proper transition, if desired, it allows more people to take part in development, because the first hurdle, becoming expert enough to contribute sensibly, is overcome more easily. And the risk of nasty bugs is reduced to boot.
I wouldn't say that the current code is just a ball of hacks; I have seen worse in my life. A few parts do in fact look quick and dirty, but in general I perceive an underlying concept as well.
In my eyes, it is not only the source code that is subject to transition. The development process itself needs adjustment. The God-like eminence who decided what to do or not is gone, and with him his profound knowledge of where to expect pitfalls. We need experience and must learn, as a team, to better judge how to proceed.
The role of review: in retrospect, I would say that reviewing your code, for example, was not pointless. Portability issues were fixed, oversights and awkward style were pointed out, possible misunderstandings were explained. Since learning is IMO a pressing issue, I summarize my findings in the following rules:
It seems we are, for now, the only ones caring about a proper code base. While I think the discussion should in general be public, for quick exchange you can contact me by email at <firstname>.<lastname> <at> <common Google account> as well.
Addendum: review checklists like https://github.com/mgreiler/code-review-checklist are not restricted to peer review. You can check whether your own contribution matches their quality criteria. Obviously, assessing your own code may result in a biased and short-sighted view. Still, chances are you will sometimes correct shortcomings yourself before someone else is hit by them, or points them out.
> Metamath has a lot of edge case behaviors, and I think we will need a big test suite to ensure compatibility with all of it (or at least the part of it we consider important). I think that good testing will be more useful in the long run than careful review
Totally agree with the value we'd get from automated tests. Is there any help you'd like with setting up automated tests? I don't want to overpromise how much time I'll have for any of this (given that proofs, apparently, are more fun for me than proof software), but if we can get a bit more concrete about how we might approach this (test frameworks, getting it set up in GitHub, etc.), maybe I could chip in.
I can make a separate issue if we want, since it is a separate issue from code reviews (and there's nothing saying you can't have both tests and code reviews).
Here's a test structure that I think will work with metamath.exe. Each test consists of:

- `test.in`
- `test.mm` (optional)
- `test.expected` (optional)
- `test.produced` (build output, cleaned by `make clean` and `.gitignore`d)

The test runner is a shell script `./run_test` with the syntax `./run_test [--bless] test1[.in] test2[.in] ...`, which works with `xargs` to run the entire test suite. (The `.in` suffix is optional and added if not present.) The script will do the following for each test `test.in`:

- check that `test.in` exists
- if `test.mm` exists, then run `./metamath test.mm < test.in > test.produced`, else `./metamath < test.in > test.produced`
- assert exit code 0, unless `test.in` has a line `! expect_fail`, in which case we assert exit code 1
- if `test.in` has a line `! run_test`, then we discard `test.produced` and skip the comparison below
- if `--bless` is provided, then copy `test.produced` to `test.expected`
- run `diff test.produced test.expected` and print the results (and return with diff's exit code)

(This is inspired by the lean test suite.)
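As a usage sketch, assuming the tests live in a `tests/` directory (the layout is not specified above, so that part is a guess), running the whole suite or re-blessing a few tests might look like:

```sh
# Run every test in the suite; each *.in file names one test.
ls tests/*.in | xargs ./run_test

# After an intentional change to display behavior, re-record the expected
# output of the affected tests ("foo" and "bar" are hypothetical test names).
./run_test --bless tests/foo tests/bar
```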
In short, we have two kinds of tests: run-tests (containing `! run_test`) are just for checking that overall compilation works, and don't worry about the result, while "porcelain" tests check the exact content of the error message to ensure that it succeeds or fails in the right way. The `--bless` command is for producing the `test.expected` files: the idea is that it is run in CI without `--bless`, so that when it fails you see a printout saying how the output did not come out as expected; during development, if there has been a change to the display behavior, then you run all the affected tests with `--bless` to update the `*.expected` files (and the change becomes part of the PR, so people will review that the change was reasonable).
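To make this concrete, here is a minimal sketch of what such a `run_test` script could look like. It is not the actual script from the repository; the directive handling, file names, and error reporting are just my reading of the description above.

```sh
#!/bin/sh
# Sketch of a run_test script along the lines described above (illustrative only).
# Usage: ./run_test [--bless] test1[.in] test2[.in] ...

bless=0
if [ "$1" = "--bless" ]; then
  bless=1
  shift
fi

status=0
for t in "$@"; do
  t=${t%.in}                                  # the .in suffix is optional
  if [ ! -f "$t.in" ]; then
    echo "$t: missing $t.in"
    status=1
    continue
  fi

  # Run metamath, passing the .mm file if one exists for this test.
  if [ -f "$t.mm" ]; then
    ./metamath "$t.mm" < "$t.in" > "$t.produced"
  else
    ./metamath < "$t.in" > "$t.produced"
  fi
  code=$?

  # Tests marked "! expect_fail" must exit with code 1, everything else with 0.
  if grep -q '^! expect_fail' "$t.in"; then expected=1; else expected=0; fi
  if [ "$code" -ne "$expected" ]; then
    echo "$t: exit code $code (expected $expected)"
    status=1
    continue
  fi

  # Run-tests ("! run_test") only check that the run completes; no diff.
  if grep -q '^! run_test' "$t.in"; then
    rm -f "$t.produced"
    continue
  fi

  # --bless records the current output as the expected output.
  if [ "$bless" -eq 1 ]; then
    cp "$t.produced" "$t.expected"
  fi

  diff "$t.produced" "$t.expected" || status=1
done
exit $status
```

The point of the design is that `--bless` only changes whether the current output gets recorded, so the CI path and the development path run through the same steps.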
I can't fully decide whether to weigh in more directly on the code review topic, but when I see threads like https://github.com/metamath/metamath-exe/pull/55#discussion_r796143316 I have trouble resisting the temptation to offer a third opinion, in the hope of helping us take a step back from the brink and avoid getting into the mindset of "I am obviously right, so why don't those people just understand?". I'm not sure I'm enough of a diplomat to improve things rather than fan the flames, but I'll try:
I already said I might be the utopian one, but I do think we are figuring out this stuff, however imperfectly. While writing a CONTRIBUTING.md file (including some progress on documentation) and checking in a (very small at the moment) test suite might seem like small steps, to me they are evidence that even if we don't agree on everything, we are getting a bit closer to a development process which helps us make progress together.
I just want to add that writing tests is an easy activity that anyone can get in on. The process is as follows:

1. Write a `test.mm` file; literally any sequence of characters is fine, bonus points for weird ones.
2. Write a `test.in` file saying what you want to check about the mm file, usually something like `ve p *` or `ve m *` or `sh p foobar`.
3. Run `./run_test.sh --bless test` to generate the `test.expected` file, and check that it matches your expectation.

We have 5 or 6 tests at the moment, each of which was only a couple seconds of work. I would like to get 100+ tests in there in the short term, so please use your imagination and write whatever you can.
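For instance, creating a brand-new test from scratch might look like the following (the test name `verify-demo`, the `.mm` contents, and the chosen command are purely illustrative; as noted above, almost anything works):

```sh
# Hypothetical test named "verify-demo": a tiny database plus the
# metamath commands to run against it.
cat > verify-demo.mm <<'EOF'
$c ( ) -> wff $.
$v p q $.
wp $f wff p $.
wq $f wff q $.
wi $a wff ( p -> q ) $.
EOF

cat > verify-demo.in <<'EOF'
ve p *
EOF

# Record the current output as the expected output, then eyeball it.
./run_test.sh --bless verify-demo
cat verify-demo.expected
```

The generated `verify-demo.expected` file is then committed alongside the other two files, so reviewers can see exactly what output the test pins down.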
I recently reviewed some code; two of the pull requests were quite big. We have neither a code style nor review checklists in operation. Hopefully we will decide on the documentation rules soon, so that at least in theory we can fill in these gaps.
Instead I followed the rules that are in effect in the company where I work. I think we should discuss the results here. This is all about learning. How can we improve the workflow? What level of quality (both in code and review) should we aim at? I am looking forward to hearing what you have to say.