metamath / set.mm

Metamath source file for logic and set theory
Other
241 stars 88 forks source link

"SHOW RESTRICTED" in Travis #59

Closed digama0 closed 7 years ago

digama0 commented 8 years ago

The latest version of metamath.exe includes a command SHOW RESTRICTED that will allow us to verify that (New usage/Proof modification is discouraged.) marks are honored. Norm writes:

I finished implementing the "is discouraged" markup in metamath.exe and will release it after some more testing.

For comparing set.mm to previous versions, I decided against reading two files in metamath.exe because I think external scripting of the diff can provide more flexibility. A new command, "show restricted", simply lists all statements containing the "discouraged" comment markup for diff comparison with a previous set.mm. I think you mentioned this was sufficient for a travis script.

Comparing proofs is somewhat problematic because changing labels and reordering hypotheses can make them mismatch completely. One thing that is constant is the number of steps, so I am listing that instead of the actual proof. In practice, I think this is sufficient to detect 99.99% of overridden proofs because it seems very unlikely that a revised proof would have exactly the same number of steps. For the use case I'm most concerned about, minimize_with, the number of steps would always be different.

Here is a sample session. I prefixed each line with the string "SHOW RESTRICTED:" for easy identification of the lines with a script.

To make diff results easier to interpret, I am listing each usage on a new line with both the original statement and its target.

--- begin sample session ---

MM> help show restricted
Syntax:  SHOW RESTRICTED

This command shows the usage and proof statistics of statements with
"(Proof modification is discouraged.)" and "(New usage is
discouraged.)" markup tags in their description comments.  The output
is intended to be used by scripts that compare a modified .mm file
to a previous version.

MM> show restricted
SHOW RESTRICTED:  Proof of "id1" is restricted (26 steps).
SHOW RESTRICTED:  Proof of "dfbi1gb" is restricted (100 steps).
SHOW RESTRICTED:  Proof of "jcabOLD" is restricted (42 steps).
SHOW RESTRICTED:  Usage of "jcabOLD" is restricted (0 uses).
SHOW RESTRICTED:  Proof of "pm3.43OLD" is restricted (17 steps).
SHOW RESTRICTED:  Usage of "pm3.43OLD" is restricted (0 uses).
SHOW RESTRICTED:  Proof of "con3th" is restricted (65 steps).
SHOW RESTRICTED:  Proof of "tru2OLD" is restricted (9 steps).
SHOW RESTRICTED:  Usage of "tru2OLD" is restricted (0 uses).
SHOW RESTRICTED:  Usage of "ax-meredith" is restricted (13 uses).
SHOW RESTRICTED:  "ax-meredith" is used by "merlem1".
SHOW RESTRICTED:  "ax-meredith" is used by "merlem2".
SHOW RESTRICTED:  "ax-meredith" is used by "merlem1".
SHOW RESTRICTED:  "ax-meredith" is used by "merlem2".
SHOW RESTRICTED:  "ax-meredith" is used by "merlem3".
SHOW RESTRICTED:  "ax-meredith" is used by "merlem4".
SHOW RESTRICTED:  "ax-meredith" is used by "merlem5".
SHOW RESTRICTED:  "ax-meredith" is used by "merlem7".
SHOW RESTRICTED:  "ax-meredith" is used by "merlem8".
SHOW RESTRICTED:  "ax-meredith" is used by "merlem9".
SHOW RESTRICTED:  "ax-meredith" is used by "merlem10".
SHOW RESTRICTED:  "ax-meredith" is used by "merlem11".
SHOW RESTRICTED:  "ax-meredith" is used by "merlem13".
SHOW RESTRICTED:  "ax-meredith" is used by "luk-1".
SHOW RESTRICTED:  "ax-meredith" is used by "luk-2".
SHOW RESTRICTED:  "ax-meredith" is used by "luk-2".
SHOW RESTRICTED:  Proof of "merlem1" is restricted (62 steps).
SHOW RESTRICTED:  Usage of "merlem1" is restricted (3 uses).
SHOW RESTRICTED:  "merlem1" is used by "merlem2".
SHOW RESTRICTED:  "merlem1" is used by "merlem5".
......  etc. .............

This command can easily be converted to a usage generator script:

 ./metamath 'read set.mm' 'show restricted' exit | grep 'SHOW RESTRICTED' > set.usage

will calculate the usage in set.mm and output it to set.usage. Treated as a function from mm files to usage files, whenever a change is introduced that changes the usage output, this change ought to be marked as "override" or something similar.

The problem is that Travis is not designed for evaluating a build in the context of previous builds. It's not that it's impossible - it's as simple as checking out the HEAD^ version of set.mm and comparing the generated set.usage files. But if this fails, it represents a non-repeatable "error", since pushing the exact same commit a second time will resolve the failure.

One way to solve this is to store the usage file in the repo. Then the correctness check that can be performed by Travis is that the generated usage matches the stored usage file, and the equivalent of an "override" in this context is an update to the usage file in the commit using the above command. This does mean that anyone who is not capable of running this command (adapted for their platform) will not be able to override usage checks, but this seems like a reasonable compromise.

Thoughts?

david-a-wheeler commented 8 years ago

I agree, a separate file is the cleanest solution. In one sense, this is a big change because set.mm will now be defined in multiple files. And it's about time. This massive file needs to be split up anyway, so this is a good time to start.

digama0 commented 8 years ago

I wouldn't say "set.mm is defined in several files". This is more like a compiled version of the file; the data passes strictly in one direction set.mm -> set.usage.

nmegill commented 8 years ago

On 7/7/16 2:21 AM, Mario Carneiro wrote:

The latest version of metamath.exe includes a command SHOW RESTRICTED that will allow us to verify that (New usage/Proof modification is discouraged.) marks are honored. Norm writes:

...

|MM> help show restricted Syntax: SHOW RESTRICTED This command shows
the usage and proof statistics of statements with "(Proof
modification is discouraged.)" and "(New usage is discouraged.)"
markup tags in their description comments. The output is intended to
be used by scripts that compare a modified .mm file to a previous
version. MM> show restricted SHOW RESTRICTED: Proof of "id1" is

[etc.]

BTW I used "restricted" instead of "discouraged" because at the time I wasn't sure that "discouraged" would be the final tag keyword we would agree on. Without too much trouble, I can change it to "discouraged" everywhere, so the command would be "show discouraged" with output "SHOW DISCOURAGED: ..." etc. What do you think?

Something to be aware of: unfortunately 'show restricted' currently takes about 1.5 minutes to run. I think most of the time is spent finding the usage for around 5000 statements that have usage discouraged. I can probably speed that up.

This command can easily be converted to a usage generator script:

|./metamath 'read set.mm' 'show restricted' exit | grep 'SHOW RESTRICTED'

set.usage |

will calculate the usage in |set.mm| and output it to |set.usage|. Treated as a function from mm files to usage files, whenever a change is introduced that changes the usage output, this change ought to be marked as "override" or something similar.

The problem is that Travis is not designed for evaluating a build in the context of previous builds. It's not that it's impossible - it's as simple as checking out the |HEAD^| version of set.mm and comparing the generated |set.usage| files. But if this fails, it represents a non-repeatable "error", since pushing the exact same commit a second time will resolve the failure.

I have some questions how Travis works. What is done with the output of Travis? If there is a failure, how does it notify the user? Wouldn't the user know there was "discouraged" override from looking at the current log? Is the log kept for each run, or is it wiped each time?

Second, there is the issue of to what extent a "discouraged" override should be considered an error. Usually it would not be, I think, assuming the proof assistant software prevents "discouraged" usage unless overridden. And if it is overridden, it was obviously done on purpose for some reason. "Discouraged" means discouraged, not prohibited, and we already have (in metamath.exe and hopefully other proof assistants eventually) mechanisms to block casual or accidental usage.

Third, to pin down a non-obvious discouraged override, a human may have to inspect the previous set.mm vs. the current set.mm. If another commit is done, the previous set.mm no longer has the difference, and the user will have to go back to some unknown earlier set.mm that the set.usage file corresponds to.

I guess what I'm asking is, why would another commit be done without inspecting the Travis output for the current commit and fixing any problems? I think in almost all cases a discouraged override would be valid, even with manual edits like deleting a *OLD, changing a label, adding a new statement to a discouraged section, etc. and all we would need is just a quick sanity check by a human.

(Question: in the Travis script, what is the purpose of: && [ egrep -q '?Error|?Warning' < mm.log; echo $? -eq 1 ] I'm not a bash expert. but this seems to do a test for an error or warning without doing anything with the result. I thought maybe it sets the script exit status, but that would be overridden by the next verifier run.)

One way to solve this is to store the usage file in the repo. Then the correctness check that can be performed by Travis is that the generated usage matches the stored usage file, and the equivalent of an "override" in this context is an update to the usage file in the commit using the above command. This does mean that anyone who is not capable of running this command (adapted for their platform) will not be able to override usage checks, but this seems like a reasonable compromise.

Thoughts?

I'll go along with it if others want it. I am just wondering if it is overkill per my discussion above. It would be one more thing we have to maintain.

Maybe for now we can just put in a simple current vs. previous set.mm 'discouraged' check and see how it goes. It would already be better than what we have now, which is nothing. I don't think 'discouraged' overrides will happen very often anyway. If it ends up being a problem like you describe, then we can upgrade the Travis script to use set.usage.

Norm

nmegill commented 8 years ago

For master releases, I think that I will start to do a 'show restricted' comparison to the previous master release as part of my overall QA process (using the Travis log or by hand). I have doing that informally in the past every now and then, checking to see that axiomatic studies are isolated with 'show usage ax-meredith/rec' etc. Even with set.usage, I think inspection by someone like me who "knows" what the purpose is (especially since I added them in the first place) is a good idea, and by comparing master releases I can do everything at once instead of having to inspect every develop commit.

Norm

digama0 commented 8 years ago

On Thu, Jul 7, 2016 at 9:43 AM, nmegill notifications@github.com wrote:

BTW I used "restricted" instead of "discouraged" because at the time I wasn't sure that "discouraged" would be the final tag keyword we would agree on. Without too much trouble, I can change it to "discouraged" everywhere, so the command would be "show discouraged" with output "SHOW DISCOURAGED: ..." etc. What do you think?

Seems reasonable.

Something to be aware of: unfortunately 'show restricted' currently takes about 1.5 minutes to run. I think most of the time is spent finding the usage for around 5000 statements that have usage discouraged. I can probably speed that up.

It's 2.5 minutes on my machine. Another downside of doing differential usage calculation in the Travis script is that it has to calculate usage twice (once for the new and the old), and there's no obvious place to cache the old result.

Maybe smm3 will have a better go at it.

I have some questions how Travis works. What is done with the output of Travis? If there is a failure, how does it notify the user? Wouldn't the user know there was "discouraged" override from looking at the current log? Is the log kept for each run, or is it wiped each time?

The travis log is displayed at https://travis-ci.org/metamath/set.mm (you may need to log in to see it). As far as I'm aware there is no way to recover additional output besides the log itself, so I would probably set it up to output "diff set.usage set.usage-calc" so that you can see what specifically changed in the 1MB file.

You can view all historical logs of previous builds at https://travis-ci.org/metamath/set.mm/builds. There are also links to each build in a pull request (click on the green checkmark / red X next to the commit hash, for example in https://github.com/metamath/set.mm/pull/57), along with builds for the overall PR in "show details", as well as builds on each branch, for example https://github.com/metamath/set.mm/compare/develop.

If a Travis build fails, the author and the committer (usually the same person) are emailed a "build is broken" from Travis, which is resent with each commit until the build is fixed.

The log will indicate the failure, but that doesn't mean the user will look at the log. We want to make sure that errors that are flagged by Travis have to be explicitly acknowledged by the user in order to get the go-ahead.

Second, there is the issue of to what extent a "discouraged" override should be considered an error. Usually it would not be, I think, assuming the proof assistant software prevents "discouraged" usage unless overridden. And if it is overridden, it was obviously done on purpose for some reason. "Discouraged" means discouraged, not prohibited, and we already have (in metamath.exe and hopefully other proof assistants eventually) mechanisms to block casual or accidental usage.

Indeed, this is a point against using "discouraged" in Travis at all. The main issue is that we don't require any particular authoring tool to be used, and Travis is a funneling point for any kind of edit to go through the same rigorous checks.

Third, to pin down a non-obvious discouraged override, a human may have to inspect the previous set.mm vs. the current set.mm. If another commit is done, the previous set.mm no longer has the difference, and the user will have to go back to some unknown earlier set.mm that the set.usage file corresponds to.

If the Travis tool ensures that all (acceptable) commits have an in-sync set.mm/set.usage, then the first errored commit will correctly identify the usage change. Ideally, the user should at this point run the usage generator in order to "accept" the change, or revert the set.mm change if it was not intended. If the user instead chooses to ignore the Travis failure, and changes even more, Travis will continue to fail, and at this point the diff will be pointing out changes since the last successful build.

I think this is a better target than comparing against the last commit, because if a large change is enacted with many small commits, the usage diff will be fragmented. More importantly, earlier commits may not even be built (if you push two commits at once the older one is not built), so some usage changes may go unnoticed. By keeping the usage file in the repo we can ensure that no changes slip by in this manner.

I guess what I'm asking is, why would another commit be done without inspecting the Travis output for the current commit and fixing any problems? I think in almost all cases a discouraged override would be valid, even with manual edits like deleting a *OLD, changing a label, adding a new statement to a discouraged section, etc. and all we would need is just a quick sanity check by a human.

Presumably, our users are smart enough to do this. But we're all human and make mistakes, and Travis errors on a positive proportion of commits, which just goes to show that you can't assume any invariant is held unless you check it mechanically.

(Question: in the Travis script, what is the purpose of:

&& [ `egrep -q '?Error|?Warning' < mm.log; echo $?` -eq 1 ]

I'm not a bash expert. but this seems to do a test for an error or warning without doing anything with the result. I thought maybe it sets the script exit status, but that would be overridden by the next verifier run.)

After running "metamath | tee mm.log", which prints the command output live to the log so you can watch it while also printing to a file, I parse the output file looking for errors. grep will print the found lines to stdout (which I don't care about), and will report error code 0 on found matches, 1 on no matches, and 2 or other codes if there was an error (maybe IO problems). The "success" condition is error code 1, so I test $? after running egrep to get the error code. If both the original metamath command and the error check succeed, then there are no errors.

I'll go along with it if others want it. I am just wondering if it is overkill per my discussion above. It would be one more thing we have to maintain.

I thought about this. Another option is that Travis itself will commit the updated usage file if it sees a disconnect. This will free us from having to update it ourselves. I was going to say that this makes it too easy to override usage without noticing, but actually it would still be fairly obvious, since I read all the commits that come through anyway.

Mario

digama0 commented 8 years ago

(Question: in the Travis script, what is the purpose of:

&& [ `egrep -q '?Error|?Warning' < mm.log; echo $?` -eq 1 ]

I'm not a bash expert. but this seems to do a test for an error or warning without doing anything with the result. I thought maybe it sets the script exit status, but that would be overridden by the next verifier run.)

Also, I forgot to mention that each line in the Travis script is run separately, timed, and a nonzero error code halts the build as "failed". (That's why the error code does not need to be checked in the next line.) By contrast, if a step in the install section fails, the build is stopped as "errored" instead. The idea is that the script section checks the validity of set.mm itself (which is the content of the new commit), while problems in other steps of the process (like compiling metamath) are marked as build errors rather than failures.

nmegill commented 8 years ago

On 7/7/16 10:21 AM, Norman Megill wrote:

For master releases, I think that I will start to do a 'show restricted' comparison to the previous master release as part of my overall QA process (using the Travis log or by hand). I have doing that informally

Actually it would be by hand, since I wouldn't want to update the master until I inspect all 'discouraged' differences.

in the past every now and then, checking to see that axiomatic studies are isolated with 'show usage ax-meredith/rec' etc. Even with set.usage, I think inspection by someone like me who "knows" what the purpose is (especially since I added them in the first place) is a good idea, and by comparing master releases I can do everything at once instead of having to inspect every develop commit.

On 7/7/16 10:32 AM, Mario Carneiro wrote:

unless overridden. And if it is overridden, it was obviously done on purpose for some reason. "Discouraged" means discouraged, not prohibited, and we already have (in metamath.exe and hopefully other proof assistants eventually) mechanisms to block casual or accidental usage.

Indeed, this is a point against using "discouraged" in Travis at all. The main issue is that we don't require any particular authoring tool to be used, and Travis is a funneling point for any kind of edit to go through the same rigorous checks.

I am also thinking that we shouldn't put it in Travis. Checking master vs. previous master by hand isn't a big deal for me, and I would end up with more confidence than doing incremental develop checks where I may have missed one.

Norm

digama0 commented 8 years ago

Regarding larger comparisons such as master-to-master updates, if the set.usage file is in the repo then this can be done easily. If you take any two Travis-certified commits and compare them, the difference in set.mm will be what it always has, and the difference in set.usage will be the change in usage between the two versions.

One problem with usage comparisons of any sort is that theorem reordering will cause the usage file to also be reordered, which is a "spurious" usage change. This can be mitigated by alphabetizing the file (I'm sure there's a linux utility for that), although that wouldn't help against theorem renames.

nmegill commented 8 years ago

On 7/7/16 10:53 AM, Mario Carneiro wrote:

Regarding larger comparisons, if the set.usage file is in the repo then this can be done easily. If you take any two Travis-certified commits and compare them, the difference in set.mm will be what it always has, and the difference in set.usage will be the change in usage between the two versions.

One problem with usage comparisons of any sort is that theorem reordering will cause the usage file to also be reordered, which is a "spurious" usage change. This can be mitigated by alphabetizing the file (I'm sure there's a linux utility for that), although that wouldn't help against theorem renames.

I understand that. I don't think it will be a big problem (for me at least), although experience will tell. I think keeping set.usage synchronized with moves and label changes (without cheating by re-running 'show restrictions') sounds like a pain and would probably be more work overall.

Almost all of the 'discouraged' tags are in long-stable sections that are unlikely to get moved or have their labels changed. In any case, since I'm familiar with most of those sections, I'll probably recognize a move or label change easily.

For fun, I just compared the 'show restricted' output of a 5/22 set.mm and the current one. The output of set.mm diff had 146K lines, and the 'show restrictions' diff had 138 lines. All but one were due to new 'discouraged' tags I added (and their usages therefore appearing, causing most of the diff lines). There is a mystery one, mulge0OLD, whose proof got shorter in spite of a 'proof modification is discouraged' tag, and I'll have to do some research for that one to see what happened.

Norm

digama0 commented 8 years ago

On Thu, Jul 7, 2016 at 11:58 AM, nmegill notifications@github.com wrote:

On 7/7/16 10:53 AM, Mario Carneiro wrote:

Regarding larger comparisons, if the set.usage file is in the repo then this can be done easily. If you take any two Travis-certified commits and compare them, the difference in set.mm will be what it always has, and the difference in set.usage will be the change in usage between the two versions.

One problem with usage comparisons of any sort is that theorem reordering will cause the usage file to also be reordered, which is a "spurious" usage change. This can be mitigated by alphabetizing the file (I'm sure there's a linux utility for that), although that wouldn't help against theorem renames.

I understand that. I don't think it will be a big problem (for me at least), although experience will tell. I think keeping set.usage synchronized with moves and label changes (without cheating by re-running 'show restrictions') sounds like a pain and would probably be more work overall.

I guess I should say that I have zero intention of manually editing the file. For all I care it could be an opaque binary file (except that the diffs ought to be somewhat readable). Manually adjusting the file to match what it should look like "with cheating" is much too hard to get right that it is not worthwhile. The interaction is "change set.mm" -> "break Travis" -> "examine diff" -> "regenerate usage" -> "Travis is happy" (or "change set.mm" -> "break Travis" -> "Travis fixes itself" with automatic Travis commits, with "examine diff" in there concurrently).

For fun, I just compared the 'show restricted' output of a 5/22 set.mm and the current one. The output of set.mm diff had 146K lines, and the 'show restrictions' diff had 138 lines. All but one were due to new 'discouraged' tags I added (and their usages therefore appearing, causing most of the diff lines). There is a mystery one, mulge0OLD, whose proof got shorter in spite of a 'proof modification is discouraged' tag, and I'll have to do some research for that one to see what happened.

That was me, around the time when adding algebra deduction theorems. The old version was deprecated because of its hypothesis order, not because it had an interesting proof, and I just referred it to the newer proof.

Mario

sorear commented 8 years ago
&& [ `egrep -q '?Error|?Warning' < mm.log; echo $?` -eq 1 ]

Can't this be better written as:

&& ! grep -qE '?Error|?Warning' mm.log

There's no need to drag $? into this, plain old status negation works; grep takes a filename as an argument; and egrep is vaguely deprecated in favor of grep -E

digama0 commented 8 years ago
david-a-wheeler commented 8 years ago

YAML treats "!" specially, so you sometimes have to quote. I did that elsewhere in the same file.

If grep errors for any reason there's a problem, so I think it's fine to fail then.

"egrep" isn't in the POSIX spec; "grep -E" is in POSIX spec. So I'd use "grep -E" instead.

digama0 commented 8 years ago

On Thu, Jul 7, 2016 at 8:00 PM, David A. Wheeler notifications@github.com wrote:

If grep errors for any reason there's a problem, so I think it's fine to fail then.

Right, that's the problem. If grep fails with error code 2, then "! grep" will return error code 0 which is a success, not a failure. (Ideally this should fail as "errored", because the grep error is probably not related to the contents of set.mm, but it is important to never mistakenly mark a failure as success.)

nmegill commented 8 years ago

On 7/7/16 10:32 AM, Mario Carneiro wrote:

On Thu, Jul 7, 2016 at 9:43 AM, nmegill notifications@github.com wrote:

BTW I used "restricted" instead of "discouraged" because at the time I wasn't sure that "discouraged" would be the final tag keyword we would agree on. Without too much trouble, I can change it to "discouraged" everywhere, so the command would be "show discouraged" with output "SHOW DISCOURAGED: ..." etc. What do you think?

Seems reasonable.

This is now done in metamath.exe version 0.132. I changed the text in http://us2.metamath.org:88/mpegif/mmnotes.txt entry of 11-May-2016 to reflect all of the changes.

digama0 commented 7 years ago

Addressed in pull request #91, merged in 7414a93.