sorear commented 8 years ago

Synopsis

Create tools to convert between a multi-file representation of the database and a single-file representation. Ensure they can round-trip, including after a WRITE SOURCE from metamath.exe.

Convert this repository to the multi-file format, with a filter-branch to avoid any large objects in the history.

Rationale

Most style guides recommend keeping source code files in computer programs under 1000–5000 lines, for industrial-psychology reasons. set.mm is not really a computer program and those rules are not applicable as such, but it's similar enough to one that we want to use many of the same tools, including git, git-based change visualizers, and programmer-oriented text editors. However, those tools are optimized for a world where "large projects" like the Linux kernel consist of many relatively small files (v4.6 has 22787 *.c files, median 354 lines, 99th percentile 4836 lines). Some of them can handle a single 29MB file just as easily; others not so much.

Also, we are moving deeper and deeper into a world where individual developments don't depend on the whole available database. Tooling today struggles to maintain responsiveness with the whole database, and this will only get worse. We need to be able to point tools at subsets of the database, and the most natural way to do that (IMO) is by pointing tools at a subset of the files.

Since the biggest (current!) problems with the large file are related to the git repository, any solution must involve converting the git repository. Additionally, backdating the conversion will make initial clones much happier.

What the database will look like

We already have a facility in the Metamath language to spread a database across several files, the file-inclusion mechanism, and I want to minimize the amount of ecosystem fragmentation this change creates so I will be using that same mechanism. There are two ways that can look:

Dependency graph:

$( basics.mm $)
...
$( topology.mm $)
$[ basics.mm $]
...
$( polynomials.mm $)
$[ basics.mm $]
...
$( polynomials_are_continuous.mm $)
$[ topology.mm $]
$[ polynomials.mm $]
...

"Project file"

$( basics.mm $)
...
$( topology.mm $)
...
$( polynomials.mm $)
...
$( polynomials_are_continuous.mm $)
...
$( top.mm $)
$[ basics.mm $]
$[ topology.mm $]
$[ polynomials.mm $]
$[ polynomials_are_continuous.mm $]

The "dependency graph" approach makes it easier to work with subsets, but "project file" is better suited to automatic round-trip conversion. I'll probably start with the latter then make another ticket later to switch to the first.

Tooling requirements

Some tools will prefer the single file. Some users of tools that are themselves neutral prefer the single file. Some tools presently require the single file; in particular, the only tool that can currently make automated semantic changes to Metamath databases, metamath.exe, supports reading collections of files but will always write as a single file.

To support both use cases, tools will be created to convert back and forth between single-file and multi-file. Actually, we already have a splitter program, but it got little buy-in. I assume the reason for that is the difficult Node.js/Babel.js installation story, and a SMM3-based converter which could be distributed as a single statically-linked binary might be better received? (Metamathicians, is this a fair assessment?)

When?

That's the question. This ticket exists so that we can determine when the tooling requirements are met and coordinate a time for the mass filter-branch and force push.

digama0 commented 8 years ago

Regarding the splitter program, yes, installation has been a pain point. I know for sure that if you could distribute pre-built executables like metamath.exe it would make it a lot more attractive for the less computer-literate contributors (which is important for a utility that will need to be run by all major contributors).

david-a-wheeler commented 8 years ago

A potential advantage of splitting the database is that it might make it easier to share common theorems even when there are differences in the underlying axioms. Most work has gone into set.mm (the Metamath Proof Explorer). However, there is a significant amount of overlap in what can be proven using the New Foundations Explorer, HOL, and Intuitionistic Logic Explorer - they all, for example, allow numbers. This might be a little complex in terms of figuring out how to split up the files, though.

david-a-wheeler commented 8 years ago

I would strongly prefer the "Project file" form, especially since I think that the same file might be included by different "top levels" (e.g., classic vs. intuitionist logic).

Splitting up the files would have an advantage in terms of speeding "git clone". I just did one and it takes a while, because git notionally stores each file variation separately (there's some optimization, but nothing is as efficient as "file unchanged, so I send nothing").

jkingdon commented 6 years ago

Has there been any further thinking? Doing this for set.mm affects a lot of people, but I've been thinking that for iset.mm maybe I should just go for it. That is, (a) split iset.mm into separate files (which would probably only be about 4 files because it is much smaller than set.mm), (b) install the splitter program and use that when I write files from metamath.exe (which is mostly when I rewrap comments currently, but that's probably not the only situation where I might want to do this). I wouldn't bother with any git filter-branch type stuff at least for now (partly because iset.mm is so much smaller than set.mm).

I guess that would affect me and @nmegill (whose workflow to generate the website might be affected?) and of course any other (hypothetical/future at the moment) contributors to iset.mm.

nmegill commented 6 years ago

I am planning to change metamath.exe to give 'write source' an option to produce split output files. Essentially you will be able to go back and forth between a single file or split files depending on your editing preference. For the single file version, some new markup will have to be defined that indicates the end of an included section. I hope to do this in a couple of months.

On a personal note, many years ago I split up set.mm and really disliked working with several files, especially for global edits. So after a month or so I went back to a single file and will likely continue to do so, even for set.mm. I find it nice to have everything in one place instead of having to remember which section something was in.

But I know some people want split files, and the metamath.exe enhancement should make it equally easy to use either approach. 'read' will read either split or single file as it does now, and 'write source' will produce whichever source type you prefer with a qualifier. For 'write source', the master or starting file name like set.mm will be user-specified as it is now, and the included files will be given the names specified in their inclusion directives.

Norm

On 12/8/17 12:21 PM, Jim Kingdon wrote:

Has there been any further thinking? Doing this for set.mm affects a lot of people, but I've been thinking that for iset.mm maybe I should just go for it. That is, (a) split iset.mm into separate files (which would probably only be about 4 files because it is much smaller than set.mm), (b) install the splitter program and use that when I write files from metamath.exe (which is mostly when I rewrap comments currently, but that's probably not the only situation where I might want to do this). I wouldn't bother with any git filter-branch type stuff at least for now (partly because iset.mm is so much smaller than set.mm).

I guess that would affect me and @nmegill https://github.com/nmegill (whose workflow to generate the website might be affected?) and of course any other (hypothetical/future at the moment) contributors to iset.mm.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/metamath/set.mm/issues/29#issuecomment-350320155, or mute the thread https://github.com/notifications/unsubscribe-auth/AMKLqKno1w8f4IH3d2Cwg9zypRNJVviCks5s-XAQgaJpZM4I-f4K.

jkingdon commented 6 years ago

Hmm, part of why I'm interested in multiple files is that I wonder whether it will make it easier to organize theorems. For example, files for ordinals and cardinals, graph theory, and analysis, which depend on basic set theory but which don't depend on each other. But that's not a detailed plan, just sort of a hunch. I see what you are saying about how (at least with some tools) having everything together can be convenient.

Anyway, the plan about being able to WRITE SOURCE in either format makes sense to me. Thanks for the update on the plans here.

nmegill commented 6 years ago

I think I understand what you are saying. The Metamath language spec has a feature that should make this easier, which is that it silently ignores subsequent occurrences of the same included file. So graph.mm and analysis.mm could each have the includes $[ prop.mm $], $[ pred.mm $], $[ basicset.mm $] so they can be worked on independently. The master set.mm (in the split file version) could have the includes $[ prop.mm $], $[ pred.mm $], $[ basicset.mm $], $[ graph.mm $], $[ analysis.mm $] without conflict. Alternately, the master set.mm could just have the includes $[ graph.mm $], $[ analysis.mm $], and depend on them to load the earlier stuff.

We could also define say basic.mm consisting of includes $[ prop.mm $], $[ pred.mm $], $[ basicset.mm $], then graph.mm and analysis.mm would only need the include $[ basic.mm $]. This approach would be advantageous for mathboxes so we don't have to repeat 2 dozen includes in each mathbox.

Norm

On 12/9/17 2:48 AM, Jim Kingdon wrote:

Hmm, part of why I'm interested in multiple files is that I wonder whether it will make it easier to organize theorems. For example, files for ordinals and cardinals, graph theory, and analysis, which depend on basic set theory but which don't depend on each other. But that's not a detailed plan, just sort of a hunch. I see what you are saying about how (at least with some tools) having everything together can be convenient.

Anyway, the plan about being able to WRITE SOURCE in either format makes sense to me. Thanks for the update on the plans here.

david-a-wheeler commented 6 years ago

In the long run I think splitting up the files would be best (git would like it!), but I agree that improvements would make that much more pleasant. Perhaps after metamath.exe is improved, we can split up iset.mm first as a test case first (since it's smaller).

metamath / set.mm

Multi-file transition: making it happen #29

Synopsis

Rationale

What the database will look like

Tooling requirements

When?