Closed tirix closed 2 years ago
@digama0 asks about the knapsack algorithm:
Do the generated proofs have a ragged right side? MMJ2 is using a knapsack algorithm to optimize the packing of the labels and make them look more "neat", and IIRC metamath.exe copied that implementation.
metamath-knife does implement a knapsack algorithm. The proof generated for ~ ellimc fits nicely the right margin:
( co wcel cv wceq vy cfv cif cmpt cc wss syl3anc climc csn cun ccnp wf wa
limcfval simpld eleq2d wi wb limcvallem mpteq2dv cab ifeq1 syl6eqr eleq1d
elab3g syl bitrd ) AEFDUGPZQEBCDUHUIZBRZDSZTRZVIFUAZUBZUCZDHIUJPUAZQZTUTZ
QZGVOQZAVGVQEAVGVQSZVGUDUEZACUDFUKZCUDUEZDUDQZVTWAULMNOTBCDFHIJKUMUFUNUOA
VSEUDQUPZVRVSUQAWBWCWDWEMNOBCDEFGHIJKLURUFVPVSTEUDVKESZVNGVOWFVNBVHVJEVLU
BZUCGWFBVHVMWGVJVKEVLVAUSLVBVCVDVEVF $.
However the proof generated for ~ numclwlk1lem2f1 does not:
( wcel cfv co cmin wceq wi wa c1 adantl cusg wbr c3 cuz w3a c2 cop cnbgra
cxp wf vp cv vu weq wral numclwlk1lem2f wf1 cc0 csubstr numclwlk1lem2fv
imp adantrr adantrl ovex fvex cword chash eqeq12d opth uzuzle23
numclwwlkovgelim syl3an3 clsw clt wb simpll simprll ad2antrl eleq1 eqcoms
cz cle eluz2 1red 3re a1i zre 3jca 1lt3 ltletr expd mpisyl 3adant1 sylbi
syl6bi com12 3ad2ant3 ad3antlr 2swrd2eqwrdeq syl3anc cr impcom eqtr3
expcom adantr biantrurd 3anan12 eqeq2 oveq1 fveq2d eqeq1d biimpd eqeq1
sylan9eq ex eqtr4d opeq2d oveq2d lsw eqeq2d mpbird eqeqan12d anbi12d
3bitr2d exbiri syl2and syl5bi sylbid ralrimivva dff13 sylanbrc ) JFUEUFZK
JPZIUGUHQZPZUIZKIHRZKIUJSRZGRJFUKKULRUMZDUNUOUPZDQZUQUPZDQZTZUOUQURZUAZUQ
UUAUSUOUUAUSUUAUUCDVAABCDEFGHIJKLMNOUTYTUUJUOUQUUAUUAYTUUDUUAPZUUFUUAPZUB
ZUBZUUHUUDVBUUBUKZVCRZIUCSRZUUDQZUKZUUFUUOVCRZUUQUUFQZUKZTZUUIUUNUUEUUSUU
GUVBYTUUKUUEUUSTZUULYTUUKUVDABCUUDDEFGHIJKLMNOVDVEVFYTUULUUGUVBTZUUKYTUUL
UVEABCUUFDEFGHIJKLMNOVDVEVGVLUVCUUPUUTTZUURUVATZUBZUUNUUIUUPUURUUTUVAUUDU
UOVCVHUUQUUDVIVMYTUUMUVHUUIUAZYTUUKUUDJVJZPZUUDVKQZITZUBZVBUUDQZKTZUUBUUD
QZUVOTZUBZUBZUULUUFUVJPZUUFVKQZITZUBZVBUUFQZKTZUUBUUFQZUWETZUBZUBZUVIYSYP
YQIUJUHQPZUUKUVTUAIVNZABCUUDEFGHIJKLMNVOVPYSYPYQUWKUULUWJUAUWLABCUUFEFGHI
JKLMNVOVPYTUVTUWJUBZUUIUVHYTUWMUBZUUIUVLUWBTZUUDVBUVLUJSRZUKZVCRZUUFUWQVC
RZTZUWPUUDQZUWPUUFQZTZUUDVQQZUUFVQQZTZUIZUBZUXGUVHUWNUVKUWAUCUVLVRUFZUUIU
XHVSUVTUVKYTUWJUVKUVMUVSVTWBUWMUWAYTUVTUWAUWCUWIWAUDUWMYTUXIUVMYTUXIUAUVK
UVSUWJYTUVMUXIYSYPUVMUXIUAYQUVMYSUXIUVMYSUVLYRPZUXIYSUXJVSIUVLIUVLYRWCWDU
XJUGWEPZUVLWEPZUGUVLWFUFZUIUXIUGUVLWGUXLUXMUXIUXKUXLUXMUXIUXLUCXEPZUGXEPZ
UVLXEPZUIZUCUGVRUFZUXMUXIUAUXLUXNUXOUXPUXLWHUXOUXLWIWJUVLWKWLWMUXQUXRUXMU
XIUCUGUVLWNWOWPVEWQWRWSWTXAWTXBXFUUFJUUDXCXDUWNUWOUXGUWMUWOYTUVTUWJUWOUVN
UWJUWOUAZUVSUVMUXSUVKUWJUVMUWOUWDUVMUWOUAZUWIUWCUXTUWAUVMUWCUWOUVLUWBIXGX
HUDXIWTUDXIVEUDXJUWNUXGUXCUWTUXFUBZUBZUYAUVHUXGUYBVSUWNUWTUXCUXFXKWJUWNUX
CUYAUWMUXCYTUWMUXAKUXBUVTUXAKTZUWJUVNUVSUYCUVMUVSUYCUAUVKUVSUVMUYCUVPUVRU
VMUYCUAZUVPUVRUVQKTZUYDUVOKUVQXLUVMUYEUYCUVMUYEUYCUVMUVQUXAKUVMUUBUWPUUDU
UBUWPTIUVLIUVLUJSXMWDXNXOXPWTWSVEWTUDVEXIUVTUWJUXBKTZUVNUWJUYFUAZUVSUVMUY
GUVKUVMUWJUYFUVMUWJUXBUWGKUVMUWPUUBUUFUVLIUJSXMZXNUWIUWGKTZUWDUWHUWFUYIUW
HUWFUYIUWFUYIVSUWEUWGUWEUWGKXQWDXPXFUDXRXSUDXIVEXTUDXJUWNUWTUVFUXFUVGUWMU
WTUVFVSZYTUVMUYJUVKUVSUWJUVMUWRUUPUWSUUTUVMUWQUUOUUDVCUVMUWPUUBVBUYHYAZYB
UVMUWQUUOUUFVCUYKYBVLXBUDUWMUXFUVGVSYTUVTUWJUXDUURUXEUVAUVNUXDUURTUVSUVKU
VMUXDUVLUCSRZUUDQUURUUDUVJYCUVMUYLUUQUUDUVLIUCSXMXNXRXIUWDUXEUVATZUWIUWDU
YMUXEUWBUCSRZUUFQZTZUWAUYPUWCUUFUVJYCXIUWCUYMUYPVSUWAUWCUVAUYOUXEUWCUUQUY
NUUFUUQUYNTIUWBIUWBUCSXMWDXNYDUDYEXIYFUDYGYHYHYIYJVEYKYLYMUOUQUUAUUCDYNYO
$.
It seems that metamath-knife only runs the knapsack algorithm on a subset of the labels. First on the first twenty, then on the next hundred. I believe this is a trade-off to avoid shuffling too much the statements, which are ordered by how often they are referenced later in the proof. This is important for the first twenty statements, which only need one letter. That may limit the algorithm.
Also, in this particular example, there are very long labels, like ~ 2swrd2eqwrdeq, which probably also make the knapsack's algorithm's work harder.
There must be something else beyond the knapsack algorithm, though. With the proof of ~ ellimc, for which the label list is correctly packed (but ordered differently), the encoded proof is 4 characters longer than the original proof in set.mm.
In my experiments, I let metamath-knife read and parse the original proof, but it then rebuilds the compressed proof from scratch.
I believe it would be a good idea to copy the algorithm from mmj2/metamath, mostly for byte-reproducibility rather than efficiency per se.
IIRC the sorting is limited to within groups 1-20, 21-120, etc as you say, since we don't want to make the compressed text longer either. That doesn't explain the result of numclwlk1lem2f1 though, because there are 39 words before the big word numclwwlkovgelim
that causes the gap, and you can clearly do better by shuffling some of the other words around it (single big words aren't usually a problem, you usually only get bad results when there aren't enough small words). In fact, from these examples I would be more likely to guess that it is a simple greedy algorithm, and ellimc just got lucky. Big words causing a ragged right is a characteristic of the greedy algorithm.
The metamath-knife program now produces correct proofs, however it seems their size is a bit longer than compressed proofs generated by MMJ2 or the metamath program.