Open reitzig opened 9 years ago
Yes, constructing a try is certainly one way to compress this. Should have thought of that.
Given a trie you can indeed get a regular expression in a straight forward fashion by applying
regify(a[]) = a
regify(a[c_1, ..., c_k]) = a( (regify(c_1)) | ... | (regify(c_k)) )
Given that Mathematica code for constructing tries seems to be available, how hard is it for you to try how well this works (in terms of compression)?
(After gzipping the effect may be zero since gzip is likely to exploit lots of patterns in the original, too. See also here.)
By the way, that is not the same as finding the minimal DFA. For instance, SomeFunction
and AnotherFunction
get independent chains in a trie, but share (at least) a subautomaton for Function
in a minimal DFA.
Right, definitely not minimal, but smaller. Merging the suffixes (Function
is a good one), in addition to compressing non-branching portions of the tree while not likely to be minimal, may be good enough.
@reitzig The big advantage of the verbose, non-reduced form is that
In addition to that, the current version seems to run smoothly on our site (and on the Wolfram site, which uses my highlighter too!). I never heard of the argument that the file-size of the highlighter is the reason that the Mathematica language is not installed network-wide. Can you give sources for this?
If this is indeed the case, than one could easily try to construct a reasonably reduced form. It would definitely be handy that Mathematica has so many prefixes it shares among many functions. On the other hand, one should not forget that a highlighter with over 5000 function, even in packed form, will be larger by magnitudes than all other language-highlighter.
@halirutan ad 1: Of course you'd want to implement an algorithm that gets you from the list of names to a more compact representation. That one can be re-run after every update.
ad 2: If you implement said algorithm correctly, you can check the list version and trust that it's included in the compact representation as well.
ad sources: From the post I linked above:
We will not include it on Stack Overflow, only here on mathematica.se. There are several reasons for this. [...] It is huge. [...] For perspective: If we included it in our regular prettify JavaScript file (which is served to almost all of the millions of visits that Stack Overflow gets every day), it would more than triple that file's size. And that's after gzipping.
Yes, it'll probably not become as compact as more simple token specifications.
(That's another line of thought entirely: why not go with a compact approximation? Say, make all tokens that start with a capital letter keywords. That won't allow to distinguis between real keywords and those in namedCharactersArray
; would that be an issue? If so, one can also say "everything that starts with a capital and is not in namedCharactersArray
".)
@reitzig
Of course you'd want to implement an algorithm that gets you from the list of names to a more compact representation. That one can be re-run after every update.
What I meant is that I don't have to do any additional work :-)
About ad 2: What happened in the past was that people found missing/additional symbols while writing up posts. They were then just coming here to look whether I have forgotten them. But you are right, this shouldn't be a concern here.
The link to my feature-request post on the other hand clearly indicates (and that was how I remembered it) that size is only the 3rd point. More important is that they will never go along with our coloring-scheme. And we won't step back from that even a tiny bit, because having colors that mimic the Mathematica front end is important to many users.
Especially, @balpha said about the size
it would more than triple that file's size. And that's after gzipping.
Have you tried how good gzip is at compressing the a|ab|ac|...
style of code? I suppose it does already a good job removing redundancy and that optimizing the list of function-names will not be enough to get an acceptable filesize for SO.
That being said, let me make 2 suggestions:
ad colors: I thought that having a branch with SO-compatible colors would be easy, and the result so much preferable to default highlighting that it would be worth the effort (assuming size was dropped).
ad size: yes, I suspect that gzip does most of what is possible. That said, it's "dumb" in the sense that it can not exploit deeper structure in the input (cf compression by formal grammars). So it's not impossible to get a smaller result by doing higher-level preprocessing. If I were to guess, though, I'd say that a list of words like this one here is close to the best case for gzip and such.
ad next steps: I don't see myself in a position to actively contribue. I only had the idea. Sorry.
I note that the regular expression created by your Mathematica function is verbose (and likely inefficient to match against) since it is just a big disjunction of the individual tokens.
Obtaining minimal DFA from regular expressions is hard (i.e. has prohibitive cost) in general as the NFA -> DFA step may blow up the size of the automaton, but it may work out here.
Plus, it's not clear what size the regular expression we get by one of several algorithms will be.
Worth a try, anyway, I think.
Motivation: Mathematica highlighting has not been included into the SE-wide highlighting solution because of the size of the file(s). If they can be shrunk, chances are we'd be able to use Mathematica highlighting on other sites as well.