unparser? - Githubissues

grosu commented 9 years ago

@dwight, why is the "unparser" blocking us? @cos told me that you told him so. @msaxena2 said that he us already using an unparser in his debugger that you implemented, and that works ok for now.

Our main priority is to make the examples in the distribution all work. Yes, for some of them we do have some configuration fragments in the .out files, but not for many. Also, we do have a way to ignore the whitespaces and the brackets, so the exact pretty-printing should not be an issue. In fact, I believe that we should start with a naive but correct and fast pretty-printer (i.e, the pretty-printed program to parse to the same AST), and use that until we migrate all the tutorial examples, and maybe even after (in the .out files).

dwightguth commented 8 years ago

For reference, here is the comparison of the output of executing collatz.imp before/after the switch to KORE:

Before:

<T>
    <k>
        .K
    </k>
    <state>
        m |-> 2
        n |-> 1
        q |-> 1
        r |-> 3
        s |-> 66
    </state>
</T>

After:

<T> <k> .::K </k> <state> s |-> 66 ` r |-> 3 ` q |-> 1 ` n |-> 1 m |-> 2 ` ` ` </state> </T>

You may decide for yourself whether you think doing something about this should be a high priority task, but I will note that the indentation in particular will be significant if we want the output to be remotely readable for large configurations.

grosu commented 8 years ago

@dwightguth , as development leader of K, you are the best person to understand the big picture of the transition to K 4.0. I would be interested in seeing all the major steps left in one place, for example using the milestone https://github.com/kframework/k/milestones/Update%20Tutorial%20for%20KORE that @bmmoore started. Can we add all the required steps to transit, as separate tasks? If that is done, then we can assign people to them.

Regarding the unparser, the output you were showing is actually even wrong, isn't it? Can't we have a trivial but correct pretty-printer as start, and then worry about the fancy unparser later, once everything works? The unparsing problem needs to be studied properly, discussed in comparison to existing approaches, designed, and then implemented. I just don't want @cos to spend a couple of weeks working on something adhoc by translation to Keama, which will then be thrown away (that's what happened each time we reduced our problems to existing tools).

Regarding a trivial but correct pretty-printer, that is something that we want to have anyway in the final system, don't we? People will want to use that whenever the smart unparser fails to do the job well, but still don't want to use the uglier KAST notation. Such a pretty printer would basically traverse the KAST and print things using concrete syntax with all the brackets. Some simple optimizations may be considered, such as: use user-defined brackets for the syntactic categories they defined brackets for; avoid brackets around constants (?).

In fact, I have no idea even if the two "trivial" optimizations above are correct. @bmmoore showed at some moment that the unparsing problem was very hard, harder than parsing. Hopefully the trivial pretty-printing problem is not that hard. But since this is needed anyway, and since it is much simpler than unparsing and still not totally trivial, my suggestion would be to do this first. Then modify the .out files to use it as we make the tutorial examples work. There aren't too many of them using actual K terms, most of them use the stdout.

grosu commented 8 years ago

For the design of the unparser, please study the Maude format attribute, too: http://maude.cs.uiuc.edu/maude2-manual/maude-manual.pdf. I don't think we should just implement the unparser. We should first design it, discuss the tradeoffs and the algorithms, and then implement it.

This is about the unparser; regarding the raw pretty printer, I think we should just implement it asap, in case it is not implemented.

dwightguth commented 8 years ago

Note that one thing that must be done as part of this is the coloring support, and the resulting influence of the --color flag to krun.

dwightguth commented 8 years ago

@grosu I am not 100% sure that I have caught everything that needs to be handled but the milestone should now contain all the major features that I expect to be necessary. Anyone else can add stuff they think is still missing.

grosu commented 8 years ago

Getting back to the minimal pretty-printer: @cos, @bmmoore , can you please remind everybody what we agreed in the K meeting that we should do here? Basically, all we want is a minimal (in terms of intelligence) but correct pretty-printer for now, despite the fact that will end up putting lots of brackets and the fact that will be incorrect in some rare, artificial cases. Like with other things, we will focus on a powerful solution once we migrate with all the test passing.

bmmoore commented 8 years ago

For the big picture, we agreed not to worry too hard about exotic sources of unsoundess, and instead implement a basic algorithm and guarantee soundness by checking that the result reparses, falling back to KAST output in the hopefully rare case that it fails. The specifics of how to unparse were not as settled, but the very first step was a stupid unparser that puts brackets anywhere possible, and then the next step was to find standard algorithms that take advantage of precedence information within a single AST but don't worry about potential ambiguity with trees of completely different shape (like https://www.cs.tufts.edu/~nr/pubs/unparse-abstract.html , except that doesn't handle "mixfix", or what @cos mentioned of Kiama having an unparser).

dwightguth commented 8 years ago

The standard algorithm you are talking about is actually already implemented. What is needed from the unparser at this point is the stuff that deals with line breaks and indentation (ie, the format attribute).

On Fri, Sep 4, 2015 at 10:47 AM, bmmoore notifications@github.com wrote:

In general, we agreed not to worry too hard about exotic sources of unsoundess, and instead implement a basic algorithm and guarantee soundness by checking that the result reparses, falling back to KAST output in the hopefully rare case that it fails. The specifics of how to unparse were not as settled, but the very first step was a stupid unparser that puts brackets anywhere possible, and then the next step was to find standard algorithms that take advantage of precedence information within a single AST but don't worry about potential ambiguity with trees of completely different shape (like https://www.cs.tufts.edu/~nr/pubs/unparse-abstract.html , except that doesn't handle "mixfix", or what @cos https://github.com/cos mentioned of Kiama having an unparser).

— Reply to this email directly or view it on GitHub https://github.com/kframework/k/issues/1662#issuecomment-137773004.

grosu commented 8 years ago

But is it correct? The output your showed above does not seem correct, it has two unmatched backticks.

We should just go with this trivial unparser then for now in the tutorial. The same way we used to ignore the matched parentheses before in the .out files, we should now have an option of ktest to ignore the backticks. Maybe also the brackets. Anyway, what we want is to avoid having to modify the output files in the tutorial each time we modify the pretty printer. We can and should, of course, test the pretty printer separately.

dwightguth commented 8 years ago

It actually parses correctly. You are simply reading the backticks incorrectly (which is, sidebar, the fear that I had which made me not really like the backtick syntax in the first place).

That said, the reason the backticks exist at all even though they should not be needed is that collections are left associative but the Java backend is generating right-associative terms. This has nothing to do with the unparser though; it is something for @andreistefanescu to fix.

I would suggest though that instead of reduplicating the old hack, we simply fix the collections so that the backticks are no longer required. We have had issues in the past that arise ultimately from the fact that we are ignoring brackets in ktest, I would prefer to get rid of that hack if we can.

grosu commented 8 years ago

Are we allowing a whitespace alone inside a backtick bubble? Is that what I'm missing, the fact that the last two backticks just surround a white space?

collections are left associative but the Java backend is generating right-associative terms

Or use and take the assoc attribute into account, because we want collections to be associative (in order to match terms in the middle of them.

I would prefer to get rid of that hack if we can.

Me, too, but I would still like, at the same time, to not have to modify the .out files each time we modify the unparser (more or fewer brackets, adding or removing whitespaces or formatting, etc.). How about this. Let's give ktest the option to compare outputs at the AST level, which means that in that case it would parse the output with the configuration parser and then compare the ASTs. Would this work?

dwightguth commented 8 years ago

You are actually missing something else, namely, that the first three backticks in the term are all begin-brackets, and the last three are all end-brackets. The fact that these aren't separate characters is what is introducing the ambiguity you see, it's just that the parser is smart enough to resolve them correctly.

As for the assoc attribute, you are right, we can probably skip brackets for cases where the production has that attribute, I didn't think of that.

As for parsing as ASTs, this doesn't really work because what about search results or other types of atypical .out files? What about .out files that mix program output with the output of krun?

I actually maintain we should modify the .out files every time we change the unparser because otherwise we run the risk of missing regressions if a particular necessary bracket for unparsing gets unexpectedly dropped. I have actually run into cases where this was occurring and we didn't catch it until later.

grosu commented 8 years ago

You are actually missing something else, namely, that the first three backticks in the term are all begin-brackets, and the last three are all end-brackets.

I missed that because I could not believe it that it happened. It is actually wrong. That is not allowing us to do nested grouping, which is necessary. Please read this again:

https://github.com/kframework/k/wiki/!!!-Module-System#k-code-grouping

So we need to modify this simple pretty-printer. As it stands, it probably does not even satisfy the basic property parse(print(AST)) = AST.

grosu commented 8 years ago

I actually maintain we should modify the .out files every time we change the unparser because otherwise we run the risk of missing regressions if a particular necessary bracket for unparsing gets unexpectedly dropped. I have actually run into cases where this was occurring and we didn't catch it until later.

This sounds reasonable for now, I am OK with it. I hope ktest still has that feature that allows us to rewrite a bunch of .out files automatically.

dwightguth commented 8 years ago

See and this is why I really don't like the backtick syntax in the first place, because it requires context-sensitive parsing to parse. If you want to modify the pretty-printer to fix this case, when we better modify the parser to fix it too, only then K has a context-sensitive grammar and will require a context-sensitive parser to parse, which we don't have and would be way too slow even if we did have it. Why can't we just have matching characters for the brackets like any sane programming language?

cos commented 8 years ago

I agree @dwightguth in principle -- matching brackets are much easier to follow and, as UTF8 is prevalent now, we could even use some funky but good-looking brackets to not clash with the user language (e.g., ⦑ ⦒) and have something uncommon (e.g., <. .>) as the ASCII alternative.

In practice, we already have something that is working in most (if not all) cases, so it might be a good idea to stick to it, at least for now.

grosu commented 8 years ago

You guys should have fought for this when the !!! Module System page was written and asked for comments. I think I made it quite clear that these pages starting with !!! become final design documents, that will become part of the manual and need to be implemented as such. I just resent the message stating that to the k-list.

I am not that concerned about parsing the backtick bubbles. After all, it is just a language construct and since K is supposed to define any language, we will have to find a mechanism to parse it properly anyway eventually.

One way to do it is to parse it in stages: first use a grammar that only parses the top-level bubbles, then another grammar to parse the rest, then use the parser again recursively to parse the bubble contents, and so on and so forth. Many languages are not context free and we will have to eventually parse them and give them semantics. Well, let's start with K itself then.

Another way to do it, suggested by @bmmoore to stay context-free, is to actually define a fixed number of brackets of the form

  syntax K ::= "`" K "`" [bracket] | "``" K "``" [bracket] | "```" K "```"  [bracket] | ...

and to have the ones with more backticks have priority over the ones with fewer backticks. I am OK with this as a temporary tradeoff, so we do not get stuck with parsing issues before we release K4.0. And yes, as @dwightguth pointed out to me, this will allow undesired nested bubbles with more backticks inside, such as stuff1stuff2stuff3, and ambiguous terms such as stuff1` stuff2 `stuff3, when each of stuff1, stuff2 and stuff3 parses fine by itself. But we can deal with these in a post-parsing processing, and even give nice error messages to people.

grosu commented 8 years ago

Here is another solution, perhaps the simplest of them all. Let us only define one bracket, with single backticks,

  syntax K ::= "`" K "`"  [bracket]

and then, when parsing with ambiguities, let us give priority to outermost parses of the form `_(_(stuff))`. We may want to do that in general with all brackets, in which case we are done with this discussion for good.

bmmoore commented 8 years ago

@cos matching brackets might be nice, but even just correctly implementing the decreasing number of backticks would make an immense improvement in readability:

@dwightguth we specifically settled on doing tricky things with backticks to avoid clashing with the concrete syntax of "saner" languages. Ascii only has <>[](){} for paired characters, all widely used. As far as parsing with ordinary tools goes, counting brackets is actually much easier than resolving ambiguities, it just takes a simple lexer hack to work in flex+bison.. Also, it's only kast that we really expect people to independently implement parsers for, and there brackets are only ever necessary to delimit the scope of rewrites, so we don't need nesting or a fancy parser at all.

@grosu That's an interesting suggestion. A more complicated syntax is needed to prevent backtics with whitespace between them from being counted together, and backtics without whitespace from being split apart into separate groups, but existing features are enough (we'd need an auxiliary sort and some regex terminals with lookaround assertions).

I really like the idea of using unicode brackets here. I think it's entirely reasonable to have an option or default to use unicode in our command-line output, and handle it in parser as input. I'd rather we avoid requiring unicode support from people writing external tools that handle kast, but the backtics are easy to handle in that limited setting, and this sort of mixed K+concrete syntax already requires the full power of our parser for the concrete syntax bits, and also output meant for program transformation needs to entirely avoid K brackets however pretty they are, so I don't see any harm in the suggestion. Here's a nice table of unicode brackets: http://stackoverflow.com/questions/13535172/list-of-all-unicodes-open-close-brackets . @radumereuta, does the parser correctly handle unicode? It seems to work, but I'm not sure whether that's just a result of treating UTF-8 as a stream of bytes.

grosu commented 8 years ago

I would prefer to not give unicode any special treatment and certainly not depend on people knowing about it or using it in order to use K. We have to be able to give languages using unicode a syntax and semantics just like for any language, so reserving special unicodes for K has the same conceptual problem as reserving normal characters. This just complicates our existence right now and slows us down with our transition.

We really need to move on with this. So the simplest solution that works OK should be fine for now.

So we already have

  syntax K ::= "`" K "`"  [bracket]

right? Let's just keep it as is then. Also, we have brackets for many builtins, like INT, BOOL, etc., and the pretty printer uses the provided bracket is available instead of backticks, right? To avoid seeing backticks in outputs, let us then just add brackets to maps, lists and other collections. This should be really trivial to do and solves our problem for now.

cos commented 8 years ago

@grosu, unicode brackets have one less conceptual problem than ASCII ones: most languages reserve all ASCII brackets. No (afaik, I'm not talking fringe) language reserves all unicode characters -- most don't reserve any. Even in the future, when languages will embrace unicode more, I don't see why a practical language would want to reserve all unicode brackets.

This means that even with user-configurable brackets, unless we have multi-chracter brackets, ASCII will always clash with the user language. User-configurable brackets, on the other hand, should work in all cases.

That being said, I don't think this is our most important issue now so, for practical purposes, I agree we should simply go with the existing solution, backticks, and work around any issues.

grosu commented 8 years ago

@cos OK, we'll get back to this when the time comes. For now, let us please go ahead with the above. That is, just

 syntax K ::= "`" K "`"  [bracket]

plus adding explicit parentheses as brackets to pre-defined things. @dwightguth : how long does this take? Should we make into a task? Are you going to work on it or you want me to find somebody at UIUC?

dwightguth commented 8 years ago

I believe what I wanted to propose was that we add an unparser-avoid and unparser-prefer attributes that would tell the unparser to avoid the backtick production where possible. This should take less than a day to implement, but you should make it a task and assign it to someone. That same person should probably also modify the unparser so that productions that have the assoc tag are treated as if brackets are not needed within them. That is also just a couple hours of work.

grosu commented 8 years ago

@dwightguth But why not always use the user-defined brackets instead of backticks whenever possible, and use the backticks only when there are no user-defined brackets? Don't we already do this for things like (3+5)*7?

I'd prefer to avoid adding attributes specific to unparsing at this stage. If really needed, we can add them later, after we transit to KORE.

dwightguth commented 8 years ago

Grigore, the unparsing algorithm doesn't have any notion of privileged syntax. Everything is just a production. This attribute on the production in kast.k is what would tell the unparser to do such a thing.

bmmoore commented 8 years ago

@grosu I am also totally against requiring users to write or understand unicode to use K, but as long as they are not expected to write it, I don't see the harm in having an option to let them look at prettier output. Software support is widespread enough there shouldn't be many technical issues with optionally outputting unicode. The reason to consider an option now is that actually producing grammatical output from the production

syntax K ::= "`" K "`"  [bracket]

requires changing code to count brackets, while just having a flag to feed the unparser a grammar with a production like this instead

syntax K ::= "⟨" K "⟩" [bracket]

would be enough to get output like

<T> <k> .::K </k> <state> s |-> 66 ⟨r |-> 3 ⟨q |-> 1 ⟨n |-> 1 m |-> 2⟩⟩⟩ </state> </T>

bmmoore commented 8 years ago

So, I think we have this split into three smaller tasks, and hope that just doing the first two will go a long way towards solving this:

Make sure the unparser prefers user-provided brackets over K brackets
Don't put unnecessary brackets around associative labels
Make the unparse use increasing numbers of backticks for nested brackets.

By the way, our meeting room computer (still running XP) doesn't know those unicode brackets I used in the previous comment.

laurayuwen commented 8 years ago

I think I can have a look at the middle one tomorrow, but since I was away for long, I cannot guarantee to solve it within hours.

laurayuwen commented 8 years ago

I don't acutally have the problem of "unnecessary brackets around associative labels" as I just tested on latest commit 8e9603fb6e4df3cff6e8a7af88c9ec6c0c2990c7 The output of collatz.imp is perfectly fine.

grosu commented 8 years ago

@laurayuwen I'm assuming you are trying it with the --kore option everywhere, right?

@dwightguth have you done anything to fix this? Or somebody else?

dwightguth commented 8 years ago

Ask @andreistefanescu , it would have been a change to the rewriter to make collections be converted to K in left associative form instead of right associative form. Nothing has actually been done to make assoc collections not require a specific associativity; that task is still something that needs to be done.

laurayuwen commented 8 years ago

@grosu Oh, my fault.

laurayuwen commented 8 years ago

@dwightguth sorry the problem is still there.

laurayuwen commented 8 years ago

@dwightguth I can only find the unparsing to KAST format, but where does pretty printing from KAST to concrete syntax happen? Could you point me the file?

grosu commented 8 years ago

@laurayuwen , so what is the status of this? Please do not let anybody slow you down with this, not even @dwightguth. Contact him on hangouts if he does not answer, or ask somebody else.

grosu commented 8 years ago

@bmmoore regarding the unicode fonts, I've never seen any of those you used on any of my computers so far. I only see a little box with a question mark inside.

laurayuwen commented 8 years ago

@grosu Yes，I asked him on hangout yesterday, still working on that. I will try to finish during weekends.

edgar-pek commented 8 years ago

After I am done with kdoc I'll also look into this. ETA: Wednesday/Thursday?

bmmoore commented 8 years ago

@grosu I'm a bit surprised it affects you, but the general risk they might not be supported is why I'm definitely against depending or insisting on unicode, even though I do think it would be nice as an option. Both Linux and relatively recent versions of windows have good unicode support, are you perhaps still running XP (which as we saw with the meeting room computer doesn't know those brackets)? How many characters show up properly in the list in the first answer at http://stackoverflow.com/questions/13535172/list-of-all-unicodes-open-close-brackets

grosu commented 8 years ago

66 of them are ? or ? in a box.

I am not against unicode. I think we should allow people to use them as symbols in their production terminals, just like any other characters. In particular, if they want to change the K bracket they should be allowed to just modify their kore.k and have their own, unicode-based K brackets.

kframework / k-legacy

unparser? #1662