knrafto / language-bash

Parse and pretty-print Bash shell scripts
BSD 3-Clause "New" or "Revised" License
35 stars 9 forks source link

Annotate AST with comments and source positions #14

Open pbiggar opened 8 years ago

pbiggar commented 8 years ago

It would be useful for me if comments were put in the AST in a "Comment" construct (I plan to translate them into Comments in my new language).

knrafto commented 8 years ago

This seems difficult to preserve. Comments could appear anywhere, even in the middle of commands. For example, should there be a difference between the following two examples?

if foo; then # comment
  echo hi
fi
if foo # comment
  then
    echo hi
fi
pbiggar commented 8 years ago

Typically, comments are stored as metadata on the AST node to which they are associated (you can't get that perfect, but typically you can get it pretty good). As an example, here it is in phc (a PHP compiler I worked on a lot): http://phpcompiler.org/doc/latest/representingphp.html#comments

Often, parsers will include line and column numbers in the metadata too.

I'm not sure how to do metadata idiomatically in Haskell, so I don't have a recommendation for how to do this is language-bash per se, this is just how it's typically done.

anka-213 commented 7 years ago

My first thought would be to have a general data type for storing meta-data for some node. Perhaps something like this:

data Metadata a = Metadata { comment::Maybe String, line:: Int, column::Int, value :: a }

Here is an example of a slightly different approach (https://hackage.haskell.org/package/language-javascript-0.6.0.8/docs/Language-JavaScript-Parser.html):

data JSAnnot
    = JSAnnot !TokenPosn ![CommentAnnotation] -- ^Annotation: position and comment/whitespace information
    | JSNoAnnot -- ^No annotation
    deriving (Data, Eq, Show, Typeable)

which is then used like this:

data JSBlock = JSBlock !JSAnnot ![JSStatement] !JSAnnot
knrafto commented 7 years ago

Another user contacted me requesting source positions in the AST as well.

mmhat commented 4 years ago

Concerning this issue I' d like to explore the design space of GADTs. We could move to a single datatype representing the AST:

data SimpleCommandT
data CommandT

data AST a t where
    SimpleCommand :: ... -> AST a SimpleCommandT
    Command :: AST a SimpleCommandT -> ... -> AST a CommandT
    Annotated :: a -> AST a t -> AST a t
    Comment :: String -> AST a t -> AST a t

-- Backwards compability and convenience type aliases
type SimpleCommand = AST () SimpleCommandT
type Command = AST () CommandT
mmhat commented 4 years ago

Ok, GADTs don't work because of the limitiations of the deriving mechanism. I'll dig deeper into this.

anka-213 commented 4 years ago

I don't know if it is overkill, but I can't help but think of Trees that Grow when it comes to the problem of decorating an AST with metadata.

mmhat commented 4 years ago

I had a look at the trees-that-grow solution and it looks quite promissing. (thanks @anka-213) Proposal: Move all types in Language.Bash.Syntax into one BashSyn type. Since this is a breaking change I'd like to hear @knrafto s opinion about that.

knrafto commented 4 years ago

Could you say more about what this BashSyn type would look like?

mmhat commented 4 years ago

@knrafto Here is an example:

-- The different contexts of out AST nodes.
-- Those resemble the type constructors we have in Language.Bash.Syntax.
data Command
data ShellCommand
data Redir
...

-- This decides if a data constructor is applicable in the actual context.
-- If so, we annotate the constructor with ann.
-- If not, the annotation is Void and such a value cannot be constructed.
type family Ann requested actual ann where
    Ann a a ann = ann
    Ann a b ann = Void

-- The data type for all the syntactic elements of Bash. This type has approximately 50 (!)  constructors.
data BashSyn actual ann
    = Command (Ann Command actual ann) (BashSyn ShellCommand ann) [BashSyn Redir ann]
    | SimpleCommand (Ann ShellCommand actual ann) [BashSyn Assign ann] [Word.Word]
    ...

Although this approach is not difficult to implement it requires a fair amount of boilerplate. I have a (somewhat) working implementation. The Ord, Bounded and Enum instances are missing but otherwise it seems to work. I don't have much time right now but I hope to publish it soon. Then we can still decide if we want that or not, but at least we have some prototype we can talk about.

knrafto commented 4 years ago

Hmm, that does look a bit cumbersome. What's the advantage of having a single datatype to represent all AST structures, rather than a separate datatype for each concept?