openconfig / ygot

A YANG-centric Go toolkit - Go/Protobuf Code Generation; Validation; Marshaling/Unmarshaling
Apache License 2.0
286 stars 107 forks source link

Restructure Code Generation in ygen #412

Open robshakir opened 4 years ago

robshakir commented 4 years ago

ygen started out as a library that was solely handling generation of Go code, subsequently it has transitioned into being one that generates three different outputs:

This means that we have a large multi-purpose package that is becoming complex to maintain. Since the only consumers of the public APIs are within our generator binaries, then we have a bit of scope to be able to shuffle this package around - and this should definitely be done prior to a 1.0 release of ygot.

To this end, I suggest that we restructure this such that we have:

This refactoring should likely be done in a few stages:

This was originally:

* [ ] Mechanical move of Go code generation into `gogen`.
* [ ] Mechanical move of Proto code generation into `protogen`
* [ ] Refactoring the generation of `Directories` to be disconnected from the generation of Go (today, there is a linkage to the Go generation since we do not abstract the state storage interface).
* [ ] Refactoring `ygot/gogen` to use the abstract Directories as input.
* [ ] Refactoring `ygot/protogen` to use the abstract Directories as input.

This is a fairly major restructuring, so we'll likely go through some changes to APIs that are incompatible with each other to start with, so we should be careful about figuring out when to cut the next version of ygot - and make sure that this is an RC for 1.0.

robshakir commented 4 years ago

Looking at this, I think this is actually better to restructure the other way around than I suggested initially. So - we'd internally restructure the code to be able to use the same pattern that ypathgen uses - such that we use the pattern shown below:

ygot

This will mean that the first set of changes restructure GenerateGoCode and GenerateProtoCode internally to be able to rely on this pattern - and then once this is done, we then pull out the code that is now language specific into the packages that I suggested above.

There are a few changes here -- particularly, in the Go code generation, we lazily perform generation of types - however, with the changes that @wenovus has made recently, then we're already doing type generation pre-emptively for the enumerated types, and for ypathgen we do type generation pre-emptively too. There's a big advantage to doing this - which is that we do not then need to carry around the state of how we named a particular enum through the flow of generating the Go code. This perhaps was fine when we were just doing one language, but now seems less useful.

robshakir commented 4 years ago

https://github.com/openconfig/ygot/commit/bc8585c6386371a65faf44f88547175e2561211f is a start at this restructuring - it looks fairly large in terms of a change, but I think we're actually doing OK here, and the test coverage is helping.

There are three places in GetDefinitions (which is GetDirectoriesAndLeafTypes in the above example) where we're doing something Go-specific today:

This looks a much cleaner change than the one that we would need to make to do things in the order suggested above.

wenovus commented 4 years ago

This looks good! I like and agree to the design of having goStructName() and resolveKeyTypeName() (the function parameters to buildDirectoryDefinitions) as part of a configuration passed into GetDefinitions. The nuance is that the intermediate representation output of the new ygen still contains language-specific, non-abstract, information (e.g. MappedType); so, I think of the new ygen as a step that repackages and renames the YANG information from the raw AST into a language-agnostic structure with language-specific names. Thus, passing these language-specific parameters is part of what the new ygen does.

What I like most about the refactoring is the clarification of the intermediate representation (AST->IR->code). The IR is compact and grouped into a single return type (copied below). I'm thinking that our efforts here cannot ensure that the IR will contain whatever a new language can throw at us, but the clarified information makes it easier to see the gaps between what the IR provides, and what a <newlang>gen requires.

type Definitions struct {
    Directories   map[string]*Directory
    LeafTypes     map[string]map[string]*MappedType
    Enums         map[string]*EnumeratedType
    ModelData     []*gpb.ModelData
    MappedPaths   map[string][][]string
    ParsedModules []*yang.Entry
}
robshakir commented 4 years ago

Thanks for the thoughts here -- as I've been through the few attempts at figuring this out over the last few days, this comment, and one that you made last week have been really useful. I think thinking about this as a compiler problem is really useful -- such that we think about going from concrete YANG input, to an IR, and then to compiled code. Thus far there was a lot of layering violation happening in the code base for this. Thus, I ended up sketching something like this out;

Screen Shot 2020-07-15 at 5 09 03 PM

Essentially:

Design Discussions

I'm interested in your thoughts and review on the following design decisions that I think came out of the prototyping that's in the various restruct-ygen branches that I've been working on.

Naming

As you say, our IR actually takes into account some of the naming that we do within the code generation. I found that inserting something language specific was relatively easy with this interface:


type LangMapper interface {
    // DirectoryName maps an input yang.Entry to the name that should be used in the
    // intermediate representation (IR).
    DirectoryName(*yang.Entry, genutil.CompressBehaviour) (string, error)
    // KeyLeafType maps an input yang.Entry which must represent a leaf to the 
    // type that should be used when the leaf is used in the context of a 
    // list key within the output IR.
    KeyLeafType(*yang.Entry, genutil.CompressBehaviour) (*MappedType, error)
    // LeafType maps an input yang.Entry which must represent a leaf to the
    // type that should be used when the leaf is used in the context of a
    // field within a directory within the output IR.
    LeafType(*yang.Entry, genutil.CompressBehaviour) (*MappedType, error)

    // SetEnumSet is used to supply a set of enumerated values to the
    // mapper such that leaves that have enumerated types can be looked up.
    SetEnumSet(*enumSet)

    // SetSchemaTree is used to supply a copy of the YANG schema tree to
    // the mapped such that leaves of type leafref can be resolved to
    // their target leaves.
    SetSchemaTree(*schemaTree)
}

Using this type - any function that produces the IR can use the LangMapper supplied to resolve those language-specific names that are used in the generated code. The DirectoryName function can be stateful (and likely needs to be), but that state is maintained on a per LangMapper implementation (some languages may require global uniqueness, whereas others might have scoping).

I'm proposing to give the LangMapper some lookup functions too:

Enumerations

As you observed in one of the commit comments, the enumeration parsing that I'd done went a bit too far insofar as it made the assumption that the only requirement was a map of integer IDs to code names, and then a map of integer IDs to YANG strings. Since some languages might want enumerations to be handled differently based on whether they're global or scoped to a message, this doesn't really make sense. I therefore added an interim representation of a parsed enumeration or identityref, particularly:

type EnumeratedValueType int64

const (
    _ EnumeratedValueType = iota
    SimpleEnumerationType
    DerivedEnumerationType
    UnionEnumerationType
    IdentityType
)

type EnumeratedYANGType struct {
    // Name is the name of the generated enumeration.
    Name         string
    // Kind indicates whether the type is an enumeration, or an identityref.
    Kind         EnumeratedValueType
    // ValuePrefix stores any prefix that has been annotated by the IR generation
    // that specifies what prefix should be appended to value names within the type.
    ValuePrefix  []string
    // TypeName stores the original YANG type name for the enumeration.
    TypeName     string

    // identityBase is populated if the enumerated type was an Identity.
    identityBase *yang.Identity
    // enumType is populated if the enumerated type was an enumeration.
    enumType     *yang.EnumType
}

This representation allows downstream consumers to form their own output code - but removes as much of the complexity as possible from that handling code -- since they can only now deal with whether:

The only reason that we care about these different cases is that the code generation for (e.g., proto) is different for these cases - where the first and third cases are not generated as global enumerations, but rather scoped ones.

Final IR type

Thus far, the IR type that I'm proposing is as follows. Right now, the Go code can be entirely generated from this. I haven't yet restructured the proto generating code, but I think that it can also be generated using this IR too.

type Definitions struct {
    // transformationOpts stores details of the schema transformations that were
    // made when generating the IR, such that calling code can retrieve these
    // elements without needing to be aware of the transformations.
    transformationOpts TransformationOpts
    // parsedModules caches the set of modules that were parsed for generating
    // the IR.
    parsedModules []*yang.Entry

    // Directories is the set of containers (structs, messages) that should be
    // generated from the generated code. The map is keyed by the path to the
    // directory.
    Directories map[string]*Directory
    // LeafTypes stores the type that each leaf has been mapped to in the generated
    // code. Directory entries are stored in the map for completeness, but are
    // represented as a nil mapped type.
    LeafTypes map[string]map[string]*MappedType
    // Enums is the set of enumerated entries that are extracted from the generated
    // code. The key of the map is the global name of the enumeration.
    Enums map[string]*EnumeratedYANGType

    // ModelData stores the metadata extracted from the input YANG modules.
    ModelData     []*gpb.ModelData
    // MappedPaths stores the set of paths that each node within the YANG
    // tree is mapped to.
    MappedPaths   map[string][][]string
}

func (d *Definitions) SchemaTree() ([]byte, error) {
    rawSchema, err := buildJSONTree(d.parsedModules, d.Directories, d.RootEntry(), d.transformationOpts.CompressBehaviour.CompressEnabled())
    if err != nil {
        return nil, err
    }
    return rawSchema, nil
}

func (d *Definitions) RootDirectory() *Directory {
    if !d.transformationOpts.GenerateFakeRoot {
        return nil
    }
    v := d.Directories[fmt.Sprintf("/%s", resolveRootName(d.transformationOpts.FakeRootName, defaultRootName, d.transformationOpts.GenerateFakeRoot))]
    return v
}

func (d *Definitions) RootEntry() *yang.Entry {
    rd := d.RootDirectory()
    if rd == nil {
        return nil
    }
    return rd.Entry
}

func (d *Definitions) RootName() string {
    rd := d.RootDirectory()
    if rd == nil {
        return ""
    }
    return rd.Name
}

The one place that I have some design concerns is the MappedPaths field here -- it feels like the time that one wants to access this is in the context of outputting a particular leaf -- which makes me feel like LeafTypes might not be the right return type -- if we rather had something like:

type LeafDetails struct {
  Paths [][]string
  Type *MappedType
}

then this might feel a bit cleaner. Equally -- rather than have two different Directories and LeafTypes outputs, we could rather converge Directories and LeafTypes and make the Leaves within Directory contain the type above. This seems to have some further implications for ypathgen though - so might be a step too far.

robshakir commented 4 years ago

I pushed to https://github.com/openconfig/ygot/tree/restruct-ygen-interface the diff size for this (including updating tests is +3170, -1732.

To split this out, I think the plan is likely:

If we agree on this design over the next few days I can start on that direction.

wenovus commented 4 years ago

As discussed I agree with the design choices wholly. Using LeafDetails within Directory will require some mechanical refactoring in ypathgen that actually makes its code cleaner, so it's a welcome change.

I agree with using the LangMapper interface for the input into the new ygen. A quick look finds that most of ygen's functionality can be split into 1. Directory and leaf type generation, and 2. inserting the information into templates. Having ygen split on this line looks to do a good job of significantly reducing the complexity of the resulting packages.

wenovus commented 3 years ago

May want to consider how to fit the IR to allow adding a pluggable callback when generating code: https://github.com/openconfig/ygot/issues/522

wenovus commented 2 years ago

@robshakir I'm going to pick this up from your branch at https://github.com/openconfig/ygot/tree/restruct-ygen-interface and https://github.com/openconfig/ygot/tree/ygen-ir-langmapper.

wenovus commented 2 years ago

So a few proposals to reduce the complexity of the design as well as to allow for custom user passes:

wenovus commented 2 years ago

@robshakir I think the "IR generation residing in an internal package" might be contentious, so I want to raise your attention on this one. In particular currently-exported names like MappedType will require changes in my current draft of proto generation and will become a protected name under this proposal. I'd argue making this type protected would help with maintainability in the future. It would also set the stage for a potential move of some of the ygot-specific util definitions and functions into this protected package. For reference here are all of the current exported names: https://pkg.go.dev/github.com/openconfig/ygot/ygen

robshakir commented 2 years ago

So a few proposals to reduce the complexity of the design as well as to allow for custom user passes:

  • All code should be put into a template structure before being passed to templates for string generation. This allows custom passes in the per-language generation. These passes go in between the output of gogen/protogen and when the template structures are finally fed to the templates.

Essentially here, you're saying that there should not be methods called from the templates, it should be done as a pre-processing step? I think that this means that we introduce another stage into the diagram at the top of this comment which says "templates" after "lang gen". We would impose the restriction that the "templates" step can only do population of supplied strings. Did I understand you right?

  • IR generation resides in a protected package. internal/ygen: this allows the IR to be changed later on as new functionalities are introduced. This also allows for custom passes to easily add things to the IR when they require new information from the yang.Entry structures. I've noticed while making proto generation work that it requires more information than Go generation (due to nesting), and it would be unsurprising if for generation of new languages, that they would also require more information than what would be sufficient for Go and proto.

That sounds OK to me. One question -- if the IR changes, this is solely to add information into the IR, rather than remove it, right? I can't tell whether you're suggesting here that the IR should be considered internal/unstable (such that you need to refactor downstream code generation when this internal package is updated), or whether it's still a stable representation - but other languages would maybe require more of the IR context to generate their contents?

w.r.t your follow up comment, can you define what you mean by "protected" in this context? Particularly, with MappedType. If I understand you right, you're saying that this would not be something that is publicly exposed in the IR -- but then I'm not quite sure where the actual type information would be stored. Do you have a draft of what you're proposing the IR proposal above is adjusted to?

  • Name uniquification doesn't necessarily occur during IR generation, but is resolved in the later language-specific pass. This reduces the complexity for nested proto generation. This doesn't involve a change in the current design, but more of a documentation issue where we should note to IR consumers that whatever language-specific names generated by the IR is not final, and that they can consider extra passes as a good way to reduce their own implementation complexity.

Yes, I think this makes sense. I think name uniquification and mapping probably has to take place later on anyway - since different languages might have different approaches for naming each field and the struct types themselves.

From reloading context from above, does this mean that you're proposing to get rid of the LangMapper interface entirely? I couldn't quite tell. Particularly, if we do this, does this have implications on types - since you still will need to know how to refer to a specific enum type for example in the type field of however you're returning leaves. I think I'm perhaps missing the wider design context here too.

wenovus commented 2 years ago

So a few proposals to reduce the complexity of the design as well as to allow for custom user passes:

  • All code should be put into a template structure before being passed to templates for string generation. This allows custom passes in the per-language generation. These passes go in between the output of gogen/protogen and when the template structures are finally fed to the templates.

Essentially here, you're saying that there should not be methods called from the templates, it should be done as a pre-processing step? I think that this means that we introduce another stage into the diagram at the top of this comment which says "templates" after "lang gen". We would impose the restriction that the "templates" step can only do population of supplied strings. Did I understand you right?

I'm didn't quite understand what you mean by methods. What I'm proposing is that in order to support custom passes that add extra information such as custom tags (https://github.com/openconfig/ygot/issues/522), we could enforce that the entire generated code is placed in a structure suitable for template execution instead of generating parts of the generated piecemeal and then concatenating them together at the end. So for example currently we generate []GoStructCodeSnippet which has StructDef as just a string, so it's useless to the person who wants to add custom tags. If, however, it were generatedGoStruct, then someone can simply add their tag by modifying the Tags field. We would help the user do this by allowing them to access the IR at the same time. So this actually suggests that we should make the IR public instead of protected (i.e. in an internal package) as I proposed.

  • IR generation resides in a protected package. internal/ygen: this allows the IR to be changed later on as new functionalities are introduced. This also allows for custom passes to easily add things to the IR when they require new information from the yang.Entry structures. I've noticed while making proto generation work that it requires more information than Go generation (due to nesting), and it would be unsurprising if for generation of new languages, that they would also require more information than what would be sufficient for Go and proto.

That sounds OK to me. One question -- if the IR changes, this is solely to add information into the IR, rather than remove it, right? I can't tell whether you're suggesting here that the IR should be considered internal/unstable (such that you need to refactor downstream code generation when this internal package is updated), or whether it's still a stable representation - but other languages would maybe require more of the IR context to generate their contents?

w.r.t your follow up comment, can you define what you mean by "protected" in this context? Particularly, with MappedType. If I understand you right, you're saying that this would not be something that is publicly exposed in the IR -- but then I'm not quite sure where the actual type information would be stored. Do you have a draft of what you're proposing the IR proposal above is adjusted to?

I think you're right. I was concerned that even adding information is not backwards-compatible because if would break tests, but omitting those fields during tests would work, since their tests would only supply IR as input, rather than as expected output. Making it public also help custom passes to be more easily done. I will withdraw this item.

  • Name uniquification doesn't necessarily occur during IR generation, but is resolved in the later language-specific pass. This reduces the complexity for nested proto generation. This doesn't involve a change in the current design, but more of a documentation issue where we should note to IR consumers that whatever language-specific names generated by the IR is not final, and that they can consider extra passes as a good way to reduce their own implementation complexity.

Yes, I think this makes sense. I think name uniquification and mapping probably has to take place later on anyway - since different languages might have different approaches for naming each field and the struct types themselves.

From reloading context from above, does this mean that you're proposing to get rid of the LangMapper interface entirely? I couldn't quite tell. Particularly, if we do this, does this have implications on types - since you still will need to know how to refer to a specific enum type for example in the type field of however you're returning leaves. I think I'm perhaps missing the wider design context here too.

No I'm going to keep LangMapper, these are just minor proposals to the existing design.