Open robshakir opened 4 years ago
Looking at this, I think this is actually better to restructure the other way around than I suggested initially. So - we'd internally restructure the code to be able to use the same pattern that ypathgen
uses - such that we use the pattern shown below:
This will mean that the first set of changes restructure GenerateGoCode
and GenerateProtoCode
internally to be able to rely on this pattern - and then once this is done, we then pull out the code that is now language specific into the packages that I suggested above.
There are a few changes here -- particularly, in the Go code generation, we lazily perform generation of types - however, with the changes that @wenovus has made recently, then we're already doing type generation pre-emptively for the enumerated types, and for ypathgen
we do type generation pre-emptively too. There's a big advantage to doing this - which is that we do not then need to carry around the state of how we named a particular enum through the flow of generating the Go code. This perhaps was fine when we were just doing one language, but now seems less useful.
https://github.com/openconfig/ygot/commit/bc8585c6386371a65faf44f88547175e2561211f is a start at this restructuring - it looks fairly large in terms of a change, but I think we're actually doing OK here, and the test coverage is helping.
There are three places in GetDefinitions
(which is GetDirectoriesAndLeafTypes
in the above example) where we're doing something Go-specific today:
newGoGenState
- we can replace this with a generic state implementation I think.findMapPaths
is being called with a bool
set to false
which means the output paths are relative.buildDirectoryDefinitions
is being called with two functions - one being goStructName
, and the other being yangTypeToGoType
- yangTypeToGoType
calls the things that I was suggesting would be in the enumNamer
interface - but the actual implementation of these is common. I'm leaving refactoring this out until a 2nd change (since this one is already quite large) - but it looks like making enumSet be a public thing of its own might help here, so that the yangTypeToLangType
calls can do this on a generic enumSet
with their preferred bool
arguments for naming.This looks a much cleaner change than the one that we would need to make to do things in the order suggested above.
This looks good! I like and agree to the design of having goStructName()
and resolveKeyTypeName()
(the function parameters to buildDirectoryDefinitions
) as part of a configuration passed into GetDefinitions
. The nuance is that the intermediate representation output of the new ygen
still contains language-specific, non-abstract, information (e.g. MappedType
); so, I think of the new ygen
as a step that repackages and renames the YANG information from the raw AST into a language-agnostic structure with language-specific names. Thus, passing these language-specific parameters is part of what the new ygen
does.
What I like most about the refactoring is the clarification of the intermediate representation (AST->IR->code
). The IR is compact and grouped into a single return type (copied below). I'm thinking that our efforts here cannot ensure that the IR will contain whatever a new language can throw at us, but the clarified information makes it easier to see the gaps between what the IR provides, and what a <newlang>gen
requires.
type Definitions struct {
Directories map[string]*Directory
LeafTypes map[string]map[string]*MappedType
Enums map[string]*EnumeratedType
ModelData []*gpb.ModelData
MappedPaths map[string][][]string
ParsedModules []*yang.Entry
}
Thanks for the thoughts here -- as I've been through the few attempts at figuring this out over the last few days, this comment, and one that you made last week have been really useful. I think thinking about this as a compiler problem is really useful -- such that we think about going from concrete YANG input, to an IR, and then to compiled code. Thus far there was a lot of layering violation happening in the code base for this. Thus, I ended up sketching something like this out;
Essentially:
goyang
' should take on the job of taking the input YANG files and creating the parsed/resolved AST that is []*yang.Entry
that we pull into ygen
.ygen
should create the IR that gives us the simplest possible mapping output, which handles all of the complexity of disambiguating YANG, and the framework for resolving types. It should also be entirely responsible for transforming the schema (doing path compression).<lang>Gen
libraries should be responsible for simply walking through these typed outputs, and creating the code in the expected language mechanically. They may choose to add new things in to the code, but these can only be exploiting the details that we expose from ygen.I'm interested in your thoughts and review on the following design decisions that I think came out of the prototyping that's in the various restruct-ygen
branches that I've been working on.
As you say, our IR actually takes into account some of the naming that we do within the code generation. I found that inserting something language specific was relatively easy with this interface:
type LangMapper interface {
// DirectoryName maps an input yang.Entry to the name that should be used in the
// intermediate representation (IR).
DirectoryName(*yang.Entry, genutil.CompressBehaviour) (string, error)
// KeyLeafType maps an input yang.Entry which must represent a leaf to the
// type that should be used when the leaf is used in the context of a
// list key within the output IR.
KeyLeafType(*yang.Entry, genutil.CompressBehaviour) (*MappedType, error)
// LeafType maps an input yang.Entry which must represent a leaf to the
// type that should be used when the leaf is used in the context of a
// field within a directory within the output IR.
LeafType(*yang.Entry, genutil.CompressBehaviour) (*MappedType, error)
// SetEnumSet is used to supply a set of enumerated values to the
// mapper such that leaves that have enumerated types can be looked up.
SetEnumSet(*enumSet)
// SetSchemaTree is used to supply a copy of the YANG schema tree to
// the mapped such that leaves of type leafref can be resolved to
// their target leaves.
SetSchemaTree(*schemaTree)
}
Using this type - any function that produces the IR can use the LangMapper
supplied to resolve those language-specific names that are used in the generated code. The DirectoryName
function can be stateful (and likely needs to be), but that state is maintained on a per LangMapper
implementation (some languages may require global uniqueness, whereas others might have scoping).
I'm proposing to give the LangMapper
some lookup functions too:
enumSet
which is the implementation that you created before. I made a couple of changes - primarily, I modified the implementation to store the state of the different transformation options that it used to generate the names to start with - since these cannot be mutated after the fact, it seemed like this abstracted away the transformation options from the mapper when considering enumerations. Thus, we end up with the following lookup functions
func (e *enumSet) LookupTypedef(yt *yang.YangType, entry *yang.Entry) (*MappedType, error)
func (e *enumSet) LookupEnum(yt *yang.YangType, entry *yang.Entry) (*MappedType, error)
func (e *enumSet) LookupIdentity(i *yang.Identity) (*MappedType, error)
LookupTypedef
has the semantics that it returns nil
if this was not a typedef enum, but the others return the MappedType
as before.
schemaTree
-- this is used to resolve leafref
type leaves ahead of time. It seemed complex to try and pre-resolve these before doing the leaf type generation - so the mapper can use this to look up a specific target leaf.As you observed in one of the commit comments, the enumeration parsing that I'd done went a bit too far insofar as it made the assumption that the only requirement was a map of integer IDs to code names, and then a map of integer IDs to YANG strings. Since some languages might want enumerations to be handled differently based on whether they're global or scoped to a message, this doesn't really make sense. I therefore added an interim representation of a parsed enumeration
or identityref
, particularly:
type EnumeratedValueType int64
const (
_ EnumeratedValueType = iota
SimpleEnumerationType
DerivedEnumerationType
UnionEnumerationType
IdentityType
)
type EnumeratedYANGType struct {
// Name is the name of the generated enumeration.
Name string
// Kind indicates whether the type is an enumeration, or an identityref.
Kind EnumeratedValueType
// ValuePrefix stores any prefix that has been annotated by the IR generation
// that specifies what prefix should be appended to value names within the type.
ValuePrefix []string
// TypeName stores the original YANG type name for the enumeration.
TypeName string
// identityBase is populated if the enumerated type was an Identity.
identityBase *yang.Identity
// enumType is populated if the enumerated type was an enumeration.
enumType *yang.EnumType
}
This representation allows downstream consumers to form their own output code - but removes as much of the complexity as possible from that handling code -- since they can only now deal with whether:
The only reason that we care about these different cases is that the code generation for (e.g., proto) is different for these cases - where the first and third cases are not generated as global enumerations, but rather scoped ones.
Thus far, the IR type that I'm proposing is as follows. Right now, the Go code can be entirely generated from this. I haven't yet restructured the proto generating code, but I think that it can also be generated using this IR too.
type Definitions struct {
// transformationOpts stores details of the schema transformations that were
// made when generating the IR, such that calling code can retrieve these
// elements without needing to be aware of the transformations.
transformationOpts TransformationOpts
// parsedModules caches the set of modules that were parsed for generating
// the IR.
parsedModules []*yang.Entry
// Directories is the set of containers (structs, messages) that should be
// generated from the generated code. The map is keyed by the path to the
// directory.
Directories map[string]*Directory
// LeafTypes stores the type that each leaf has been mapped to in the generated
// code. Directory entries are stored in the map for completeness, but are
// represented as a nil mapped type.
LeafTypes map[string]map[string]*MappedType
// Enums is the set of enumerated entries that are extracted from the generated
// code. The key of the map is the global name of the enumeration.
Enums map[string]*EnumeratedYANGType
// ModelData stores the metadata extracted from the input YANG modules.
ModelData []*gpb.ModelData
// MappedPaths stores the set of paths that each node within the YANG
// tree is mapped to.
MappedPaths map[string][][]string
}
func (d *Definitions) SchemaTree() ([]byte, error) {
rawSchema, err := buildJSONTree(d.parsedModules, d.Directories, d.RootEntry(), d.transformationOpts.CompressBehaviour.CompressEnabled())
if err != nil {
return nil, err
}
return rawSchema, nil
}
func (d *Definitions) RootDirectory() *Directory {
if !d.transformationOpts.GenerateFakeRoot {
return nil
}
v := d.Directories[fmt.Sprintf("/%s", resolveRootName(d.transformationOpts.FakeRootName, defaultRootName, d.transformationOpts.GenerateFakeRoot))]
return v
}
func (d *Definitions) RootEntry() *yang.Entry {
rd := d.RootDirectory()
if rd == nil {
return nil
}
return rd.Entry
}
func (d *Definitions) RootName() string {
rd := d.RootDirectory()
if rd == nil {
return ""
}
return rd.Name
}
The one place that I have some design concerns is the MappedPaths
field here -- it feels like the time that one wants to access this is in the context of outputting a particular leaf -- which makes me feel like LeafTypes
might not be the right return type -- if we rather had something like:
type LeafDetails struct {
Paths [][]string
Type *MappedType
}
then this might feel a bit cleaner. Equally -- rather than have two different Directories
and LeafTypes
outputs, we could rather converge Directories
and LeafTypes
and make the Leaves
within Directory
contain the type above. This seems to have some further implications for ypathgen
though - so might be a step too far.
I pushed to https://github.com/openconfig/ygot/tree/restruct-ygen-interface the diff size for this (including updating tests is +3170, -1732.
To split this out, I think the plan is likely:
LangMapper
implementationLangMapper
implementationGetDirectories
to return the IRgogen
consume the GetDirectories
outputprotogen
consume the GetDirectories
outputIf we agree on this design over the next few days I can start on that direction.
As discussed I agree with the design choices wholly. Using LeafDetails
within Directory
will require some mechanical refactoring in ypathgen
that actually makes its code cleaner, so it's a welcome change.
I agree with using the LangMapper
interface for the input into the new ygen
. A quick look finds that most of ygen
's functionality can be split into 1. Directory
and leaf type generation, and 2. inserting the information into templates. Having ygen
split on this line looks to do a good job of significantly reducing the complexity of the resulting packages.
May want to consider how to fit the IR to allow adding a pluggable callback when generating code: https://github.com/openconfig/ygot/issues/522
@robshakir I'm going to pick this up from your branch at https://github.com/openconfig/ygot/tree/restruct-ygen-interface and https://github.com/openconfig/ygot/tree/ygen-ir-langmapper.
So a few proposals to reduce the complexity of the design as well as to allow for custom user passes:
@robshakir I think the "IR generation residing in an internal package" might be contentious, so I want to raise your attention on this one. In particular currently-exported names like MappedType
will require changes in my current draft of proto generation and will become a protected name under this proposal. I'd argue making this type protected would help with maintainability in the future. It would also set the stage for a potential move of some of the ygot-specific util
definitions and functions into this protected package. For reference here are all of the current exported names: https://pkg.go.dev/github.com/openconfig/ygot/ygen
So a few proposals to reduce the complexity of the design as well as to allow for custom user passes:
- All code should be put into a template structure before being passed to templates for string generation. This allows custom passes in the per-language generation. These passes go in between the output of gogen/protogen and when the template structures are finally fed to the templates.
Essentially here, you're saying that there should not be methods called from the templates, it should be done as a pre-processing step? I think that this means that we introduce another stage into the diagram at the top of this comment which says "templates" after "lang gen". We would impose the restriction that the "templates" step can only do population of supplied strings. Did I understand you right?
- IR generation resides in a protected package. internal/ygen: this allows the IR to be changed later on as new functionalities are introduced. This also allows for custom passes to easily add things to the IR when they require new information from the yang.Entry structures. I've noticed while making proto generation work that it requires more information than Go generation (due to nesting), and it would be unsurprising if for generation of new languages, that they would also require more information than what would be sufficient for Go and proto.
That sounds OK to me. One question -- if the IR changes, this is solely to add information into the IR, rather than remove it, right? I can't tell whether you're suggesting here that the IR should be considered internal/unstable (such that you need to refactor downstream code generation when this internal package is updated), or whether it's still a stable representation - but other languages would maybe require more of the IR context to generate their contents?
w.r.t your follow up comment, can you define what you mean by "protected" in this context? Particularly, with MappedType
. If I understand you right, you're saying that this would not be something that is publicly exposed in the IR -- but then I'm not quite sure where the actual type information would be stored. Do you have a draft of what you're proposing the IR proposal above is adjusted to?
- Name uniquification doesn't necessarily occur during IR generation, but is resolved in the later language-specific pass. This reduces the complexity for nested proto generation. This doesn't involve a change in the current design, but more of a documentation issue where we should note to IR consumers that whatever language-specific names generated by the IR is not final, and that they can consider extra passes as a good way to reduce their own implementation complexity.
Yes, I think this makes sense. I think name uniquification and mapping probably has to take place later on anyway - since different languages might have different approaches for naming each field and the struct types themselves.
From reloading context from above, does this mean that you're proposing to get rid of the LangMapper
interface entirely? I couldn't quite tell. Particularly, if we do this, does this have implications on types - since you still will need to know how to refer to a specific enum
type for example in the type field of however you're returning leaves. I think I'm perhaps missing the wider design context here too.
So a few proposals to reduce the complexity of the design as well as to allow for custom user passes:
- All code should be put into a template structure before being passed to templates for string generation. This allows custom passes in the per-language generation. These passes go in between the output of gogen/protogen and when the template structures are finally fed to the templates.
Essentially here, you're saying that there should not be methods called from the templates, it should be done as a pre-processing step? I think that this means that we introduce another stage into the diagram at the top of this comment which says "templates" after "lang gen". We would impose the restriction that the "templates" step can only do population of supplied strings. Did I understand you right?
I'm didn't quite understand what you mean by methods. What I'm proposing is that in order to support custom passes that add extra information such as custom tags (https://github.com/openconfig/ygot/issues/522), we could enforce that the entire generated code is placed in a structure suitable for template execution instead of generating parts of the generated piecemeal and then concatenating them together at the end. So for example currently we generate
[]GoStructCodeSnippet
which hasStructDef
as just a string, so it's useless to the person who wants to add custom tags. If, however, it weregeneratedGoStruct
, then someone can simply add their tag by modifying theTags
field. We would help the user do this by allowing them to access the IR at the same time. So this actually suggests that we should make the IR public instead of protected (i.e. in an internal package) as I proposed.
- IR generation resides in a protected package. internal/ygen: this allows the IR to be changed later on as new functionalities are introduced. This also allows for custom passes to easily add things to the IR when they require new information from the yang.Entry structures. I've noticed while making proto generation work that it requires more information than Go generation (due to nesting), and it would be unsurprising if for generation of new languages, that they would also require more information than what would be sufficient for Go and proto.
That sounds OK to me. One question -- if the IR changes, this is solely to add information into the IR, rather than remove it, right? I can't tell whether you're suggesting here that the IR should be considered internal/unstable (such that you need to refactor downstream code generation when this internal package is updated), or whether it's still a stable representation - but other languages would maybe require more of the IR context to generate their contents?
w.r.t your follow up comment, can you define what you mean by "protected" in this context? Particularly, with
MappedType
. If I understand you right, you're saying that this would not be something that is publicly exposed in the IR -- but then I'm not quite sure where the actual type information would be stored. Do you have a draft of what you're proposing the IR proposal above is adjusted to?I think you're right. I was concerned that even adding information is not backwards-compatible because if would break tests, but omitting those fields during tests would work, since their tests would only supply IR as input, rather than as expected output. Making it public also help custom passes to be more easily done. I will withdraw this item.
- Name uniquification doesn't necessarily occur during IR generation, but is resolved in the later language-specific pass. This reduces the complexity for nested proto generation. This doesn't involve a change in the current design, but more of a documentation issue where we should note to IR consumers that whatever language-specific names generated by the IR is not final, and that they can consider extra passes as a good way to reduce their own implementation complexity.
Yes, I think this makes sense. I think name uniquification and mapping probably has to take place later on anyway - since different languages might have different approaches for naming each field and the struct types themselves.
From reloading context from above, does this mean that you're proposing to get rid of the
LangMapper
interface entirely? I couldn't quite tell. Particularly, if we do this, does this have implications on types - since you still will need to know how to refer to a specificenum
type for example in the type field of however you're returning leaves. I think I'm perhaps missing the wider design context here too.
No I'm going to keep LangMapper
, these are just minor proposals to the existing design.
ygen
started out as a library that was solely handling generation of Go code, subsequently it has transitioned into being one that generates three different outputs:struct
or a Protobufmessage
.This means that we have a large multi-purpose package that is becoming complex to maintain. Since the only consumers of the public APIs are within our generator binaries, then we have a bit of scope to be able to shuffle this package around - and this should definitely be done prior to a 1.0 release of
ygot
.To this end, I suggest that we restructure this such that we have:
ygen
that performs the base mapping of an input YANG set of files into an abstract representation.ygot/gogen
- a specific package that generates Go code, this should callygen
public APIs to be able to do all the pre-processing and then have the logic to map that into Go.ygot/protogen
- a specific package that generates Proto code, similar toygot/gogen
above.This refactoring should likely be done in a few stages:
GenerateGoCode
to call something close toGenerateDirectoriesAndLeaves
GenerateDirectoriesAndLeaves
language agnostic (see discussion below)GenerateProto3Code
to call something close toGenerateDirectoriesAndLeaves
GenerateGoCode
to its own package.GenerateProto3Code
to its own package.This is a fairly major restructuring, so we'll likely go through some changes to APIs that are incompatible with each other to start with, so we should be careful about figuring out when to cut the next version of ygot - and make sure that this is an RC for 1.0.