o3de / sig-content

8 stars 12 forks source link

Proposed RFC Feature: Generic DOM #10

Closed NicholasVanSickle closed 3 years ago

NicholasVanSickle commented 3 years ago

Summary:

This proposal outlines the addition of a Generic DOM API, which represents an in-memory representation of a document object model suitable for modeling data stored in forms analagous to JSON or XML documents.

The Generic DOM shall be usable as a drop-in replacement for a rapidjson::GenericDocument but with the added benefits of:

What is the relevance of this feature?

The Generic DOM is intended to align the backend representation and API for several existing and upcoming features that involve DOM manipulation, including:

Aligning on a common API allows shared tooling and serialization support. This is aimed towards avoiding duplication of effort by ensuring serialization and utility methods may be shared between all projects using the generic DOM. Additionally, by decoupling the DOM storage format from the underlying serialization, we gain flexibility in choosing the optimal data format to serialize to, instead of writing a bespoke solution for just a given format.

The Generic DOM aims to shoot for the following tenets:

The Generic DOM is not intended to be a runtime replacement for the binary serialization provided by O3DE's reflection system. Processed assets using O3DE reflection shall still allow for baking assets into an efficient, binary format. The Generic DOM is intended to be used either in tools at authoring time or for deserializing types that come from sources outside of the engine, such as JSON payloads from web services.

Feature design description:

The Generic DOM shall consist of a Visitor / Reader API with several backends, an in-memory Document API, and a patch generation API.

The Visitor / Reader API shall support transforming DOM data from one format to another through a visiting mechanism in which each DOM element is composed into method calls listened to by backend formats that compose a transformation. An arbitrary value may be read from one backend format and written to another backend format. Backends shall be exposed for formats on an as-needed basis.

The initial backends shall include JSON and XML, with a visitor interface for visiting an in-memory Generic DOM as well.

The Document API shall expose a Document that is a handle to a top-level Value. Value is a variant type that stores one of the following types of values:

The Document API shall also expose a Path class that represents a vector of keys to provide a unique path within the DOM, similar to JSON-pointer that refers to a given Value in a Document. Their resolution rules will follow the JSON pointer standard, with a couple caveats:

Serialization shall be handled by external serializer classes for each supported DOM format: JSON and XML to start with. AZStd::any shall not be supported directly for serialization, but a callback may be added to the serializers to attempt to resolve an any into another Value type (using the O3DE reflection system, for instance), if desired.

Serializers are not part of the core Generic DOM API. A registration system to map file formats to "default" format serializers that provide one possible mapping of a DOM to and from a given representation shall be provided, but other serializers may use the Visitor / Reader API to implement arbitrary format support.

JsonBackend shall be responsible for serialization into JSON and deserialziation out of JSON. All values shall use their canonical JSON representations except Nodes, which will be marked as only being supported as Emulated in the backend. Nodes shall be converted into Objects before being serialized. Upon deserialization, Nodes shall be brought in as Objects, but may be losslessly converted back to Nodes.

A node's Emulated object representation is comprised of a JSON object with a single key, its value an array comprised of an object containing all of the Node attributes (an empty object in a case where no attributes are specified), followed by the sequence of the node's children.

Here's an example of a Value JSON serialized using Node types:

{
    "Reflection": [
        {
            "namespace": "LmbrCentral"
        },
        {
            "Include": {
                "path": "AzToolsFramework/ToolsComponents/EditorComponentBase.h",
                "in": "header"
            }
        },
        {
            "Include": {
                "path": "Source/Editor/EditorCommentComponent.reflection.h",
                "in": "body"
            }
        }
    ]
}

Versus its XML representation produced by XML serialization:

<Reflection namespace="LmbrCentral">
    <Include path="AzToolsFramework/ToolsComponents/EditorComponentBase.h" in="header" />
    <Include path="Source/Editor/EditorCommentComponent.reflection.h" in="body" />
</Reflection>

XmlBackend shall use be responsible for serialization into XML and deserialization out of XML. XML shall always serialize its root element as a Node and shall rely upon tags in the o3de namespace to support functionality that doesn't have a 1:1 XML correspondance:

Here's an abridged example of a Value XML serialized using some of these extension types:

<o3de:Document xmlns:o3de="https://o3de.org/">
    <o3de:Container>
        <o3de:MapEntry name="ContainerEntity">
            <o3de:MapEntry name="Id">ContainerEntity</o3de:MapEntry>
            <o3de:MapEntry name="Name">Pawn</o3de:MapEntry>
            <o3de:MapEntry name="Components">
                <o3de:MapEntry name="Component_[11484796288798653912]">
                    <o3de:MapEntry name="$type">EditorInspectorComponent</o3de:MapEntry>
                    <o3de:MapEntry name="Id">11484796288798653912</o3de:MapEntry>
                    <o3de:MapEntry name="ComponentOrderEntryArray">
                        <o3de:ArrayEntry>
                            <o3de:MapEntry name="ComponentId">16987082192253801011</o3de:MapEntry>
                        </o3de:ArrayEntry>
                    </o3de:MapEntry>
            </o3de:MapEntry>
        </o3de:MapEntry>
    </o3de:Container>
</o3de:Document>

Note: The o3de namespace and its corresponding tags are only required for DOMs that don't correspond 1:1 with an XML representation, as expressed above. A Generic DOM hierarchy expressed as Node types with no Objects or Arrays shall not have any of these o3de: tags at all and instead look like XML as dictated by their structure, as seen in the Reflection example above.

Versus its JSON representation:

{
    "ContainerEntity": {
        "Id": "ContainerEntity",
        "Name": "Pawn",
        "Components": {
            "Component_[11484796288798653912]": {
                "$type": "EditorInspectorComponent",
                "Id": 11484796288798653912,
                "ComponentOrderEntryArray": [
                    {
                        "ComponentId": 16987082192253801011
                    }
                ]
            }
        }
    }
}

Technical design description:

The visitor API, modeled after rapidjson's SAX streaming, allows consumers to adapt from a VisitorInterface to produce a DOM from a given backend. These visitors can be used to read data and be composed to move data into another format, be it an in-memory representation or an on-disk one.

Visitor API:

struct VisitorInterface
{
    // All methods return true or false dependent on whether an operation was succesful.
    virtual bool Null();
    virtual bool Bool(bool);
    virtual bool Int(int64_t);
    virtual bool Uint(uint64_t);
    virtual bool Double(double);
    virtual bool String(const char* string, size_t length, bool copy);

    // The opaque type is not designed to be serialized and will be rejected by standard serializers
    virtual bool OpaqueValue(const AZStd::any& value) { return false; }

    virtual bool StartNode(const char* name, size_t length, bool copy);
    virtual bool EndNode(size_t attributeCount, size_t elementCount);

    virtual bool StartObject();
    virtual bool EndObject(size_t attributeCount);
    virtual bool Key(const char*, size_t length, bool copy);

    virtual bool StartArray();
    virtual bool EndArray(size_t elementCount);
};

//! Backends are registered centrally and used to transition DOM formats to and from a given file.
//! Other formats (such as direct in-memory representations) may be simply implemented as helper functions that
//! process data and return any intermediate value required.
class DomBackend
{
    //! Attempt to read this format from the given file into the target VisitorInterface.
    bool ReadFromPath(AZ::IO::PathView pathName, VisitorInterface* visitor);
    //! Attempt to read this format from the given stream into the target VisitorInterface.
    virtual bool ReadFromBuffer(const char* buffer, size_t length, bool copyStrings) = 0;
    //! Acquire a visitor interface for writing to the target output file.
    AZStd::unique_ptr<VisitorInterface> CreatePathWriter(AZ::IO::PathView pathName);
    //! Acquire a visitor interface for writing to the target output buffer.
    virtual AZStd::unique_ptr<VisitorInterface> CreatePathWriter(AZStd::vector<char>& outputBuffer) = 0;

    enum class TypeSupportLevel
    {
        //! Unsupported types shall be rejected by this backend.
        Unsupported,
        //! Native types can be serialized losslessly into this backend format.
        Native,
        //! Emulated types can be serialized into this backend format but they may be transformed
        //! into an intermediate representation first.
        Emulated,
    }

    //! Returns the extent to which this backend format supports a given type.
    //! Unsupported types may not be used and shall not be emitted from this backend.
    virtual TypeSupportLevel GetTypeSupportLevel(AZ::DOM::Type type) = 0;
};

//! Central interface for registering and looking up backend types by file extension.
class BackendRegistry
{
public:
    //! Registers a factory for a given backend.
    template <class TBackend>
    void RegisterBackend(AZ::Name name, AZStd::vector<AZStd::string> extensions);

    //! Looks up a DOM backend based on its name.
    DomBackend* GetBackendByName(AZ::Name name);

    //! Looks up a DOM backend based on a file extension.
    DomBackend* GetBackendForExtension(AZStd::string_view extension);

private:
    // <internal registration helpers here>
};

The Value and Document API is meant to be as close as possible to a drop-in replacement for rapidjson, with a couple important extnesions:

Value API:

// Currently we're storing all keys as string types
// For JSON-style data, it may be desirable to also support numeric indexing
using KeyType = AZ::Name;

// Type entries match rapidjson, excepting kNodeType and kOpaqueType
enum class Type
{
    kNullType = 0,
    kFalseType = 1,
    kTrueType = 2,
    kObjectType = 3,
    kArrayType = 4,
    kStringType = 5,
    kNumberType = 6,
    kNodeType = 7,
    kOpaqueType = 8,
};

class Value
{
    using string = AZStd::string;
    using string_view = AZStd::string_view;

public:
    // Constructors...
    explicit Value();
    explicit Value(const Value&);
    explicit Value(Value&&) noexcept;
    Value(string_view string, bool copy=false);

    explicit Value(int32_t value);
    explicit Value(int64_t value);
    explicit Value(uint32_t value);
    explicit Value(uint64_t value);
    explicit Value(double value);
    explicit Value(bool value);

    static Value FromOpaqueValue(const AZStd::any& value);

    // Equality / comparison / swap...
    Value& operator=(const Value&);
    Value& operator=(Value&&) noexcept;

    bool operator==(const Value& rhs) const;
    bool operator!=(const Value& rhs) const;

    void Swap(Value& other) noexcept;

    // Type info...
    Type GetType() const;
    bool IsNull() const;
    bool IsFalse() const;
    bool IsTrue() const;
    bool IsBool() const;
    bool IsNode() const;
    bool IsObject() const;
    bool IsArray() const;
    bool IsOpaqueValue() const;
    bool IsNumber();
    bool IsInt() const;
    bool IsUint() const;
    bool IsDouble() const;
    bool IsString() const;

    // Object API (also used by Node)...
    Value& SetObject();
    size_t MemberCount() const;
    size_t MemberCapacity() const;
    bool ObjectEmpty() const;

    Value& operator[](KeyType name);
    const Value& operator[](KeyType name) const;
    Value& operator[](string_view name);
    const Value& operator[](string_view name) const;

    ConstMemberIterator MemberBegin() const;
    ConstMemberIterator MemberEnd() const;
    MemberIterator MemberBegin();
    MemberIterator MemberEnd();

    ConstMemberIterator FindMember(KeyType name) const;
    ConstMemberIterator FindMember(string_view name) const;
    MemberIterator FindMember(KeyType name);
    MemberIterator FindMember(string_view name);

    Value& MemberReserve(size_t newCapacity);
    bool HasMember(KeyType name);
    bool HasMember(string_view name);

    Value& AddMember(KeyType name, const Value& value);
    Value& AddMember(string_view name, const Value& value);
    Value& AddMember(KeyType name, Value&& value);
    Value& AddMember(string_view name, Value&& value);

    void RemoveAllMembers();
    void RemoveMember(KeyType name);
    void RemoveMember(string_view name);
    MemberIterator RemoveMember(MemberIterator pos);
    MemberIterator EraseMember(ConstMemberIterator pos);
    MemberIterator EraseMember(ConstMemberIterator first, ConstMemberIterator last);
    MemberIterator EraseMember(KeyType name);
    MemberIterator EraseMember(string_view name);

    // Array API (also used by Node)...
    Value& SetArray();

    size_t Size() const;
    size_t Capacity() const;
    bool Empty() const;
    void Clear();

    Value& operator[](size_t index);
    const Value& operator[](size_t index) const;

    ConstValueIterator Begin() const;
    ConstValueIterator End() const;
    ValueIterator Begin();
    ValueIterator End();

    Value& Reserve(size_t newCapacity);
    Value& PushBack(Value& value);
    Value& PushBack(Value&& value);
    Value& PopBack();

    ValueIterator Erase(ConstValueIterator pos);
    ValueIterator Erase(ConstValueIterator first, ConstValueIterator last);

    Array GetArray();
    const Array GetArray() const;

    // Node API (supports both object + array API, plus a dedicated NodeName)...
    bool CanConvertToNodeFromObject() const;
    Value& ConvertToNodeFromObject();
    Value& ConvertToObjectFromNode();

    void SetNode(AZ::Name name);
    void SetNode(string_view name);

    AZ::Name GetNodeName() const;
    void SetNodeName(AZ::Name name);
    void SetNodeName(string_view name);
    //! Convenience method, sets the first non-node element of a Node.
    void SetNodeValue(Value value);
    //! Convenience method, gets the first non-node element of a Node.
    Value GetNodeValue() const;

    // int API...
    int64_t GetInt() const;
    void SetInt(int64_t);

    // uint API...
    uint64_t GetUint() const;
    void SetUint(uint64_t);

    // bool API...
    bool GetBool() const;
    void SetBool(bool);

    // double API...
    double GetDouble() const;
    void SetDouble(double);

    // string API...
    string_view GetString() const;
    size_t GetStringLength() const;
    void SetString(string_view);
    void CopyFromString(string_view);

    // opaque type API...
    const AZStd::any& GetOpaqueValue()) const;
    //! This sets this Value to represent a value of an type that the DOM has
    //! no formal knowledge of. Where possible, it should be preferred to
    //! serialize an opaque type into a DOM value instead, as serializers
    //! and other systems will have no means of dealing with fully arbitrary
    //! values.
    void SetOpaqueValue(const AZStd::any&);

    // null API...
    void SetNull();

private:
    // There are a couple options for internal storage:
    // If heap allocating for all extra storage types (node/object/array/any)
    // this can be constrained to 24 bytes and use copy on write to make copy/compare
    // very cheap.

    // Alternatively we can expand to sizeof(AZStd::any) which is a whopping 128 bytes at the
    // time this was written and avoid the indirection / thraed safety issues of copy-on-write.
    // Short string optimization can be implemented with either approach.

    // If using the the copy on write model, anything stored internally as a shared_ptr will
    // detach and copy when doing a mutating operation if use_count() > 1.

    // This internal storage will not have a 1:1 mapping to the public Type, as there may be
    // multiple storage options (e.g. strings being stored as non-owning string_view or
    // owning shared_ptr<string>)
    using ValueType = AZStd::variant<
        AZStd::monostate,
        int32_t,
        int64_t,
        uint32_t,
        uint64_t,
        float,
        double,
        bool,
        string_view,
        AZStd::shared_ptr<string>,
        AZStd::shared_ptr<NodeType>,
        AZStd::shared_ptr<ObjectType>,
        AZStd::shared_ptr<ArrayType>,
        AZStd::shared_ptr<AZStd::any>
    >;
};

The GenericDocument type shall inherit from Value and provide an allocator instance that can be passed through as needed.

The Path type shall be API-identical to rapidjson, but respect the underlying value of KeyType (AZ::Name).

The Patch API will provide the ability to generate a set of patches in the JSON-patch format by doing a hierarchical comparison between two DOM values and generating a set of GenericPatchOperations to apply to acquire the new state from the previous state. The initial algorithm will not be isomorphic, i.e. it will not detect objects or nodes being reparented within the hierarchy and reduce them to Move operations, in the interest of keeping its initial implementation straightforward and bounded to linear time.

The patch API can be used as a basis to implement the Command Pattern for undo/redo support. Patches may be generated by the GenericPatch::GenerateDeltaPatch API or by manually creating a sequence of GenericPatchOperations.

Patch API:

struct GenericPatchOperation
{
    enum class Type
    {
        Add,
        Remove,
        Replace,
        Copy,
        Move,
        Test
    };

    explicit GenericPathOperation();
    explicit GenericPatchOperation(Path, Type, Value);

    Path m_domPath;
    Type m_operation;
    Value m_value;

    bool Apply(Value& rootElement);
};

struct GenericPatchInfo;

struct GenericPatch
{
    AZStd::vector<GenericPatchOperation> m_operations;

    bool Apply(Value& rootElement);
    Value GetDomRepresentation() const;
    GenericPatch CreateFromDomRepresentation(Value);
};

//! Generates a set of patches such that m_forwardPatches.Apply(beforeState) shall
//! produce a document equivalent to afterState, and a subsequent m_inversePatches.Apply(beforeState)
//! shall produce the original document.
//! This patch generation strategy does a hierarchical comparison and is not
//! guaranteed to create the minimal set of patches required to transform
//! between the two states.
PatchInfo GenerateHierarchicalDeltaPatch(const Value& beforeState, const Value& afterState);

//! A set of patches for applying a change and doing the inverse operation (i.e. undoing it).
struct GenericPatchInfo
{
    GenericPatch m_forwardPatches;
    GenericPatch m_inversePatches;
};

JsonBackend shall use rapidjson's SAX interface to support efficient conversion to and from JSON without intermediate allocations. Serialization can be done recursively using a rapidjson::Writer or rapidjson::PrettyWriter while deserialization can be done with a rapidjson::BaseReaderHandler instance and a rapidjson::Reader, both of which shall forward in a straightforward manner to the visitor API.

XmlBackend would benefit from bringing in an XML library that supports streaming. Presently, O3DE ships with rapidxml, which strictly works by allocating an in-memory XML representation that needs to be translated; this worked in prototyping, but may introduce a performance burden. Expat is a streaming XML library written in C with an MIT license that appears to be suitable, but it has not yet been formally evaluated.

What metrics will need to be collected for this feature?

As several systems will be built on top of the Generic DOM, safeguards shall be in place to validate performance for highlighted use cases. These shall include, but not be limited to, the following:

What are the advantages of the feature?

What are the disadvantages of the feature?

How will this be implemented or integrated into the O3DE environment?

The implementation shall require an up-front API implementation, followed by targeted system expansion as desired.

Are there any alternatives to this feature?

Monolithically aligning on using exlucisely JSON serialization is also an option. Prefabs are built heavily around rapidjson already, and JSON is a generally robust format that can be used to serializable arbitrary DOMs. However, this locks us into exclusively using JSON, preventing us from using XML, YAML, or other formats should we wish to use them in the future e.g. storing user settings in YAML to make hand-editing easier. This also precludes convenience features by way of the Node and Any types.

How will users learn this feature?

This is an API-level change that is intended to be minimally disruptive. If and when O3DE's JSON serialization gets replaced by the Generic DOM API, it should just entail replacing rapidjson::Value usage with a corresponding AzCore::DOM::Value.

Are there any open questions?

alexmontAmazon commented 3 years ago

This RFC has been accepted by sig-content and is now in its final 5 day comment period. If there are no open issues or discussion at that point, the comment period will close on 9/28/21.