Refactor serialiser/deserializer interface

Problem

The current serializer interface has a few issues with unclear responsibilities and naming:

- JSONSerializer and MarshmallowSerializer, for instance, both inherit from BaseSerializer, yet JSONSerializer is also injected as a dependency, so the two play different roles despite sharing a base class.
- Some serializers output bytes, others str. Callers therefore cannot know consistently whether they have to call encode/decode, which in turn means serializers/deserializers can't easily be interchanged.
- The deserializer still uses mixins; it wasn't updated when the serializer mixin was changed to a base class.

Design

Serializer

A serializer, at a conceptual level, takes a result item as input and outputs one of 1) bytes, 2) str or 3) etree. A deserializer should conceptually do the reverse.

A serializer is composed of:

- A transformation step: takes a result item dict as input and outputs a transformed result item dict
- A formatting step: takes a transformed result item dict as input and outputs one of the three types above (bytes, str or etree)

class BaseSerializer:
    def __init__(self, formatter=...):
        self.formatter = formatter

    def serialize_bytes(self, item):
        return self.formatter.to_bytes(self.dump_obj(item))

    def serialize_bytes_list(self, item_list):
        return self.formatter.to_bytes_list(self.dump_obj_list(item_list))

    def serialize_str(self, item):
        return self.formatter.to_str(self.dump_obj(item))

    def serialize_str_list(self, item_list):
        return self.formatter.to_str_list(self.dump_obj_list(item_list))

    def serialize_etree(self, item):
        return self.formatter.to_etree(self.dump_obj(item))

    def serialize_etree_list(self, item_list):
        return self.formatter.to_etree_list(self.dump_obj_list(item_list))

    def dump_obj(self, item):
        # Same as MarshmallowSerializer today
        ...

    def dump_obj_list(self, item_list):
        # Same as MarshmallowSerializer today
        ...
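
As a rough illustration of the two steps, the sketch below wires a serializer that does its own transformation to a toy formatter. `SketchSerializer` and `ToyJSONFormatter` are hypothetical names invented for this example, and a plain dict manipulation stands in for the Marshmallow dump; it is not the proposed implementation.

```python
import json


class ToyJSONFormatter:
    """Hypothetical formatter: turns a transformed dict into str or bytes."""

    def to_str(self, dumped_item):
        return json.dumps(dumped_item, sort_keys=True)

    def to_bytes(self, dumped_item):
        # bytes are always derived from the str form, so both stay consistent
        return self.to_str(dumped_item).encode("utf8")


class SketchSerializer:
    """Step 1 (dump_obj) happens here, step 2 is delegated to the formatter."""

    def __init__(self, formatter):
        self.formatter = formatter

    def dump_obj(self, item):
        # Stand-in for a Marshmallow schema dump: keep only the title.
        return {"title": item["metadata"]["title"]}

    def serialize_str(self, item):
        return self.formatter.to_str(self.dump_obj(item))

    def serialize_bytes(self, item):
        return self.formatter.to_bytes(self.dump_obj(item))


serializer = SketchSerializer(formatter=ToyJSONFormatter())
item = {"id": "abc", "metadata": {"title": "A record"}}
as_str = serializer.serialize_str(item)
as_bytes = serializer.serialize_bytes(item)
```

Because the caller picks `serialize_str` or `serialize_bytes` explicitly, no encode/decode call is needed outside the serializer.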

A deserializer should conceptually do the same in reverse:

class BaseDeserializer:
    def deserialize_bytes(self, data_bytes):
        ...

    def deserialize_bytes_list(self, data_bytes):
        ...

    def deserialize_str(self, data_str):
        ...

    def deserialize_str_list(self, data_str):
        ...

    def deserialize_etree(self, data_etree):
        # Skip this one for now as I don't think we use it
        raise NotImplementedError

    def deserialize_etree_list(self, data_etree):
        ...

    def load_obj(self, dumped_obj):
        ...

    def load_obj_list(self, dumped_obj_list):
        ...
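
A minimal sketch of the reverse direction, assuming JSON input. `SketchDeserializer` is a hypothetical name, `json.loads` plays the role of the reading step, and a hand-written `load_obj` stands in for a Marshmallow schema load.

```python
import json


class SketchDeserializer:
    """Reads the data first, then loads the resulting dict into an item."""

    def deserialize_bytes(self, data_bytes):
        # bytes are decoded once here, so callers never call decode() themselves
        return self.deserialize_str(data_bytes.decode("utf8"))

    def deserialize_str(self, data_str):
        return self.load_obj(json.loads(data_str))

    def deserialize_etree(self, data_etree):
        # Not supported in this sketch, mirroring the note in the proposal.
        raise NotImplementedError

    def load_obj(self, dumped_obj):
        # Stand-in for a Marshmallow schema load: wrap fields under "metadata".
        return {"metadata": {"title": dumped_obj["title"]}}


deserializer = SketchDeserializer()
from_str = deserializer.deserialize_str('{"title": "A record"}')
from_bytes = deserializer.deserialize_bytes(b'{"title": "A record"}')
```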

Transformer

The serializer could for now do the transform step (step 1) itself. If we need something other than Marshmallow (e.g. dojson) for the transformation, we could consider extracting it into a dedicated object, similar to the formatter below. Conceptually, the transformer consists of the two methods dump_obj/load_obj.
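
If the transformation were extracted, it might look like the sketch below. `CallableTransformer` and `ComposedSerializer` are hypothetical names, and plain functions stand in for Marshmallow or dojson; the only point is that `dump_obj`/`load_obj` live on an injected object.

```python
import json


class CallableTransformer:
    """Hypothetical transformer: owns the dump_obj/load_obj pair."""

    def __init__(self, dump, load):
        self._dump = dump
        self._load = load

    def dump_obj(self, item):
        return self._dump(item)

    def dump_obj_list(self, item_list):
        return [self._dump(item) for item in item_list]

    def load_obj(self, dumped_obj):
        return self._load(dumped_obj)


class ComposedSerializer:
    """Delegates step 1 to a transformer and step 2 to a formatting callable."""

    def __init__(self, transformer, to_str):
        self.transformer = transformer
        self.to_str = to_str  # stand-in for a formatter's to_str method

    def serialize_str(self, item):
        return self.to_str(self.transformer.dump_obj(item))


transformer = CallableTransformer(
    dump=lambda item: {"title": item["metadata"]["title"]},
    load=lambda dumped: {"metadata": {"title": dumped["title"]}},
)
serializer = ComposedSerializer(transformer, to_str=json.dumps)
item = {"metadata": {"title": "A record"}}
```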

Formatter

A serializer should delegate the formatting step to a dedicated object responsible for step 2. Formatters take the transformed result item and output bytes, str or etree (where the formatter supports it).

class Formatter:
    def to_bytes(self, dumped_item):
        raise NotImplementedError

    def to_bytes_list(self, dumped_item_list):
        raise NotImplementedError

    def to_str(self, dumped_item):
        raise NotImplementedError

    def to_str_list(self, dumped_item_list):
        raise NotImplementedError

    def to_etree(self, dumped_item):
        raise NotImplementedError

    def to_etree_list(self, dumped_item_list):
        raise NotImplementedError

Then we should likely have two formatters:

import json

class JSONFormatter:
    def _encoder(self):
        ...

    def to_bytes(self, dumped_item):
        return self.to_str(dumped_item).encode('utf8')

    def to_bytes_list(self, dumped_item_list):
        ...

    def to_str(self, dumped_item):
        # _encoder() is assumed to return a json.JSONEncoder subclass;
        # further json.dumps options are elided here
        return json.dumps(dumped_item, cls=self._encoder())

    def to_str_list(self, dumped_item_list):
        raise NotImplementedError

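from lxml import etree

class LXMLFormatter:
    def __init__(self, dump_etree=..., etree_options={...}):
        self.dump_etree = dump_etree
        self.etree_options = etree_options

    def to_etree(self, dumped_item):
        return self.dump_etree(dumped_item)

    def to_bytes(self, dumped_item):
        # etree.tostring returns bytes for a byte encoding such as 'utf-8'
        return etree.tostring(self.to_etree(dumped_item), **self.etree_options)

    def to_bytes_list(self, dumped_item_list):
        ...

    def to_str(self, dumped_item):
        # Assumes etree_options uses a UTF-8 compatible byte encoding
        return self.to_bytes(dumped_item).decode('utf8')

    def to_str_list(self, dumped_item_list):
        raise NotImplementedError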
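
To make the relationship between the three XML outputs concrete, here is a sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so that it runs without extra dependencies. `SketchXMLFormatter` and `dump_title_etree` are hypothetical; a real `LXMLFormatter` would presumably call `lxml.etree.tostring` with its `etree_options` instead.

```python
import xml.etree.ElementTree as ET


def dump_title_etree(dumped_item):
    """Hypothetical dump_etree callable: builds <record><title>..</title></record>."""
    root = ET.Element("record")
    ET.SubElement(root, "title").text = dumped_item["title"]
    return root


class SketchXMLFormatter:
    def __init__(self, dump_etree, encoding="utf-8"):
        self.dump_etree = dump_etree
        self.encoding = encoding

    def to_etree(self, dumped_item):
        return self.dump_etree(dumped_item)

    def to_bytes(self, dumped_item):
        # With a byte encoding, tostring() returns bytes.
        return ET.tostring(self.to_etree(dumped_item), encoding=self.encoding)

    def to_str(self, dumped_item):
        # str is derived from the bytes form, so the two cannot drift apart.
        return self.to_bytes(dumped_item).decode(self.encoding)


formatter = SketchXMLFormatter(dump_etree=dump_title_etree)
dumped = {"title": "A record"}
```

Deriving `to_str` from `to_bytes` (or the other way round, as in the JSON case) is one way to keep the outputs consistent regardless of which method a caller picks.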

Usage

You can now compose new serializers:

datacite43json = Serializer(
    object_schema_cls=DataCite43Schema,
    list_schema_cls=BaseListSchema,
    formatter=JSONFormatter()
)
datacite43xml = Serializer(
    object_schema_cls=DataCite43Schema,
    list_schema_cls=None,
    formatter=LXMLFormatter(
        dump_etree=schema43.dump_etree, 
        etree_options=dict(pretty_print=True, xml_declaration=True, encoding='utf-8')
    )
)
dublincorexml = Serializer(
    object_schema_cls=DublinCoreSchema,
    list_schema_cls=None,
    formatter=LXMLFormatter(
        dump_etree=simpledc.dump_etree, 
    )
)
dcatxml = Serializer(
    object_schema_cls=DataCite43Schema,
    list_schema_cls=None,
    formatter=LXMLFormatter(
        dump_etree=apply_xslt(
            schema43.dump_etree,
            "invenio_rdm_records.resources.serializers",
            "dcat/datacite-to-dcat-ap.xsl"
        )
    )
)
from dojson.contrib.to_marc21.utils import dumps_etree
marcxml = Serializer(
    object_schema_cls=MARCXMLSchema,
    list_schema_cls=None,
    formatter=LXMLFormatter(
        dump_etree=dumps_etree
    )
)
bibtex = Serializer(
    object_schema_cls=BibTeXSchema,
    list_schema_cls=None,
    formatter=BibTeXFormatter()
)
geojson = Serializer(
    object_schema_cls=GeoJSONSchema,
    formatter=JSONFormatter()
)
csljson = Serializer(
    object_schema_cls=CSLJSONSchema,
    formatter=JSONFormatter()
)
citationstr = Serializer(
    object_schema_cls=CSLJSONSchema,
    formatter=CitationStringFormatter(url_args_retriever=...)
)

You can now compose new deserializers as well:

rocrate = Deserializer(
    schema=ROCrateSchema,
)

Additional issues

[ ] XML formats with list results. Only MARCXML can produce an XML list result; most other formats cannot. DataCite XML, for instance, does not have an XML schema to validate list results (OAI DataCite XML might have a schema for list results).
[ ] Documentation: https://inveniordm.docs.cern.ch/develop/topics/serializers/ is out of date and mixes old and new content. As a result of this task, the documentation should be updated so it's consistent with the code.
[ ] Fix usage of serializers in the OAIServer. E.g. they can now directly dump an etree if the formatter supports it: return DublinCoreXMLSerializer.serialize_etree(item)
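
For the OAIServer item, a caller could prefer the etree output and fall back to parsing bytes when a formatter does not support it. The helper below is a hypothetical sketch using the standard library's `xml.etree.ElementTree` in place of lxml; `item_to_etree`, `EtreeSerializer` and `BytesOnlySerializer` are invented for illustration only.

```python
import xml.etree.ElementTree as ET


def item_to_etree(serializer, item):
    """Return an element for item, preferring serialize_etree."""
    try:
        return serializer.serialize_etree(item)
    except NotImplementedError:
        # Fallback: the formatter only produces bytes, so parse them again.
        return ET.fromstring(serializer.serialize_bytes(item))


class EtreeSerializer:
    """Stand-in for a serializer whose formatter supports etree output."""

    def serialize_etree(self, item):
        root = ET.Element("record")
        root.text = item["title"]
        return root


class BytesOnlySerializer:
    """Stand-in for a serializer whose formatter cannot produce an etree."""

    def serialize_etree(self, item):
        raise NotImplementedError

    def serialize_bytes(self, item):
        return ("<record>%s</record>" % item["title"]).encode("utf8")


item = {"title": "A record"}
direct = item_to_etree(EtreeSerializer(), item)
parsed = item_to_etree(BytesOnlySerializer(), item)
```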

Unresolved questions

Perhaps Formatter should be renamed to Writer so that the serializer and deserializer have consistent naming:
- Serialize: Item -> Dumper (Marshmallow) -> Writer (XML, JSON, BibTeX, ...) -> Data
- Deserialize: Data -> Reader (XML, JSON, BibTeX, ...) -> Loader (Marshmallow) -> Item

inveniosoftware / flask-resources

Refactor serialiser/deserializer interface #117
