impresso / impresso-pycommons

Python module with bits of code (objects, functions) highly reusable within impresso.
http://impresso-pycommons.rtfd.io/
GNU Affero General Public License v3.0

Serialization of rebuilt in UIMA format #28

Closed mromanello closed 5 years ago

mromanello commented 5 years ago

See ne-annotation::issue #3.

Added a basic utility module (impresso_commons.utils.uima) with functions to export from our rebuilt format to Apache UIMA XMI.

Also added a class RebuildDocument to impresso_commons.text as an abstraction layer on top of our JSON serialization.

impresso_commons/data/xmi contains some examples of content items serialized to INCEpTION's format.

e-maud commented 5 years ago

@simon-clematide We discussed with Matteo and here are a few updates that could be integrated in the pull request.

I would vote for (b).

#!/usr/bin/env python
# coding: utf-8 

import datetime
import json
from enum import Enum


class ContentItemCase(Enum):
    FULL = "FULL"  # all info
    TEXT = "TEXT"  # min info + text
    LIGHT = "LIGHT"  # min info


class ContentItem:
    """
    Class which represents an impresso (rebuilt) content item.
    TODO: complement
    :ivar str id: canonical content item id
    :ivar str lg:
    :ivar str type:
    :ivar datetime.date date:
    :ivar str journal:
    :ivar str s3v:
    :ivar str fulltext:
    :ivar dict text_offsets: pages/regions/paragraphs/lines
    """

    def __init__(self, ci_id, lg, tp):
        """Constructor"""
        self.id = ci_id
        self.lg = lg
        self.type = tp
        self.date = self.build_date(ci_id)
        self.journal = self.build_journal(ci_id)
        # defaults, so that properties and __str__ work on a bare instance
        self.case = ContentItemCase.LIGHT
        self._title = None
        self._fulltext = None
        self._text_offsets = {}

    @staticmethod
    def build_date(ci_id):
        # the id is expected to start with <journal>-<year>-<month>-<day>
        tmp = ci_id.split("-")
        return datetime.date(int(tmp[1]), int(tmp[2]), int(tmp[3]))

    @staticmethod
    def build_journal(ci_id):
        return ci_id.split("-")[0]

    @property
    def title(self):
        return self._title

    @title.setter
    def title(self, value):
        # write to the backing attribute; assigning to self.title would recurse
        self._title = value

    @property
    def lines(self):
        return self._text_offsets.get("lines")

    @lines.setter
    def lines(self, value):
        self._text_offsets["lines"] = value

    @property
    def paragraphs(self):
        return self._text_offsets.get("paragraphs")

    @paragraphs.setter
    def paragraphs(self, value):
        self._text_offsets["paragraphs"] = value

    @property
    def pages(self):
        return self._text_offsets.get("pages")

    @pages.setter
    def pages(self, value):
        self._text_offsets["pages"] = value

    @property
    def regions(self):
        return self._text_offsets.get("regions")

    @regions.setter
    def regions(self, value):
        self._text_offsets["regions"] = value

    @property
    def fulltext(self):
        return self._fulltext

    @fulltext.setter
    def fulltext(self, value):
        self._fulltext = value

    @staticmethod
    def from_json(path=None, data=None, case=ContentItemCase.LIGHT):
        """Loads an instance of `ContentItem` from a JSON file or a dict.
        :param str path: path to a json file
        :param dict data: content item information
        :param enum case: content item configuration via `ContentItemCase` (LIGHT/TEXT/FULL)
        """
        if data is None and path is None:
            raise ValueError("either `path` or `data` must be provided")
        if data is None:
            # assumes the file holds a single content item as a JSON object
            with open(path, encoding="utf-8") as f:
                data = json.load(f)

        doc = ContentItem(data['id'], data['lg'], data['tp'])
        doc.case = case

        if case in (ContentItemCase.TEXT, ContentItemCase.FULL):
            doc.title = data.get('t')
            doc.fulltext = data.get('ft')

        if case == ContentItemCase.FULL:
            # go through the setters so the offsets end up in _text_offsets
            doc.lines = data.get('lb')
            doc.paragraphs = data.get('pb')
            doc.regions = data.get('rb')
            doc.pages = data.get('ppreb')

        return doc

    def __str__(self):
        s = (
            f'{self.__class__.__name__}:\n\t'
            f'ci_case={self.case}\n\t'
            f'ci_id={self.id}\n\t'
            f'ci_lg={self.lg}\n\t'
            f'ci_type={self.type}\n\t'
            f'ci_date={self.date}\n\t'
        )

        if self.case in (ContentItemCase.TEXT, ContentItemCase.FULL):
            s += (
                f'ci_fulltext={self.fulltext}\n\t'
                f'ci_title={self.title}\n\t'
            )

        return s

simon-clematide commented 5 years ago

For me, it is not very important to have all kinds of setters and getters for simple data properties of the underlying JSON, especially if that means creating an additional object out of the JSON.

I really like the simplicity and directness of the JSON representation.

I would rather have something like a "ContentItemManipulator" that exposes methods that directly and transparently manipulate or traverse the underlying JSON via convenience abstraction (without building a new object with all JSON properties replicated). The original JSON would simply be an instance variable of the manipulator class.
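To make the idea concrete, here is a minimal sketch of such a wrapper. Everything in it is hypothetical: the class body, the method names, and the sample id are my assumptions for illustration, and only the JSON keys (`id`, `lg`, `tp`, `t`) are taken from the code posted above.

```python
class ContentItemManipulator:
    """Hypothetical wrapper: holds the original JSON dict and works on it directly."""

    def __init__(self, data):
        # the original JSON is kept as-is (no copy, no replicated attributes)
        self.json = data

    @property
    def title(self):
        # read straight from the underlying JSON
        return self.json.get("t")

    @title.setter
    def title(self, value):
        # write straight through to the underlying JSON
        self.json["t"] = value

    @property
    def journal(self):
        # derived on demand from the canonical id, never stored separately
        return self.json["id"].split("-")[0]


# sample data with a made-up id, for illustration only
data = {"id": "GDL-1900-01-02-a-i0001", "lg": "fr", "tp": "ar", "t": "Old title"}
item = ContentItemManipulator(data)
item.title = "New title"
```

Because the manipulator never copies anything, the change made through `item.title` is visible in `data` itself, so the dict can be serialized back to JSON without a separate export step.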

mromanello commented 5 years ago

I think we have the same thing in mind, we just call it differently. When creating the class I was thinking of exactly what you describe:

"something that exposes methods that directly and transparently manipulate or traverse the underlying JSON via convenience abstraction"

For now it just mirrors the properties in the JSON, but it will eventually contain more abstract methods that do not necessarily have a 1:1 correspondence with the JSON representation.
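As an illustration of a method without a 1:1 counterpart in the JSON, one could derive the text of each line from the fulltext and the line offsets. This is only a sketch: the helper name `split_by_offsets` is made up, and it assumes that each offset marks the character position where a segment starts, which may not match the actual semantics of our offsets.

```python
def split_by_offsets(fulltext, offsets):
    """Cut `fulltext` into segments, one per offset.

    Assumption (not confirmed by the rebuilt schema): each offset is the
    character position at which a segment starts, in ascending order.
    """
    bounds = list(offsets) + [len(fulltext)]
    # pair each start with the next start (or the end of the text)
    return [fulltext[start:end] for start, end in zip(bounds, bounds[1:])]


# toy input, for illustration only
segments = split_by_offsets("abc def ghi", [0, 4, 8])
```

The same helper would work for paragraphs or regions, since it only depends on the fulltext and a list of offsets.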

If ContentItemManipulator is a more telling name for such a thing, I'm open to any name in principle.