Closed mromanello closed 5 years ago
@simon-clematide We discussed with Matteo and here are a few updates that could be integrated in the pull request.
I would vote for (b).
RebuiltDocument
to ContentItem
.
Below a class for ContentItem which integrates the RebuiltDocument of the present request:
Main characteristics:
from_json
which allow to load more or less information (LIGHT/TEXT/FULL) in order to serve different needs/usage of the class, without necessarily carry on unnecessarily heavy objects.
This configuration can be refined.#!/usr/bin/env python
# coding: utf-8
import datetime
from enum import Enum
class ContentItemCase(Enum):
FULL = "FULL" # all info
TEXT = "TEXT" # min info + text
LIGHT = "LIGHT" # min info
class ContentItem:
"""
Class which represents an impresso (rebuilt) content item.
TODO: complement
:ivar str id: canonical content item id
:ivar str lg:
:ivar str type:
:ivar datetime date:
:ivar str journal:
:ivar str s3v:
:ivar str fulltext:
:ivar dict text_offsets: pages/regions/paragraphs/lines
"""
def __init__(self, ci_id, lg, tp):
"""Constructor"""
self.id = ci_id
self.lg = lg
self.type = tp
self.date = self.build_date(ci_id)
self.journal = self.build_journal(ci_id)
self._text_offsets = {}
@staticmethod
def build_date(ci_id):
tmp = ci_id.split("-")
return datetime.date(int(tmp[1]), int(tmp[2]), int(tmp[3]))
@staticmethod
def build_journal(ci_id):
return ci_id.split("-")[0]
@property
def title(self):
return self.__title
@title.setter
def title(self, value):
self.title = value
@property
def lines(self):
return self.__text_offsets["lines"]
@lines.setter
def lines(self, value):
self.text_offsets["lines"] = value
@property
def paragraphs(self):
return self.__text_offsets["paragraphs"]
@paragraphs.setter
def paragraphs(self, value):
self.text_offsets["paragraphs"] = value
@property
def pages(self):
return self.__text_offsets["pages"]
@pages.setter
def pages(self, value):
self.text_offsets["pages"] = value
@property
def regions(self):
return self.__text_offsets["regions"]
@regions.setter
def pages(self, value):
self.text_offsets["regions"] = value
@property
def fulltext(self):
return self.__fulltext
@fulltext.setter
def fulltext(self, value):
self.fulltext = value
@staticmethod
def from_json(path=None, data=None, case=ContentItemCase.LIGHT):
"""Loads an instance of `ContentItem` from a JSON file.
:param str path: path to a json file
:param dict data: content item information
:param enum case: content item configuration via `ContentItemCase` (LIGHT/TEXT/FULL)
"""
assert data is not None or path is not None
if data is not None:
doc = ContentItem(data['id'], data['lg'], data['tp'])
doc.case = case
if case == ContentItemCase.TEXT or case == ContentItemCase.FULL:
doc.__title = data['t'] if 't' in data else None
doc.__fulltext = data['ft'] if 'ft' in data else None
if case == ContentItemCase.FULL:
doc.__lines = data['lb'] if 'lb' in data else None
doc.__paragraphs = data['pb'] if 'pb' in data else None
doc.__regions = data['rb'] if 'pb' in data else None
doc.__pages = data['ppreb'] if 'ppreb' in data else None
return doc
elif path is not None:
return
def __str__(self):
s = f'{self.__class__.__name__}:\n\t' \
f'ci_case={self.case}\n\t' \
f'ci_id={self.id}\n\t' \
f'ci_lg={self.lg}\n\t' \
f'ci_type={self.type}\n\t' \
f'ci_date={self.date}\n\t' \
if self.case == ContentItemCase.TEXT \
or self.case == ContentItemCase.FULL:
s = s + f'ci_fulltext={self.fulltext}\n\t' \
f'ci_title={self.title}\n\t' \
return s
For me, it would not be very important to have all kind of setters and getters for manipulating simple data properties of the underlying JSON by creating an additional object out of the JSON.
I really like the simplicity and directness of the JSON representation.
I would rather have something like a "ContentItemManipulator" that exposes methods that directly and transparently manipulate or traverse the underlying JSON via convenience abstraction (without building a new object with all JSON properties replicated). The original json would just be an instance variable of the manipulator class.
I think we have in mind the same thing, just call it differently. When creating the class I was thinking exactly of what you describe
something that exposes methods that directly and transparently manipulate or traverse the underlying JSON via convenience abstraction
for now it mirrors just the properties in the JSON, but it will eventually contain more abstract method, that do not necessarily have 1:1 correspondence with the JSON representation.
If ContentItemManipulator
is a more telling name for such a thing, I'm open to any name in principle.
See ne-annotation::issue #3.
Added a basic utility module (
impresso_commons.utils.uima
) with functions to export from our rebuilt format to Apache UIMA XMI.Added also a class
RebuildDocument
toimpresso_commons.text
as an abstraction layer on top of our JSON serialization.impresso_commons/data/xmi
contains some examples of content items serialized to INCEpTION's format.