RFC formatter: support only jinja formatting + heavy API refactoring (principles)

lnielsen commented 9 years ago

While this is not the most urgent issue, I'd like a general opinion on if this is the way to go in the future. The RFC would have to be fleshed out to be more specific, so this is just the about agreeing on some principles.

Problem: Currently we support many different formatting engines with lots of legacy code which makes maintenance hard. The current code is mixing responsibilities - e.g. formatting of a record representation, fetching of that record, caching, debug, which also makes maintenance and replacement of individual parts hard. New code written for overlays is in danger of depending too much on legacy code (using e..g existing bfe_elements), thus making transition harder in the future.

Proposal:

[x] Support only Jinja based formatting. Meaning bfe elements, XSLT transformation (which anyway requires MARCXML) and bfx will disappear.
[x] Rebuild module from scratch with a clean API. I.e. de-scope module to only deal with formatting of a record representation (e.g. format_record should not take a recid, but say a JSON representation instead). Create separate modules to deal with caching, etc.
[x] Decide on responsibilities between records and formatting module (i imagine records is incharge of transformation, while formatter takes a transformation and renders it).
[x] Existing data model with format template and output formats will be kept.
[ ] Build legacy layer that implements most important methods from old api, using the new modules. Likely perhaps even the old engines could be supported if specifically enabled. Legacy layer must be completely separate from new module.

P.S Don't shoot me :-)

Issues raised in comments:

Dealing with dynamic data.
Engines concept useful to abstract underlying technology.
Transformation between data models (MARC->DC, DC->DataCite, ...)
Need for high level API responsible for orchestration of small isolated modules.

jalavik commented 9 years ago

As we in INSPIRE are going very soon have to start the work on re-factoring/re-designing the formats on our Invenio v.2.0 site it is nice to discuss this already - to be sure we are going in the right direction.

I think the proposal looks good so far :+1:

tiborsimko commented 9 years ago

The general direction is good, especially concerning "regular" output formats. No comments there. For example, in the case of CERN Open Data Portal instance, we are already using Jinja/JSON based formatting way whenever possible.

Here are some more comments though, especially for "non-regular" outputs.

(1) More nuance is needed concerning "support only Jinja based formatting" and "removing XSLT" transformations. While our records' internal representation is JSON, the real bibliographic information of our records, or their master format, is either MARC, or Dublin Core, or EAD, or UNIMARC, or JSON, or whatever each concrete Invenio instance uses. (This depends on collections; one can have this-and-this master format for photos, and that-and-that master format for archival records, and even-yet-another master format for books.) Forcing format conversions, such as MARC-to-DublinCore or EAD-to-DublinCore, to always go via JSON, is not practical, because there are already available bibliographic conversion style sheets, say, that people can simply take and plug. See my recent comment https://github.com/inveniosoftware/invenio/issues/2075#issuecomment-69216423 in another thread with graphical representation of these two different formatting channels.

And, while I agree with the RFC for the use case of the first channel, we need to address also the need for the other channel.

In one way or another, one should be able to call some high-level API "to format this record in this format", which would sometimes spit out master format directly, sometimes spit out pre-cached format from "dissemination information store", sometimes use JSON/Jinja live to produce the wanted output, but sometimes also call XSLT or some other bibliographic library (such as pybliographer heavily used at EPFL) that will produce desired output format out of the native master format in the best possible way.

One can address my remark in several ways. E.g. the current BibFormat module contained several "engine components" that permitted to use more than one way of formatting the records to produce the wanted format. E.g. one can keep "formatter" to server only JSON-based formatting use cases, and invent a new module, say "exporter" or "disseminator", that would take care of creating/serving various wanted formats out of native ones independently. Whatever the way, we need to address and offer such a non-JSON-based formatting option in one way or another. Hence it should be elaborated as part of this RFC.

(2) Formatting pure JSON representation of a record may not be always enough when we need to show information from other records to generate the wanted output format. E.g. we may want to search and include information about 312 conference contributions records when formatting some conference proceedings record; this is fully dynamic. E.g. we may want to include the citation counts when showing references of a paper. Etc. Hence the formatter may want to use live search engine in these cases, as it would not be practical to represent all the various "linked information" as derived JSON fields for all the formatting needs. You probably thought of that, but we need to spell it out and think of what can be derived in JSON and what can be better done as live querying.

(3) Another related thing to consider is formatting more than one record, or current format_records() and our various summary formats such as citation summary. It seems obvious to separate these out (and summary formats were always separated), but it may good to muse about "collection formatting" or "summary formatting" as part of the same formatting RFC as well, perhaps. (Note the current separation of summary formats in WebSearch/BibRank, which is not ideal. Could be another component/module that uses search and rank as client, for example.)

P.S. Why is this p_critical? Note the current ticket priority rules that say:

p_blocker: highest priority, e.g. breaks home page
p_critical: higher priority, e.g. test case broken
p_major: normal priority, used for most tickets
p_minor: less priority, less visible user impact
p_trivial: lowest priority, cosmetics

lnielsen commented 9 years ago

Thanks for the detailed response and raised points :-)

P.S. Why is this p_critical?

My mistake. I've fixed it now.

Response in reverse order :-)

(3) Another related thing to consider is formatting more than one record.

Yes, I completely agree that this needs to be taken into account since it affects the output (the obvious case is MARCXML <collection>-tag).

(2) Formatting pure JSON representation of a record may not be always enough when we need to show information from other records to generate the wanted output format.

I understand the problem you are describing of including dynamic data in the formats. In my opinion the dynamic data should be computed prior to the formatting (or perhaps rather call it rendering). Hence, the input to the formatter is some representation that contains all the information needed to render the output. I.e. querying for 312 conference proceedings should be done in an intermediate step prior to rendering the output. In my opinion this step should not happen inside the formatting.

(1) More nuance is needed concerning "support only Jinja based formatting" and "removing XSLT" transformations.

Yes, it's perhaps a bit too aggressive formulation and perhaps we want to replace Jinja in the future, so keeping the concept of engines makes sense to me, as long as these engines are properly isolated from one another.

E.g. one can keep "formatter" to server only JSON-based formatting use cases, and invent a new module, say "exporter" or "disseminator", that would take care of creating/serving various wanted formats out of native ones independently.

I think I agree with this approach. Most of the cases you list I think is a transformation between different data models using existing defined transformations. Separating this out I think will likely make the code cleaner.

In one way or another, one should be able to call some high-level API "to format this record in this format", which would sometimes spit out master format directly, sometimes spit out pre-cached format from "dissemination information store"

Yes, I do agree that we need high-level APIs that orchestrates rendering/formatting of records. My main issue with the current formatter is that its orchestration is not made up of small isolated modules that only does a very well defined job. E.g. caching and formatting is very intertwined, and there's all sorts of performance hacks. This makes maintenance error prone and very hard, you'll have a hard time moving e.g. caching to redis instead of bibfmt table.

I'll update the RFC with a list of the issues raised.

jirikuncar commented 9 years ago

Dealing with dynamic data.

Only using calculated record fields.

Engines concept useful to abstract underlying technology.

Most of it can be done simply like:

Example formatter/baz.xml:

{%- cached record['_id'] -%}
{% if record.additional_info['master_format'] == 'marc' %}
{{ record.legacy_export_as_marc() | xslt(pkg_resources.get_filename('foo.bar', 'baz.xslt')) }}
{% else %}
<foo>{{ record['title.title'] }}</foo>
{% endif %}
{%- endcached -%}

jmartinm commented 9 years ago

Thanks a lot for starting the discussion @lnielsen . Good points raised so far. As @jalavik mentioned, we are going to be working on this in the following months so happy to help out.

jirikuncar commented 9 years ago

[x] deprecate bfe_ elements and bfo objects.

jirikuncar commented 9 years ago

Proposal: Output format definition in YAML/JSON/Python

The idea is to merge database table format and *.bfo files into one file per output format using some standard format/markup. See an example for detailed output format.

YAML

description: 'HTML detailed output format, used for Detailed record pages.'
content_type: text/html
mime_type: application/x-foo
visibility: 1

rules:
- "tag 980.a":
  - PICTURE: Picture_HTML_detailed.tpl
  - POETRY: Poetry_HTML_detailed.tpl

- "tag 773.t":
  - "Atlantis Times": Journal_HTML_detailed.bft

default: Default_HTML_detailed.tpl

Option 2 for rules:

rules:
- tag: 980
  value: PICTURE
  template: Picture_HTML_detailed.tpl
- field: _collections
  value: POETRY
  template: Poetry_HTML_detailed.tp

no extra parsing is required

JSON

{
  "rules": [
    {
      "tag 980.a": [
        {
          "PICTURE": "Picture_HTML_detailed.tpl"
        }, 
        {
          "POETRY": "Poetry_HTML_detailed.tpl"
        }
      ]
    }, 
    {
      "tag 773.t": [
        {
          "Atlantis Times": "Journal_HTML_detailed.bft"
        }
      ]
    }
  ], 
  "default": "Default_HTML_detailed.tpl", 
  "content_type": "text/html", 
  "description": "HTML detailed output format, used for Detailed record pages.", 
  "mime_type": "application/x-foo"
}

Python

{'default': 'Default_HTML_detailed.tpl',
 'mimetype': 'text/html',
 'rules': [{'tag 980.a': [{'PICTURE': 'Picture_HTML_detailed.tpl'},
                          {'POETRY': 'Poetry_HTML_detailed.tpl'}]},
           {'tag 773.t': [{'Atlantis Times': 'Journal_HTML_detailed.bft'}]}]}

or

default = 'Default_HTML_detailed.tpl'
mimetype = 'text/html'
rules = [{'tag 980.a': [{'PICTURE': 'Picture_HTML_detailed.tpl'},
                          {'POETRY': 'Poetry_HTML_detailed.tpl'}]},
           {'tag 773.t': [{'Atlantis Times': 'Journal_HTML_detailed.bft'}]}]

We could also add rules for fields:

rules:
- "field collections":
  - Pictures: Picture_HTML_detailed.tpl
- "field author.last_name":
  - Ellis: Ellis_HTML_detail.tpl

cc @tiborsimko @kaplun

dset0x commented 9 years ago

I have hastily reconstructed and temporarily moved a formatter that used to be in the workflows module into the formatter module in order to have some working example in the checker module. It differs from the current formatters in that it accepts records instead of recIDs. The code and comments are -edit: NOT- very clear (or correct at some points, I have to talk with @jalavik about some things), but I wonder if we wish to keep the basic ideas behind it.

You can take a look here.

jirikuncar commented 9 years ago

You can take a look here.

@dset0x For multi record formatting we should use print_records.

I hope we will find soon an agreement on this RFC so the refactoring can start soon.

jirikuncar commented 9 years ago

@tiborsimko WDYT about https://github.com/inveniosoftware/invenio/issues/2662#issuecomment-69761930 ?

dset0x commented 9 years ago

@dset0x For multi record formatting we should use print_records.

That only works with recIDs (and pulls records out from the database).

Edit: Nevermind. It looks like it does that in the template, so I will try to patch that instead.

jirikuncar commented 9 years ago

... deal with formatting of a record representation ...

That only works with recIDs (and pulls records out from the database).

That's why this RFC is open :wink:

dset0x commented 9 years ago

That's why this RFC is open :wink:

Indeed, you are right. After going through the RFC a bit more thoroughly, I've decided not to make the changes now. I will fallback to the legacy module until the situation is more clear.

jirikuncar commented 9 years ago

Please check new output format syntax in #2587 (c77b38c). (cc @tiborsimko)

jirikuncar commented 9 years ago

[ ] Build legacy layer that implements most important methods from old api, using the new modules. Likely perhaps even the old engines could be supported if specifically enabled. Legacy layer must be completely separate from new module.

This can be build on top of Invenio-Formatter package upon request.

inveniosoftware / invenio