Closed lnielsen closed 9 years ago
As we in INSPIRE are going very soon have to start the work on re-factoring/re-designing the formats on our Invenio v.2.0 site it is nice to discuss this already - to be sure we are going in the right direction.
I think the proposal looks good so far :+1:
The general direction is good, especially concerning "regular" output formats. No comments there. For example, in the case of CERN Open Data Portal instance, we are already using Jinja/JSON based formatting way whenever possible.
Here are some more comments though, especially for "non-regular" outputs.
(1) More nuance is needed concerning "support only Jinja based formatting" and "removing XSLT" transformations. While our records' internal representation is JSON, the real bibliographic information of our records, or their master format, is either MARC, or Dublin Core, or EAD, or UNIMARC, or JSON, or whatever each concrete Invenio instance uses. (This depends on collections; one can have this-and-this master format for photos, and that-and-that master format for archival records, and even-yet-another master format for books.) Forcing format conversions, such as MARC-to-DublinCore or EAD-to-DublinCore, to always go via JSON, is not practical, because there are already available bibliographic conversion style sheets, say, that people can simply take and plug. See my recent comment https://github.com/inveniosoftware/invenio/issues/2075#issuecomment-69216423 in another thread with graphical representation of these two different formatting channels.
And, while I agree with the RFC for the use case of the first channel, we need to address also the need for the other channel.
In one way or another, one should be able to call some high-level API "to format this record in this format", which would sometimes spit out master format directly, sometimes spit out pre-cached format from "dissemination information store", sometimes use JSON/Jinja live to produce the wanted output, but sometimes also call XSLT or some other bibliographic library (such as pybliographer heavily used at EPFL) that will produce desired output format out of the native master format in the best possible way.
One can address my remark in several ways. E.g. the current BibFormat module contained several "engine components" that permitted to use more than one way of formatting the records to produce the wanted format. E.g. one can keep "formatter" to server only JSON-based formatting use cases, and invent a new module, say "exporter" or "disseminator", that would take care of creating/serving various wanted formats out of native ones independently. Whatever the way, we need to address and offer such a non-JSON-based formatting option in one way or another. Hence it should be elaborated as part of this RFC.
(2) Formatting pure JSON representation of a record may not be always enough when we need to show information from other records to generate the wanted output format. E.g. we may want to search and include information about 312 conference contributions records when formatting some conference proceedings record; this is fully dynamic. E.g. we may want to include the citation counts when showing references of a paper. Etc. Hence the formatter may want to use live search engine in these cases, as it would not be practical to represent all the various "linked information" as derived JSON fields for all the formatting needs. You probably thought of that, but we need to spell it out and think of what can be derived in JSON and what can be better done as live querying.
(3) Another related thing to consider is formatting more than one
record, or current format_records()
and our various summary formats
such as citation summary. It seems obvious to separate these out (and
summary formats were always separated), but it may good to muse about
"collection formatting" or "summary formatting" as part of the same
formatting RFC as well, perhaps. (Note the current separation of
summary formats in WebSearch/BibRank, which is not ideal. Could be
another component/module that uses search and rank as client, for
example.)
P.S. Why is this p_critical
? Note the current ticket priority rules that say:
Thanks for the detailed response and raised points :-)
P.S. Why is this p_critical?
My mistake. I've fixed it now.
Response in reverse order :-)
(3) Another related thing to consider is formatting more than one record.
Yes, I completely agree that this needs to be taken into account since it affects the output (the obvious case is MARCXML <collection>
-tag).
(2) Formatting pure JSON representation of a record may not be always enough when we need to show information from other records to generate the wanted output format.
I understand the problem you are describing of including dynamic data in the formats. In my opinion the dynamic data should be computed prior to the formatting (or perhaps rather call it rendering). Hence, the input to the formatter is some representation that contains all the information needed to render the output. I.e. querying for 312 conference proceedings should be done in an intermediate step prior to rendering the output. In my opinion this step should not happen inside the formatting.
(1) More nuance is needed concerning "support only Jinja based formatting" and "removing XSLT" transformations.
Yes, it's perhaps a bit too aggressive formulation and perhaps we want to replace Jinja in the future, so keeping the concept of engines makes sense to me, as long as these engines are properly isolated from one another.
E.g. one can keep "formatter" to server only JSON-based formatting use cases, and invent a new module, say "exporter" or "disseminator", that would take care of creating/serving various wanted formats out of native ones independently.
I think I agree with this approach. Most of the cases you list I think is a transformation between different data models using existing defined transformations. Separating this out I think will likely make the code cleaner.
In one way or another, one should be able to call some high-level API "to format this record in this format", which would sometimes spit out master format directly, sometimes spit out pre-cached format from "dissemination information store"
Yes, I do agree that we need high-level APIs that orchestrates rendering/formatting of records. My main issue with the current formatter is that its orchestration is not made up of small isolated modules that only does a very well defined job. E.g. caching and formatting is very intertwined, and there's all sorts of performance hacks. This makes maintenance error prone and very hard, you'll have a hard time moving e.g. caching to redis instead of bibfmt table.
I'll update the RFC with a list of the issues raised.
Dealing with dynamic data.
Only using calculated record fields.
Engines concept useful to abstract underlying technology.
Most of it can be done simply like:
Example formatter/baz.xml
:
{%- cached record['_id'] -%}
{% if record.additional_info['master_format'] == 'marc' %}
{{ record.legacy_export_as_marc() | xslt(pkg_resources.get_filename('foo.bar', 'baz.xslt')) }}
{% else %}
<foo>{{ record['title.title'] }}</foo>
{% endif %}
{%- endcached -%}
Thanks a lot for starting the discussion @lnielsen . Good points raised so far. As @jalavik mentioned, we are going to be working on this in the following months so happy to help out.
bfe_
elements and bfo
objects.The idea is to merge database table format
and *.bfo
files into one file per output format using some standard format/markup. See an example for detailed output format.
description: 'HTML detailed output format, used for Detailed record pages.'
content_type: text/html
mime_type: application/x-foo
visibility: 1
rules:
- "tag 980.a":
- PICTURE: Picture_HTML_detailed.tpl
- POETRY: Poetry_HTML_detailed.tpl
- "tag 773.t":
- "Atlantis Times": Journal_HTML_detailed.bft
default: Default_HTML_detailed.tpl
Option 2 for rules:
rules:
- tag: 980
value: PICTURE
template: Picture_HTML_detailed.tpl
- field: _collections
value: POETRY
template: Poetry_HTML_detailed.tp
{
"rules": [
{
"tag 980.a": [
{
"PICTURE": "Picture_HTML_detailed.tpl"
},
{
"POETRY": "Poetry_HTML_detailed.tpl"
}
]
},
{
"tag 773.t": [
{
"Atlantis Times": "Journal_HTML_detailed.bft"
}
]
}
],
"default": "Default_HTML_detailed.tpl",
"content_type": "text/html",
"description": "HTML detailed output format, used for Detailed record pages.",
"mime_type": "application/x-foo"
}
{'default': 'Default_HTML_detailed.tpl',
'mimetype': 'text/html',
'rules': [{'tag 980.a': [{'PICTURE': 'Picture_HTML_detailed.tpl'},
{'POETRY': 'Poetry_HTML_detailed.tpl'}]},
{'tag 773.t': [{'Atlantis Times': 'Journal_HTML_detailed.bft'}]}]}
or
default = 'Default_HTML_detailed.tpl'
mimetype = 'text/html'
rules = [{'tag 980.a': [{'PICTURE': 'Picture_HTML_detailed.tpl'},
{'POETRY': 'Poetry_HTML_detailed.tpl'}]},
{'tag 773.t': [{'Atlantis Times': 'Journal_HTML_detailed.bft'}]}]
We could also add rules for fields:
rules:
- "field collections":
- Pictures: Picture_HTML_detailed.tpl
- "field author.last_name":
- Ellis: Ellis_HTML_detail.tpl
cc @tiborsimko @kaplun
I have hastily reconstructed and temporarily moved a formatter that used to be in the workflows
module into the formatter
module in order to have some working example in the checker
module. It differs from the current formatters in that it accepts records instead of recIDs.
The code and comments are -edit: NOT- very clear (or correct at some points, I have to talk with @jalavik about some things), but I wonder if we wish to keep the basic ideas behind it.
You can take a look here.
You can take a look here.
@dset0x For multi record formatting we should use print_records.
I hope we will find soon an agreement on this RFC so the refactoring can start soon.
@tiborsimko WDYT about https://github.com/inveniosoftware/invenio/issues/2662#issuecomment-69761930 ?
@dset0x For multi record formatting we should use print_records.
That only works with recIDs (and pulls records out from the database).
Edit: Nevermind. It looks like it does that in the template, so I will try to patch that instead.
... deal with formatting of a record representation ...
That only works with recIDs (and pulls records out from the database).
That's why this RFC is open :wink:
That's why this RFC is open :wink:
Indeed, you are right. After going through the RFC a bit more thoroughly, I've decided not to make the changes now. I will fallback to the legacy module until the situation is more clear.
Please check new output format syntax in #2587 (c77b38c). (cc @tiborsimko)
- [ ] Build legacy layer that implements most important methods from old api, using the new modules. Likely perhaps even the old engines could be supported if specifically enabled. Legacy layer must be completely separate from new module.
This can be build on top of Invenio-Formatter package upon request.
While this is not the most urgent issue, I'd like a general opinion on if this is the way to go in the future. The RFC would have to be fleshed out to be more specific, so this is just the about agreeing on some principles.
Problem: Currently we support many different formatting engines with lots of legacy code which makes maintenance hard. The current code is mixing responsibilities - e.g. formatting of a record representation, fetching of that record, caching, debug, which also makes maintenance and replacement of individual parts hard. New code written for overlays is in danger of depending too much on legacy code (using e..g existing bfe_elements), thus making transition harder in the future.
Proposal:
format_record
should not take a recid, but say a JSON representation instead). Create separate modules to deal with caching, etc.P.S Don't shoot me :-)
Issues raised in comments: