BibFormat: add second formatting pass, on top of formats cache

jeromecaffaro commented 10 years ago

Originally on 2013-04-09

From Savannah Task #7030

Brief formats are usually pre-formatted and cached for later use, for performance reasons. In order to accommodate for changes in the record, the cache is invalidated upon metadata update. It is however sometimes needed to have the output updated independently of changes in the metadata. This is true for dependencies on other Invenio tables (eg. displaying how many people have cited this record), but also when some remote service would need to be called server-side, or when the output needs to be contextual to the user (eg. hiding links based on access rights).

The idea is then to introduce a second formatting phase, that will take care of doing some remaining light processing. It is proposed to handle this 2-pass formatting mainly at the granularity of format elements (BFE_*): either a format element is processed/executed at the first pass, either it is left to be processed/executed at the second pass. BibReformat would cache the result of the first formatting pass, leaving some unprocessed BFE_*s. Any call to the bibformat.format_record(...) function retrieving data from the cache (bibfmt table) would process the remaining BFE_*s as part of a second formatting pass.

It is also assumed in this proposal that the responsibility of choosing which format element is processed at the first or second pass is left to the admin or developer of the format element, not the formatting engine.

From the API point of view there should be no need to update client code. Only BibREformat would need to instruct the formatting engine to not process part of the elements (i.e. with some BFE_* tags left in the returned value):

< def format_record(recID, of, ln=CFG_SITE_LANG, verbose=0, search_pattern=None,
<                   xml_record=None, user_info=None, on_the_fly=False):
--

> def format_record(recID, of, ln=CFG_SITE_LANG, verbose=0, search_pattern=None,
>                   xml_record=None, user_info=None, on_the_fly=False):
>                   first_pass_only_p=False):
>   # Fetch from cache if exist:
>   #    1) if first_pass_only_p==False, call bibformat_engine to process remaining BFE_s*
>   #    2) if first_pass_only_p==True, return cache
>   # Elif cache does not exist:
>   #    3) call bibformat_engine.format_record(..., parameter(first_pass_only_p=first_pass_only_p)

One should however still be careful with modules that do not call the BibFormat APIs (in bibformat.py), but the engine directly (webjournal.py, docextract_webinterface.py). Care should also be taken with code accessing the cache "directly" (for eg. via bibformat_dblayer.get_preformatted_record(...) or even direct access to the bibfmt table) in order to make sure that a completely formatted output is always returned.

In order to define which BFE_* must not be processed by the formatting engine at the first pass, it is proposed to investigate the following possibilities:

Add a default parameter nocache to BFE_*s, such that one can define within a format template if a given element needs to be processed live. For eg.

<BFE_TITLE />
<BFE_ABSTRACT />
<BFE_RECORD_STATS nocache="true"/>

This `nocache` parameter would not need be defined by the format element, but would be provided by default as already done for `prefix`, `suffix`, `default` and `escape`. Such element marked with this parameter would only be processed in the second pass.

Have the format elements declare themselves if they are supposed to be cached or not. In the same way as format elements already declare if they are supposed to be escaped or not via their def escape_values(bfo) function (http://invenio-demo.cern.ch/help/admin/bibformat-admin-guide#escapeFormatElement), one could think adding a function format_live(bfo) returning whether live formatting is necessary or not. Similarly to format element escaping, this might override the nocache parameter possibly set in the format template (if used in combination with above point (1)).
Replace the format_element(bfo) function with a format_element_live(bfo) one in the BFE_*s that need to be processed only in the second pass.
As an alternative to the above points, one could think extending the concept by still processing all BFE_*s in the first pass, but adding an optional format_element_postprocess(bfo, first_pass_output) function, which would be called (when defined) in the second pass with the result of the first pass for that format element. That would require defining a way to store the intermediate state of such element in the cache, including the result of the first pass. This would allow mixed behaviour of format elements, in case part of the heavy processing can be done offline/beforehand.
Alternatively (or in combination) one could imagine processing IN ANY CASE all the BFE_*s in the first pass, and force those dynamic/live one to generate some code that include other BFE_*s. For eg. bfe_record_stats.format_element(bfo) would output "Record stats <b><BFE_RECORD_STATS_LIVE /></b>" or "Record stats <b><BFE_RECORD_STATS live="true" /></b>" that would be caught in the second pass.

Ideally this 2nd formatting pass should remain as light as possible. That should be made clear in the admin/hacking guide, and some careful design decisions should be taken to avoid "mistakes" by the admin. For eg. the nocache possibility mentioned above (1) might make it too easy for templates admins to slow down the system, while option (4) and (5) might give developers the incentive and possibility to build carefully-designed elements. Other ideas could include for eg. in option (3) the removal of the bfo parameter in the format_element_live() function to "make sure" that the developer will not access the metadata of the record unnecessarily (while it could have been done in the first pass), and might speed up the 2nd pass by eliminating the need to instantiate BibFormatObjects at runtime.

Some investigation should also be done regarding double-processing after the introduction of the second formatting pass: currently format elements cannot be nested (for eg <BFE_ASBTRACT prefix="<BFE_TITLE/>" />, but a second-pass formatting would probably make this possible, up to one more level (but not more). It would also change the behaviour regarding format elements that output markup that is usually interpreted by the formatting engine (mainly tags starting with <BFE_, but also I18N markup: <lang>, <en>, etc.). Note that these side-effects are probably for the better, but it is good to keep them in mind.

In addition to the delayed processing of format elements in a second pass, one could think about moving all the language-dependant processing in the second pass too (<lang> tags https://invenio-demo.cern.ch/help/admin/bibformat-admin-guide#internationalizationTemplate, _()_ syntax http://invenio-demo.cern.ch/help/admin/webstyle-webdoc-syntax#translation2).

It should be quick to evaluate if a second formatting pass is necessary or not: this could be done with almost no performance impact while parsing the template in the first pass. That might however require to change the structure returned by bibformat_engine.format_record(...), for eg. (output, second_pass_needed_p) (which is okay, given that the API is bibformat.formate_record(...)). The boolean second_pass_needed_p could be stored along with the cached format in bibfmt table, or used immediately when first_pass_only_p==False.

Note that this ticket does not consider caches created at other levels, such as in WebColl or WebJournal static pages.

jeromecaffaro commented 10 years ago

Originally on 2013-04-09

Note that option (5) does not allow to immediately identify which format elements need a 2nd pass formatting (in case we would like to display that info to the admin): the element must have been executed and its output parsed to see if some <BFE_ must be further processed.

Note also that some format elements do import others via Python, for eg. bfe_aid_authors.py, with from invenio.bibformat_elements.bfe_server_info import format_element as bfe_server. The importing element would then need to inherit the second pass formatting behavior from the imported one, in a way that ensures that no "dynamic" part would be cached, but also that the result after the 2nd formatting pass does not leave some unprocessed BFE_*. Option (5) might be more advantageous on that point.

invenio-developers commented 10 years ago

Originally by adeiana (@Osso) on 2013-04-10

After inspection, I think the 2nd pass would have to behave differently that the 1st pass. As you said, we are using it for a different purpose and that step should hardly make any query. Which means passing the whole by default to the format is a waste of ressources.

I think we can create a light bfo object and pass it in format_with_format_template.

I like the option 3: having a format_element_live

Most of our formats don't need live formatting. It would not add addition function in the default case.
We would have to store the fact that need live formatting with the cache, so we would have to modify the DB

invenio-developers commented 10 years ago

Originally by adeiana (@Osso) on 2013-04-10

My code is here adeiana/1464-bibformat-2nd-pass

We need a better way to ouput tags. It seems that using nocache="true" is the easiest way to achieve this. Other ideas?

jeromecaffaro commented 10 years ago

Originally on 2013-04-11

Note that there should be no need to expose the 2nd pass formatting function in the APIs (i.e. not in bibformat.py, only in bibformat_engine.py): in that way the first/second pass business gets transparent for clients using bibformat.format_record(...) function, and we avoid the possibility to get half-formatted result:

In [1] print format_record(11, "HB")
Out [1] '''
<b>Infrared constraints on the dark mass concentration observed in the cluster Abell 1942</b>
<br/> Gray, M E ; Ellis, R S ; Lewis, J R ; McMahon, R G ; et al [astro-ph/0101431]
<BFE_RECORD_STATS nocache="true" />
'''

Only BibReformat would need to stop the formatting after the second pass.

Replying to [comment:2 adeiana]:

I think we can create a light bfo object and pass it in format_with_format_template.

Yes. Note that BibFormatObject should in principle be lazy when xml_record parameter is not provided, and load the full record only on-demand. It should be further tested, but there should not be a big overhead in instantiating the full object (in the same was as already done for the first pass), especially if that information is readily available "for free" (user_info, search_pattern, ln). In that way we ensure that the second pass can fulfil most of the use cases described above (contextual formatting based on access rights, language, search query). Consequently to the "lazy" behaviour of the BibFormatObject, the main overhead to be considered should be simply in the creation of the Python instances. If I had to make the BibFormatObject "lighter", I'd rather instantiate it with a recID=0 and xml_record=None to prevent any access to the metadata in the second pass, to avoid any risk that developers of format elements get into trouble by mistake (though in principle the access to the metadata should not be that slow, so we should probably no do that).

It should also be decided if all the language-dependant formatting should be moved to the 2nd pass. One could consider at first sight that it would heavily impact the performances, but:

It should be almost instant to check after the first pass if some <lang> or _( string remains in the resulting output. In that way sites and formats without any translation would not get any overhead in the 2nd formatting pass.
Currently BibReformat only caches the results in CFG_SITE_LANG, such that access to search results in a different language triggers on-the-fly formatting (unless CFG_BIBFORMAT_DISABLE_I18N_FOR_CACHED_FORMATS is set accordingly). The performance benefits in serving internationalized output on the basis of pre-cached data would be significant (and would avoid having to create a separate cache for each language in order to reach a very similar gain).

For these reasons I would be in favour of moving the language-related processing in the second pass.

Replying to [comment:5 adeiana]:

We need a better way to ouput tags. It seems that using nocache="true" is the easiest way to achieve this. Other ideas?

Not sure I understand well your proposal. We can maybe meet to clarify?

invenio-developers commented 10 years ago

Originally by adeiana (@Osso) on 2013-04-11

I have updated the code with the interface you described. We now have only one format_record. I made it so it is storing the fact that it needs a 2nd pass when we cache the format in bibfmt. Therefore it is only doing the extra step when needed. nocache="1" triggers the need for a second pass.

For switching the i18n handling, I think we can define a new bfe maybe named bfelang and we would replace the function in the elements by one that outputs the tag? However it would be escaped so I am not sure what would be the best way to handle that yet.

jeromecaffaro commented 10 years ago

Originally on 2013-04-15

Replying to [comment:7 adeiana]:

For switching the i18n handling, I think we can define a new bfe maybe named bfelang and we would replace the function in the elements by one that outputs the tag? However it would be escaped so I am not sure what would be the best way to handle that yet.

Maybe that simply moving all the language handling code out of format_record_1st_pass(..) to format_record_2nd_pass(..) would do the trick.

I am also wondering if it is necessary to expose format_record_1st_pass(..) in bibformat.py. It is probably enough to keep it internal to the BibFormat module, in bibformat_engine.py.

invenio-developers commented 10 years ago

Originally by adeiana (@Osso) on 2013-04-15

That's true we don't need to expose it, I therefore moved format_record_1st_pass to bibformat_engine.py

The problem with the language handling comes from the bfeelements that call gettext directly. So in them you would them () returning text in the right language. Which does not require any handling in bibreformat so there is none. What makes the problem worse is that if we return it'll be escapped.

invenio-developers commented 10 years ago

Originally by adeiana (@Osso) on 2013-05-07

I think we should open another ticket to handle the translation in the 2nd but not block this one because I do not see a good way to implement it.

jeromecaffaro commented 10 years ago

Originally on 2013-05-07

I agree that that the use of _() within BFE_ cannot be easily handled. It should probably not even be handled, but left as responsibility to the developer of the BFE_ to use the right tools in case (s)he wants to provide "live" i18nized formatting.

What I thought would be easy to handle as part of the live formatting was the language handling done at the level of format templates (not BFE_), such that a format template developer could write:

<lang>
  <en>Title:<en>
  <fr>Titre:<fr>
  <de>Titel:</de>
</lang>
<BFE_MYTITLE />
<input type="button" value="_(Record)_"/>

Note the use of _()_ and <lang>. These should be rather easily doable for live formatting (the code exists, but is placed in the format_record_1st_pass(...) instead of the format_record_2nd_pass(...)).

By extension, a BFE_ developer could still make use of this to provide i18nized live-formatting:

def format_element(bfo):
    [...]
    return "_(Title)_: " + output

def escape_values(bfo):
    return 0

or

def format_element(bfo):
    [...]
    return "<lang><en>Title</en><fr>Titre</fr></lang>: " + output

def escape_values(bfo):
    return 0

(Note the use of escape_values() function to prevent the escaping of the tags). These would all be handled by the format template parser/engine, after having evaluated the format elements.

What do you think? Given the important change introduced with this ticket I would see it more efficient/safer to do the change at the same time.

invenio-developers commented 10 years ago

Originally by adeiana (@Osso) on 2013-05-15

Thanks for the explanation.

I have written a patch on top that moves the language handling in the 2nd pass when a second pass is needed. Otherwsie it is still done in the 1st pass.

jeromecaffaro commented 10 years ago

Originally on 2013-05-31

Some additional comments:

bibformat_dblayer.get_preformatted_record(..), bibformat_engine.format_record(..), etc. docstrings should be updated to reflect the change in returned struture.
calls to functions which now return a different structure should be checked. For eg. bfe_oai_marcxml.py uses get_preformatted_record(..) and must be updated accordingly.
if CFG_BIBFORMAT_USE_OLD_BIBFORMAT is set, I suspect we might receive half-formatted records (search_engine.py line 4726). We might also very well remove the support for PHP BibFormat at the same time.
as far as I can see bibreformat has not been updated to respect the no_cache parameter (it stores the fully formatted record in the DB), which kind of defeats the purpose of this branch ;-) (unless one only relies on lazy formatting. Is it the plan? If so we'd better have a way to prevent bibreformat from running on installations that want to use 2nd pass formatting, though I'd personally prefer to see bibreformat updated for the new feature).

invenio-developers commented 10 years ago

Originally by adeiana (@Osso) on 2014-01-09

In ac059022b55d0848a790fc8c42e8a7dc01cd555e:

#CommitTicketReference repository="invenio" revision="ac059022b55d0848a790fc8c42e8a7dc01cd555e"
BibFormat: second formatting pass

- Add second formatting pass, on top of formats cache
  (closes #1464)

Signed-off-by: Alessio Deiana <alessio.deiana@cern.ch>
Reviewed-by: Jerome Caffaro <jerome.caffaro@cern.ch>

inveniosoftware / invenio

BibFormat: add second formatting pass, on top of formats cache #1464

From Savannah Task #7030