Closed jeromecaffaro closed 10 years ago
Originally on 2013-04-09
Note that option (5) does not allow to immediately identify which format elements need a 2nd pass formatting (in case we would like to display that info to the admin): the element must have been executed and its output parsed to see if some <BFE_
must be further processed.
Note also that some format elements do import others via Python, for eg. bfe_aid_authors.py
, with
from invenio.bibformat_elements.bfe_server_info import format_element as bfe_server
.
The importing element would then need to inherit the second pass formatting behavior from the imported one, in a way that ensures that no "dynamic" part would be cached, but also that the result after the 2nd formatting pass does not leave some unprocessed BFE_*
. Option (5) might be more advantageous on that point.
Originally by adeiana (@Osso) on 2013-04-10
After inspection, I think the 2nd pass would have to behave differently that the 1st pass. As you said, we are using it for a different purpose and that step should hardly make any query. Which means passing the whole by default to the format is a waste of ressources.
I think we can create a light bfo object and pass it in format_with_format_template.
I like the option 3: having a format_element_live
Originally by adeiana (@Osso) on 2013-04-10
My code is here adeiana/1464-bibformat-2nd-pass
We need a better way to ouput tags. It seems that using nocache="true" is the easiest way to achieve this. Other ideas?
Originally on 2013-04-11
Note that there should be no need to expose the 2nd pass formatting function in the APIs (i.e. not in bibformat.py
, only in bibformat_engine.py
): in that way the first/second pass business gets transparent for clients using bibformat.format_record(...)
function, and we avoid the possibility to get half-formatted result:
In [1] print format_record(11, "HB")
Out [1] '''
<b>Infrared constraints on the dark mass concentration observed in the cluster Abell 1942</b>
<br/> Gray, M E ; Ellis, R S ; Lewis, J R ; McMahon, R G ; et al [astro-ph/0101431]
<BFE_RECORD_STATS nocache="true" />
'''
Only BibReformat would need to stop the formatting after the second pass.
Replying to [comment:2 adeiana]:
I think we can create a light bfo object and pass it in format_with_format_template.
Yes. Note that BibFormatObject should in principle be lazy when xml_record
parameter is not provided, and load the full record only on-demand. It should be further tested, but there should not be a big overhead in instantiating the full object (in the same was as already done for the first pass), especially if that information is readily available "for free" (user_info
, search_pattern
, ln
). In that way we ensure that the second pass can fulfil most of the use cases described above (contextual formatting based on access rights, language, search query). Consequently to the "lazy" behaviour of the BibFormatObject, the main overhead to be considered should be simply in the creation of the Python instances. If I had to make the BibFormatObject "lighter", I'd rather instantiate it with a recID=0
and xml_record=None
to prevent any access to the metadata in the second pass, to avoid any risk that developers of format elements get into trouble by mistake (though in principle the access to the metadata should not be that slow, so we should probably no do that).
It should also be decided if all the language-dependant formatting should be moved to the 2nd pass. One could consider at first sight that it would heavily impact the performances, but:
<lang>
or _(
string remains in the resulting output. In that way sites and formats without any translation would not get any overhead in the 2nd formatting pass.CFG_SITE_LANG
, such that access to search results in a different language triggers on-the-fly formatting (unless CFG_BIBFORMAT_DISABLE_I18N_FOR_CACHED_FORMATS
is set accordingly). The performance benefits in serving internationalized output on the basis of pre-cached data would be significant (and would avoid having to create a separate cache for each language in order to reach a very similar gain).For these reasons I would be in favour of moving the language-related processing in the second pass.
Replying to [comment:5 adeiana]:
We need a better way to ouput tags. It seems that using nocache="true" is the easiest way to achieve this. Other ideas?
Not sure I understand well your proposal. We can maybe meet to clarify?
Originally by adeiana (@Osso) on 2013-04-11
I have updated the code with the interface you described. We now have only one format_record. I made it so it is storing the fact that it needs a 2nd pass when we cache the format in bibfmt. Therefore it is only doing the extra step when needed. nocache="1" triggers the need for a second pass.
For switching the i18n handling, I think we can define a new bfe maybe named bfelang and we would replace the function in the elements by one that outputs the tag? However it would be escaped so I am not sure what would be the best way to handle that yet.
Originally on 2013-04-15
Replying to [comment:7 adeiana]:
For switching the i18n handling, I think we can define a new bfe maybe named bfelang and we would replace the function in the elements by one that outputs the tag? However it would be escaped so I am not sure what would be the best way to handle that yet.
Maybe that simply moving all the language handling code out of format_record_1st_pass(..)
to format_record_2nd_pass(..)
would do the trick.
I am also wondering if it is necessary to expose format_record_1st_pass(..)
in bibformat.py. It is probably enough to keep it internal to the BibFormat module, in bibformat_engine.py
.
Originally by adeiana (@Osso) on 2013-04-15
That's true we don't need to expose it, I therefore moved format_record_1st_pass to bibformat_engine.py
The problem with the language handling comes from the bfeelements that call gettext directly.
So in them you would them () returning text in the right language. Which does not require any handling in bibreformat
so there is none.
What makes the problem worse is that if we return
Originally by adeiana (@Osso) on 2013-05-07
I think we should open another ticket to handle the translation in the 2nd but not block this one because I do not see a good way to implement it.
Originally on 2013-05-07
I agree that that the use of _()
within BFE_
cannot be easily handled. It should probably not even be handled, but left as responsibility to the developer of the BFE_
to use the right tools in case (s)he wants to provide "live" i18nized formatting.
What I thought would be easy to handle as part of the live formatting was the language handling done at the level of format templates (not BFE_
), such that a format template developer could write:
<lang>
<en>Title:<en>
<fr>Titre:<fr>
<de>Titel:</de>
</lang>
<BFE_MYTITLE />
<input type="button" value="_(Record)_"/>
Note the use of _()_
and <lang>
. These should be rather easily doable for live formatting (the code exists, but is placed in the format_record_1st_pass(...)
instead of the format_record_2nd_pass(...)
).
By extension, a BFE_
developer could still make use of this to provide i18nized live-formatting:
def format_element(bfo):
[...]
return "_(Title)_: " + output
def escape_values(bfo):
return 0
or
def format_element(bfo):
[...]
return "<lang><en>Title</en><fr>Titre</fr></lang>: " + output
def escape_values(bfo):
return 0
(Note the use of escape_values()
function to prevent the escaping of the tags). These would all be handled by the format template parser/engine, after having evaluated the format elements.
What do you think? Given the important change introduced with this ticket I would see it more efficient/safer to do the change at the same time.
Originally by adeiana (@Osso) on 2013-05-15
Thanks for the explanation.
I have written a patch on top that moves the language handling in the 2nd pass when a second pass is needed. Otherwsie it is still done in the 1st pass.
Originally on 2013-05-31
Some additional comments:
bibformat_dblayer.get_preformatted_record(..)
,
bibformat_engine.format_record(..)
, etc. docstrings should be updated to
reflect the change in returned struture.bfe_oai_marcxml.py
uses get_preformatted_record(..)
and must be updated accordingly.CFG_BIBFORMAT_USE_OLD_BIBFORMAT
is set, I suspect we might
receive half-formatted records (search_engine.py
line 4726). We
might also very well remove the support for PHP BibFormat at the
same time.bibreformat
has not been updated to respect
the no_cache
parameter (it stores the fully formatted record in
the DB), which kind of defeats the purpose of this branch ;-)
(unless one only relies on lazy formatting. Is it the plan? If so
we'd better have a way to prevent bibreformat
from running on
installations that want to use 2nd pass formatting, though I'd
personally prefer to see bibreformat
updated for the new feature).Originally by adeiana (@Osso) on 2014-01-09
In ac059022b55d0848a790fc8c42e8a7dc01cd555e:
#CommitTicketReference repository="invenio" revision="ac059022b55d0848a790fc8c42e8a7dc01cd555e"
BibFormat: second formatting pass
- Add second formatting pass, on top of formats cache
(closes #1464)
Signed-off-by: Alessio Deiana <alessio.deiana@cern.ch>
Reviewed-by: Jerome Caffaro <jerome.caffaro@cern.ch>
Originally on 2013-04-09
From Savannah Task #7030
Brief formats are usually pre-formatted and cached for later use, for performance reasons. In order to accommodate for changes in the record, the cache is invalidated upon metadata update. It is however sometimes needed to have the output updated independently of changes in the metadata. This is true for dependencies on other Invenio tables (eg. displaying how many people have cited this record), but also when some remote service would need to be called server-side, or when the output needs to be contextual to the user (eg. hiding links based on access rights).
The idea is then to introduce a second formatting phase, that will take care of doing some remaining light processing. It is proposed to handle this 2-pass formatting mainly at the granularity of format elements (
BFE_*
): either a format element is processed/executed at the first pass, either it is left to be processed/executed at the second pass. BibReformat would cache the result of the first formatting pass, leaving some unprocessedBFE_*s
. Any call to thebibformat.format_record(...)
function retrieving data from the cache (bibfmt
table) would process the remainingBFE_*s
as part of a second formatting pass.It is also assumed in this proposal that the responsibility of choosing which format element is processed at the first or second pass is left to the admin or developer of the format element, not the formatting engine.
From the API point of view there should be no need to update client code. Only BibREformat would need to instruct the formatting engine to not process part of the elements (i.e. with some
BFE_*
tags left in the returned value):One should however still be careful with modules that do not call the BibFormat APIs (in
bibformat.py
), but the engine directly (webjournal.py
,docextract_webinterface.py
). Care should also be taken with code accessing the cache "directly" (for eg. viabibformat_dblayer.get_preformatted_record(...)
or even direct access to thebibfmt
table) in order to make sure that a completely formatted output is always returned.In order to define which
BFE_*
must not be processed by the formatting engine at the first pass, it is proposed to investigate the following possibilities:nocache
toBFE_*s
, such that one can define within a format template if a given element needs to be processed live. For eg.def escape_values(bfo)
function (http://invenio-demo.cern.ch/help/admin/bibformat-admin-guide#escapeFormatElement), one could think adding a functionformat_live(bfo)
returning whether live formatting is necessary or not. Similarly to format element escaping, this might override thenocache
parameter possibly set in the format template (if used in combination with above point (1)).format_element(bfo)
function with aformat_element_live(bfo)
one in theBFE_*s
that need to be processed only in the second pass.BFE_*s
in the first pass, but adding an optionalformat_element_postprocess(bfo, first_pass_output)
function, which would be called (when defined) in the second pass with the result of the first pass for that format element. That would require defining a way to store the intermediate state of such element in the cache, including the result of the first pass. This would allow mixed behaviour of format elements, in case part of the heavy processing can be done offline/beforehand.BFE_*s
in the first pass, and force those dynamic/live one to generate some code that include otherBFE_*s
. For eg.bfe_record_stats.format_element(bfo)
would output"Record stats <b><BFE_RECORD_STATS_LIVE /></b>"
or"Record stats <b><BFE_RECORD_STATS live="true" /></b>"
that would be caught in the second pass.Ideally this 2nd formatting pass should remain as light as possible. That should be made clear in the admin/hacking guide, and some careful design decisions should be taken to avoid "mistakes" by the admin. For eg. the
nocache
possibility mentioned above (1) might make it too easy for templates admins to slow down the system, while option (4) and (5) might give developers the incentive and possibility to build carefully-designed elements. Other ideas could include for eg. in option (3) the removal of thebfo
parameter in theformat_element_live()
function to "make sure" that the developer will not access the metadata of the record unnecessarily (while it could have been done in the first pass), and might speed up the 2nd pass by eliminating the need to instantiateBibFormatObjects
at runtime.Some investigation should also be done regarding double-processing after the introduction of the second formatting pass: currently format elements cannot be nested (for eg
<BFE_ASBTRACT prefix="<BFE_TITLE/>" />
, but a second-pass formatting would probably make this possible, up to one more level (but not more). It would also change the behaviour regarding format elements that output markup that is usually interpreted by the formatting engine (mainly tags starting with<BFE_
, but also I18N markup:<lang>
,<en>
, etc.). Note that these side-effects are probably for the better, but it is good to keep them in mind.In addition to the delayed processing of format elements in a second pass, one could think about moving all the language-dependant processing in the second pass too (
<lang>
tags https://invenio-demo.cern.ch/help/admin/bibformat-admin-guide#internationalizationTemplate,_()_
syntax http://invenio-demo.cern.ch/help/admin/webstyle-webdoc-syntax#translation2).It should be quick to evaluate if a second formatting pass is necessary or not: this could be done with almost no performance impact while parsing the template in the first pass. That might however require to change the structure returned by
bibformat_engine.format_record(...)
, for eg.(output, second_pass_needed_p)
(which is okay, given that the API isbibformat.formate_record(...)
). The booleansecond_pass_needed_p
could be stored along with the cached format inbibfmt
table, or used immediately whenfirst_pass_only_p==False
.Note that this ticket does not consider caches created at other levels, such as in WebColl or WebJournal static pages.