Open kaplun opened 10 years ago
+1
This can nowadays be refactored as methods to be added into Tokenizers.
A common use case: when a record is removed from the index (for example with bibindex --del --id 123), bibindex_engine needs to know which dependent records need to be reindexed.
Another requirement for the bibindex command is that the indexing process must be associative when selecting records using the --id and --collection parameters: splitting a selection across several runs must give the same result as a single run. For example, executing bibindex --id 100-200 and bibindex --id 200-300 must produce the same result as bibindex --id 100-300. Without dependent records this problem did not arise.
To be clear, by "dependent record" I mean a record whose tokenizer takes tokens not only from the record's own tag values, but also from other records or from an external function. For example, a tokenizer that, given a record id, returns the canonical author ids is a tokenizer of a dependent record. Another example is records under authority control: they are indexed using values from authority records.
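The associativity requirement can be illustrated with a small sketch. Everything here is hypothetical: DEPENDENTS is an invented in-memory map (not an Invenio table), records_to_reindex is an illustrative helper, and plain Python sets stand in for intbitset.

```python
# Hypothetical map: id of a record that others depend on (for example an
# authority record) -> ids of the records indexed with its values.
DEPENDENTS = {
    150: {120, 410},
    250: {260, 900},
}


def records_to_reindex(selected):
    """Return the selected ids plus every record that depends on them."""
    result = set(selected)
    for recid in selected:
        result |= DEPENDENTS.get(recid, set())
    return result


# Two partial runs (--id 100-200 and --id 200-300) versus one full run.
first = records_to_reindex(range(100, 201))
second = records_to_reindex(range(200, 301))
whole = records_to_reindex(range(100, 301))

# The split runs must touch exactly the same records as the single run.
assert first | second == whole
```

Because dependents are collected per selected record and merged with a set union, records outside the requested range (410 and 900 here) are picked up regardless of how the range is split.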
I am implementing two methods in the tokenizers, get_modified_recids(date_range, index_name) and get_affected_recids(modified_recids, index_name):
class BibIndexXXXTokenizer(object):
    ...

    @staticmethod
    def get_modified_recids(date_range, index_name):
        """Returns all the records that need to be reindexed using this
        tokenizer due to an action that happened in the specified date range.
        Assumes that the tokenizer is used for the index index_name.
        If a record needs to be updated due to a modification that happened
        to another record, use get_affected_recids() instead of this method.
        @param date_range: the dates between which this function will look
            for modified records. If the end_date is None this function will
            look for modified records after start_date
        @type date_range: tuple (start_date, end_date)
        @param index_name: the name of the index
        @type index_name: string
        @return: the modified records
        @type return: intbitset
        """
        ...

    @staticmethod
    def get_affected_recids(modified_recids, index_name):
        """Returns all the records that need to be reindexed using this
        tokenizer due to a modification that happened to the records in
        modified_recids.
        Assumes that the tokenizer is used for the index index_name.
        If a record needs to be updated due to an action that did not affect
        another record, use get_modified_recids() instead of this method.
        @param modified_recids: the ids of the modified records
        @type modified_recids: intbitset
        @param index_name: the name of the index
        @type index_name: string
        @return: the affected records
        @type return: intbitset
        """
        ...
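As a rough illustration of how the two methods could fit together, here is a minimal sketch. It is not the actual implementation: ToyAuthorityTokenizer, recids_to_reindex, MODIFICATION_DATES and AUTHORITY_LINKS are all invented for the example, the data lives in plain dictionaries instead of database tables, and Python sets are used in place of intbitset.

```python
from datetime import datetime

# Hypothetical stand-in for a table of record modification dates.
MODIFICATION_DATES = {
    1: datetime(2014, 2, 1),
    2: datetime(2014, 2, 20),
    3: datetime(2014, 3, 5),
}
# Hypothetical links: authority record id -> records that use its values.
AUTHORITY_LINKS = {2: {10, 11}, 3: {12}}


class ToyAuthorityTokenizer(object):
    """Toy tokenizer following the two-method interface proposed above."""

    @staticmethod
    def get_modified_recids(date_range, index_name):
        # Records changed inside the range; an end_date of None means
        # "everything modified from start_date onwards".
        start_date, end_date = date_range
        return set(
            recid for recid, modified in MODIFICATION_DATES.items()
            if modified >= start_date
            and (end_date is None or modified <= end_date)
        )

    @staticmethod
    def get_affected_recids(modified_recids, index_name):
        # Records that must be reindexed because a record they depend on
        # was modified.
        affected = set()
        for recid in modified_recids:
            affected |= AUTHORITY_LINKS.get(recid, set())
        return affected


def recids_to_reindex(tokenizer, date_range, index_name):
    """Hypothetical engine-side helper combining both methods."""
    modified = tokenizer.get_modified_recids(date_range, index_name)
    return modified | tokenizer.get_affected_recids(modified, index_name)
```

With this split the engine only needs the generic helper; what counts as "modified" or "affected" is decided entirely by the tokenizer.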
This solution removes all the special cases from bibindex_engine.py, moving them to the appropriate tokenizer. For example, in get_recIDs_by_date_bibliographic the special case for the fulltext text_extraction_date will be checked only when using the BibIndexFulltextTokenizer, and no longer for every index.
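The idea of moving the fulltext special case into its tokenizer could look roughly like the sketch below. All names here (DefaultTokenizerSketch, FulltextTokenizerSketch, TOKENIZERS, get_recids_by_date, and the two date dictionaries) are assumptions made for illustration; they are not the real Invenio classes or tables, and sets replace intbitset.

```python
from datetime import datetime

# Hypothetical data: recid -> last metadata modification date.
RECORD_MODIFIED = {1: datetime(2014, 1, 10), 2: datetime(2014, 2, 15)}
# Hypothetical data: recid -> date of the last fulltext extraction.
TEXT_EXTRACTED = {1: datetime(2014, 2, 18)}


def _in_range(date, date_range):
    """True if date falls in (start_date, end_date); end may be None."""
    start_date, end_date = date_range
    return date >= start_date and (end_date is None or date <= end_date)


class DefaultTokenizerSketch(object):
    @staticmethod
    def get_modified_recids(date_range, index_name):
        # Generic rule: only the record modification date matters.
        return set(r for r, d in RECORD_MODIFIED.items()
                   if _in_range(d, date_range))


class FulltextTokenizerSketch(DefaultTokenizerSketch):
    @staticmethod
    def get_modified_recids(date_range, index_name):
        # Fulltext-only rule: also consider the text extraction date.
        recids = DefaultTokenizerSketch.get_modified_recids(
            date_range, index_name)
        recids |= set(r for r, d in TEXT_EXTRACTED.items()
                      if _in_range(d, date_range))
        return recids


# Hypothetical index -> tokenizer mapping used by the engine.
TOKENIZERS = {
    "title": DefaultTokenizerSketch,
    "fulltext": FulltextTokenizerSketch,
}


def get_recids_by_date(index_name, date_range):
    """Engine code with no per-index special cases: it just delegates."""
    tokenizer = TOKENIZERS[index_name]
    return tokenizer.get_modified_recids(date_range, index_name)
```

In this sketch record 1 is picked up for the fulltext index because its text was re-extracted, while the title index ignores it, and the engine function contains no fulltext-specific branch.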
Originally on 2014-02-21
Currently BibIndex hard-codes several algorithms to discover which records are affected with respect to certain indexes and hence need to be reindexed:
- hstRECORD table analysis to discover modified MARC fields.
- 'author', 'firstauthor', 'exactauthor', 'exactfirstauthor' indexes, which query the BibAuthorID tables.
- bibdoc table information.
This should be generalized and transferred into plugins. One possible solution would be to extend the current BibIndexDefaultTokenizer and introduce a new optional method such as: