inveniosoftware / invenio

Invenio digital library framework
https://invenio.readthedocs.io
MIT License

BibIndex: refactor affected records computation #1710

Open kaplun opened 10 years ago

kaplun commented 10 years ago

Originally on 2014-02-21

Currently BibIndex hard-codes several algorithms to discover which records are affected with respect to certain indexes and hence need to be reindexed.

  1. There is a generic method, valid for most indexes, that analyses the hstRECORD table to discover modified MARC fields.
  2. There is a hard-coded method for the 'author', 'firstauthor', 'exactauthor' and 'exactfirstauthor' indexes that queries the BibAuthorID tables.
  3. There is a fulltext-related method that checks the bibdoc table information.

This should be generalized and moved into plugins. One possible solution would be to extend the current BibIndexDefaultTokenizer and introduce a new optional method such as:

def get_affected_recids_since(self, date):
    pass

fpoli commented 10 years ago

+1

kaplun commented 10 years ago

Nowadays this can be refactored into methods added to the tokenizers.

fpoli commented 10 years ago

A common use case: when a record is removed from the index (for example with bibindex --del --id 123), bibindex_engine needs to know which dependent records need to be reindexed.

Another requirement for the bibindex command is that the indexing process must be associative when selecting records using the --id and --collection parameters. For example, running bibindex --id 100-200 and then bibindex --id 200-300 must have the same result as running bibindex --id 100-300. Without dependent records this problem did not exist.
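The requirement can be written as a check: the records reindexed for a range must equal the union of the records reindexed for its sub-ranges. Below is a minimal sketch of that check. It assumes a hypothetical DEPENDENTS map (record id to the ids of the records that depend on it) and uses plain Python sets in place of intbitset; neither name comes from Invenio.

```python
# Hypothetical dependency map: record id -> ids of records whose index
# entries are derived from it. Not part of Invenio; for illustration only.
DEPENDENTS = {150: {420}, 250: {120, 999}}


def recids_to_reindex(first, last):
    """Return the selected ids plus every record that depends on them."""
    selected = set(range(first, last + 1))
    affected = set()
    for recid in selected:
        affected |= DEPENDENTS.get(recid, set())
    return selected | affected


# Two partial runs must cover exactly what a single run covers.
split = recids_to_reindex(100, 200) | recids_to_reindex(200, 300)
whole = recids_to_reindex(100, 300)
assert split == whole
```

The check holds here because the dependent records are computed per selected record and then merged, so the way the range is split does not matter.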

To be clear, by "dependent record" I mean a record whose tokenizer takes tokens not only from the record's own tag values, but also from other records or from an external function. For example, a tokenizer that, given a record id, returns the canonical author ids is a tokenizer of a dependent record. Another example is records under authority control: they are indexed using values from authority records.
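To make the definition concrete, here is a minimal sketch of a tokenizer for a record under authority control. The RECORDS and AUTHORITY dictionaries and the tokenize function are hypothetical stand-ins with made-up data, not Invenio APIs.

```python
# Hypothetical data: bibliographic records point to an authority record,
# and the authority record holds the name variants to be indexed.
RECORDS = {10: {"author_authority_id": 7}, 11: {"author_authority_id": 7}}
AUTHORITY = {7: ["Doe, J.", "Doe, Jane"]}


def tokenize(recid):
    """Return index tokens for recid, taken from its authority record."""
    authority_id = RECORDS[recid]["author_authority_id"]
    return list(AUTHORITY[authority_id])


tokens_before = tokenize(10)
# Editing authority record 7 changes the tokens of records 10 and 11,
# although neither of those records was modified itself.
AUTHORITY[7].append("Doe, Jane A.")
tokens_after = tokenize(10)
```

Records 10 and 11 are the dependent records here: a check based only on their own modification dates would miss the change.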

I am implementing two methods in the tokenizers: get_modified_recids(date_range, index_name) and get_affected_recids(modified_recids, index_name):

class BibIndexXXXTokenizer(object):
    ...

    @staticmethod
    def get_modified_recids(date_range, index_name):
        """Returns all the records that need to be reindexed using this
        tokenizer due to an action that happened in the specified date range.
        Assumes that the tokenizer is used for the index index_name.

        If a record needs to be updated due to a modification made to
        another record, use get_affected_recids() instead of this method.

        @param date_range: the dates between which this function will look for
            modified records. If the end_date is None this function will look
            for modified records after start_date
        @type date_range: tuple (start_date, end_date)
        @param index_name: the name of the index
        @type index_name: string
        @return: the modified records
        @type return: intbitset
        """
        ...

    @staticmethod
    def get_affected_recids(modified_recids, index_name):
        """Returns all the records that need to be reindexed using this
        tokenizer due to a modification made to the records in
        modified_recids.
        Assumes that the tokenizer is used for the index index_name.

        If a record needs to be updated due to an action that did not affect
        another record, use get_modified_recids() instead of this method.

        @param modified_recids: the ids of the modified records
        @type modified_recids: intbitset
        @param index_name: the name of the index
        @type index_name: string
        @return: the affected records
        @type return: intbitset
        """
        ...

This solution removes all the special cases from bibindex_engine.py, moving them to the appropriate tokenizer. For example, in get_recIDs_by_date_bibliographic the special case for the fulltext text_extraction_date will be checked only when using the BibIndexFulltextTokenizer, and no longer for every index.
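With both methods in place, the engine-side logic could reduce to something like the following. This is only a sketch under stated assumptions: ExampleTokenizer, its hard-coded data and recids_to_reindex are hypothetical, and plain Python sets stand in for intbitset.

```python
from datetime import datetime


class ExampleTokenizer(object):
    """Hypothetical tokenizer exposing the two proposed static methods."""

    # Hypothetical data: modification dates and reverse dependencies.
    _MODIFIED_ON = {1: datetime(2014, 3, 1), 2: datetime(2014, 5, 1)}
    _DEPENDENTS = {1: {30, 31}}

    @staticmethod
    def get_modified_recids(date_range, index_name):
        # index_name is unused in this toy example.
        start, end = date_range
        return {
            recid
            for recid, when in ExampleTokenizer._MODIFIED_ON.items()
            if when >= start and (end is None or when <= end)
        }

    @staticmethod
    def get_affected_recids(modified_recids, index_name):
        affected = set()
        for recid in modified_recids:
            affected |= ExampleTokenizer._DEPENDENTS.get(recid, set())
        return affected


def recids_to_reindex(tokenizer, date_range, index_name):
    """Combine directly modified records with their dependent records."""
    modified = tokenizer.get_modified_recids(date_range, index_name)
    affected = tokenizer.get_affected_recids(modified, index_name)
    return modified | affected


result = recids_to_reindex(
    ExampleTokenizer, (datetime(2014, 2, 1), datetime(2014, 4, 1)), "author"
)
```

The engine only calls the two methods and merges their results, so any index-specific rule (BibAuthorID tables, text_extraction_date and so on) would stay inside the tokenizer that needs it.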