Closed tiborsimko closed 10 years ago
Originally on 2011-11-23
+1.
I would also like to request that as part of this ticket, a small utility be written for dumping and loading the index configuration from/to the database without using the web tools, so that (for example) the index configs can be kept in a git repository.
Originally on 2011-11-24
Replying to [comment:1 jblayloc]:
> I would also like to request that as part of this ticket, a small utility be written for dumping and loading the index configuration from/to the database without using the web tools, so that (for example) the index configs can be kept in a git repository.
That's part of another project (inveniocfg dumper/loader) which touches all modules and which uses SQLAlchemy models for this. (Transforming index definitions into CONF-style files for easier editing.) So this will be taken care of there, exactly with your use case in mind.
Originally on 2012-03-26
Note that this ticket supersedes Savannah task #7572.
Originally on 2012-06-13
Example of `get_words_from_foo()` with inter-subfield dependencies. It is necessary to index DOIs that are stored as, say:
024 7_ $a 10.1234/5678 $2 doi
024 7_ $a urn:isbn:0451450523 $2 urn
so `get_doi_from_record()` should select only those 0247_a values where 0247_2 is equal to "doi".
(An example of a virtual field.)
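A minimal sketch of the selection rule described above. The record layout (a list of dicts with `tag`, `ind1`, `ind2` and `subfields` keys) is a hypothetical stand-in for illustration, not the actual Invenio record structure; only the function name `get_doi_from_record()` comes from this ticket.

```python
def get_doi_from_record(record):
    """Return the $a values of 024 7_ fields whose $2 subfield equals "doi".

    `record` is a hypothetical structure: a list of field instances, each a
    dict holding the tag, both indicators and a list of (code, value) pairs.
    """
    dois = []
    for field in record:
        if (field["tag"], field["ind1"], field["ind2"]) != ("024", "7", " "):
            continue
        subfields = field["subfields"]
        # The dependency between subfields: $a counts only if $2 says "doi".
        if any(code == "2" and value == "doi" for code, value in subfields):
            dois.extend(value for code, value in subfields if code == "a")
    return dois


# The two field instances from the example above.
record = [
    {"tag": "024", "ind1": "7", "ind2": " ",
     "subfields": [("a", "10.1234/5678"), ("2", "doi")]},
    {"tag": "024", "ind1": "7", "ind2": " ",
     "subfields": [("a", "urn:isbn:0451450523"), ("2", "urn")]},
]
print(get_doi_from_record(record))  # ['10.1234/5678']
```

The point of the sketch is that the check has to be made per field instance: collecting all 0247_a values and all 0247_2 values separately would lose the pairing between them.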
Originally on 2012-08-15
Another note: for the fulltext index, it would be useful to be able to distinguish, among files of the same format, those that are to be fulltext-indexed from those that are not, say depending on doctype. Example: file "a.pdf" of doctype "Preprint" would be fulltext-indexable, but file "b.pdf" of doctype "handwritten notes" would not be. So we could have a whitelist or blacklist of doctypes for every pdf->text process. Moreover, doctype is only one criterion, used here to illustrate the issue; we could have more.
Currently we have `CFG_BIBINDEX_FULLTEXT_INDEX_LOCAL_FILES_ONLY`. We could extend its type to have something like:
CFG_BIBINDEX_FULLTEXT_INDEX_LOCAL_FILES_ONLY = {
    'pdf': {
        'doctype': ['preprint', 'ebook'],
        'docname': [r'^SCAN'],
    },
    'ppt': {
        'docname': '*',
    },
}
See also `CFG_BIBINDEX_PERFORM_OCR_ON_DOCNAMES`.
(Well, not in the form of CFG variables, but migrated to the DB/conf, like the rest.)
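A rough sketch of how such per-format rules could be evaluated. Everything here is an assumption made for illustration: the names `FULLTEXT_RULES` and `is_fulltext_indexable()`, the choice that formats without a rule are not indexed, that all criteria listed for a format must match, and that `docname` entries are regular expressions. The proposal above does not settle any of these points.

```python
import re

# Hypothetical rule set in the spirit of the proposal: per file format,
# a whitelist of doctypes and/or docname patterns; '*' accepts anything.
FULLTEXT_RULES = {
    "pdf": {"doctype": ["preprint", "ebook"]},
    "ppt": {"docname": "*"},
}


def is_fulltext_indexable(fmt, doctype, docname, rules=FULLTEXT_RULES):
    """Decide whether a file should be fulltext-indexed.

    Assumptions: a format with no rule is never indexed, and every
    criterion listed for a format must be satisfied.
    """
    rule = rules.get(fmt.lower())
    if rule is None:
        return False
    values = {"doctype": doctype, "docname": docname}
    for criterion, allowed in rule.items():
        value = values.get(criterion, "")
        if allowed == "*":
            continue
        if criterion == "doctype":
            # Doctypes are compared case-insensitively against the whitelist.
            if value.lower() not in allowed:
                return False
        elif not any(re.search(pattern, value) for pattern in allowed):
            # Other criteria are treated as lists of regular expressions.
            return False
    return True


print(is_fulltext_indexable("pdf", "Preprint", "a"))           # True
print(is_fulltext_indexable("pdf", "handwritten notes", "b"))  # False
```

A blacklist variant, or OR semantics between criteria, would be equally easy to express; which one is wanted is a design decision left open here.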
Originally on 2013-10-03
The last iteration looks all good, doing minor changes and merging...
Originally by pglauner on 2013-10-03
In b6c6a0d3dfc4e4360b835c945b7d3fa2bd16a3cc:
#CommitTicketReference repository="invenio" revision="b6c6a0d3dfc4e4360b835c945b7d3fa2bd16a3cc"
BibIndex: centralisation of synonym treatment
- Removes CFG_BIBINDEX_SYNONYM_KBRS variable and moves the per-index
synonym definitions to the database. Adapts BibIndex Admin
interface accordingly. (addresses #852)
Co-authored-by: Grzegorz Szpura <grzegorz.szpura@cern.ch>
Signed-off-by: Grzegorz Szpura <grzegorz.szpura@cern.ch>
Tested-by: Tibor Simko <tibor.simko@cern.ch>
Originally on 2013-10-03
In 0a0c87e59ed6a09c054f57283d56906182a20f8f:
#CommitTicketReference repository="invenio" revision="0a0c87e59ed6a09c054f57283d56906182a20f8f"
BibIndex: centralisation of stopword treatment
- Moves CFG_BIBINDEX_REMOVE_STOPWORDS to database to idxINDEX table.
Adapts admin interface accordingly. (addresses #852)
- Additionally, fixes a small bug for modifysynonymkb function
in the BibIndex Admin interface.
Signed-off-by: Grzegorz Szpura <grzegorz.szpura@cern.ch>
Co-authored-by: Patrick Glauner <patrick.oliver.glauner@cern.ch>
Tested-by: Tibor Simko <tibor.simko@cern.ch>
Originally on 2013-10-03
In 0bf927bd7e2c0cb17b6c8317a689da78ce050a9e:
#CommitTicketReference repository="invenio" revision="0bf927bd7e2c0cb17b6c8317a689da78ce050a9e"
BibIndex: centralisation of LaTeX/HTML treatment
- Moves CFG_BIBINDEX_LATEX_MARKUP, CFG_BIBINDEX_HTML_MARKUP
to the database to the idxINDEX table. Adapts admin interface.
(addresses #852)
Signed-off-by: Grzegorz Szpura <grzegorz.szpura@cern.ch>
Co-authored-by: Patrick Glauner <patrick.oliver.glauner@cern.ch>
Tested-by: Tibor Simko <tibor.simko@cern.ch>
Originally on 2013-10-03
In b0a11546ca41eebbf5e9c2f3efd8240d20600f2f:
#CommitTicketReference repository="invenio" revision="b0a11546ca41eebbf5e9c2f3efd8240d20600f2f"
BibIndex: centralisation of tokenizers
- Introduces tokenizers for journal, year, fulltext and authorcount indexes.
Merges old tokenizing functions for pairs, phrases and words into
one 'BibIndexDefaultTokenizer'. Introduces empty tokenizer.
Extends admin interface to support tokenizers. (closes #852)
Signed-off-by: Grzegorz Szpura <grzegorz.szpura@cern.ch>
Tested-by: Tibor Simko <tibor.simko@cern.ch>
Originally on 2013-10-03
In ab532de38f71b90c8af8ddbda9ba6fabcd42c674:
#CommitTicketReference repository="invenio" revision="ab532de38f71b90c8af8ddbda9ba6fabcd42c674"
BibIndex: pluginutils for tokenizers
- CFG_BIBINDEX_PATH_TO_STOPWORDS_FILE now is a
CFG_BIBRANK_PATH_TO_STOPWORDS_FILE.
* Indexes have their own path to stopwords file.
* Tokenizers are loaded with pluginutils.
(references #852)
Signed-off-by: Grzegorz Szpura <grzegorz.szpura@cern.ch>
Tested-by: Tibor Simko <tibor.simko@cern.ch>
Originally by pglauner on 2013-11-26
The five commits listed above (b6c6a0d3dfc4e4360b835c945b7d3fa2bd16a3cc, 0a0c87e59ed6a09c054f57283d56906182a20f8f, 0bf927bd7e2c0cb17b6c8317a689da78ce050a9e, b0a11546ca41eebbf5e9c2f3efd8240d20600f2f, ab532de38f71b90c8af8ddbda9ba6fabcd42c674) were referenced on this ticket again, with the same commit messages.
Originally on 2011-11-23
Currently, Invenio indexes can be configured in several ways:
a. Some configurations are done at runtime, per index, in the index DB table (`idxINDEX`), e.g. stemming language.
b. Some configurations are done in `invenio.conf` per index, e.g. index-time synonym lists (`CFG_BIBINDEX_SYNONYM_KBRS`).
c. Some configurations are done in `invenio.conf` globally for all indexes, e.g. stop word lists (`CFG_BIBINDEX_REMOVE_STOPWORDS`, `CFG_BIBINDEX_PATH_TO_STOPWORDS_FILE`).
d. Some configurations are hard-coded in the source code, e.g. the fuzzy author name tokenizer (`BibIndexFuzzyNameTokenizer`) is used for author indexes via a hard-coded check for the index name (e.g. `firstauthor`).
e. Some configurations are hard-coded in the source code with arguments, e.g. the journal index uses `get_words_from_journal_tag()` with format standardisation depending on `CFG_JOURNAL_PUBINFO_STANDARD_FORM`.
The goal of this ticket is to harmonise and centralise the configurations into the DB table to make all of the above features configurable per index. This means, roughly speaking, enlarging the `idxINDEX` table with new columns so that not only stemming, but also the tokenizer method for words and phrases, the stopword list, the synonym list, etc., could be defined at runtime by manipulating the DB table, without touching the source code or `invenio.conf`.
The BibIndex Admin interface should be enriched accordingly.
The work will bring various refactoring tasks such as separation of the various `get_words_from_foo()` functions, taking advantage of the `pluginutils.py` library.
P.S. This ticket will have several sequels, e.g. about defining index type (native Invenio, Solr, Xapian) or about defining virtual logical fields that would gather information for indexing from non-MARC, non-fulltext sources (e.g. from the cataloguer log tables). These will be ticketised separately.