medialab / hyphe2solr

principles

This script needs:

What it does:

dependencies

HYPHE

This script relies on a running Hyphe server. See https://github.com/medialab/Hypertext-Corpus-Initiative

SOLR

This script relies on a running Solr server. See https://lucene.apache.org/solr/

python requirements

sunburnt
lxml
httplib2
pymongo
jsonrpclib
argparse (for python < 2.7)

INSTALL

You need a running Hyphe server and a running Solr server.

Clone this repository with git.

Then simply execute (ideally in a virtualenv):

pip install -r requirements.txt

CONFIGURE

hyphe SOLR schema

Use the Solr node example provided in the solr_hyphe_core directory. The script deploy_solr_core.sh might help you. Change the Solr core path and the tomcat user/service (which depend on your install) in the script before using it. BEWARE: it will erase any hyphe core already present in the Solr core path.

You should review the script before using it.

connection to data sources

Copy config.json.default to config.json and edit the parameters:

Mime-type filter

Hyphe2solr lets you filter out web pages whose mimetype is not compatible with Solr indexing (our schema doesn't use Tika). The script generate_content_filter.py reads the mongodb (version >2.1 only) and outputs a CSV listing the content types, ordered by the number of pages found in the mongo. From this CSV you have to write the content_type_whitelist.txt file. This file must contain one mimetype (to be indexed) per line. An example is provided: content_type_whitelist.txt.default
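The filtering logic described above can be sketched in a few lines of Python. This is only an illustration, not the code of generate_content_filter.py or of the indexer. The function names, the "content_type" field name, the CSV column names and the choice to ignore parameters such as "; charset=utf-8" are all assumptions made for the example.

```python
import csv
from collections import Counter


def count_content_types(pages):
    """Return (content_type, nb_pages) pairs, most frequent first.

    `pages` is any iterable of dicts; the "content_type" key is a
    hypothetical field name used for this sketch.
    """
    counts = Counter(page.get("content_type", "") for page in pages)
    return counts.most_common()


def write_counts_csv(counts, stream):
    """Write the counts as CSV (column names are assumptions)."""
    writer = csv.writer(stream)
    writer.writerow(["content_type", "nb_pages"])
    writer.writerows(counts)


def parse_whitelist(lines):
    """Build the whitelist from lines holding one mimetype each."""
    return {line.strip().lower() for line in lines if line.strip()}


def is_indexable(content_type, whitelist):
    """Tell whether a page with this content type should be indexed."""
    # Assumption: parameters after ";" are ignored when comparing.
    mimetype = content_type.split(";")[0].strip().lower()
    return mimetype in whitelist
```

With such helpers, the whitelist could be loaded with `parse_whitelist(open("content_type_whitelist.txt"))` and each page tested with `is_indexable` before being sent to Solr.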

usage

Once you have prepared the configuration, simply run:

$ python index_hyphe_web_pages.py

There is only one option, which deletes the existing index before (re)indexing:

$ python index_hyphe_web_pages.py -h
usage: index_hyphe_web_pages.py [-h] [-d]

optional arguments:
  -h, --help          show this help message and exit
  -d, --delete_index  delete solr index before (re)indexing. WARNING all
                      previous indexing work will be lost.
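
For readers who want to see how such a command line is typically declared, here is a minimal sketch assuming the script uses argparse (which is listed in the requirements). The `build_parser` function name is hypothetical and the actual script may be organised differently.

```python
import argparse


def build_parser():
    """Declare the single -d/--delete_index flag shown in the help."""
    parser = argparse.ArgumentParser(prog="index_hyphe_web_pages.py")
    parser.add_argument(
        "-d", "--delete_index",
        action="store_true",  # False unless the flag is given
        help="delete solr index before (re)indexing. WARNING all "
             "previous indexing work will be lost.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.delete_index)
```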

If index_hyphe_web_pages.py is called multiple times without the -d|--delete_index option, the indexing process skips the web entities listed by id in logs/we_id_done.log. The default behaviour is thus to resume any previous unfinished indexing.
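
The resume behaviour can be pictured with the following sketch. It assumes that the log holds one web entity id per line and that a web entity is a dict with an "id" key. Both points, as well as the function names, are assumptions made for illustration and are not taken from the script itself.

```python
import os


def load_done_ids(lines):
    """Collect ids from log lines, assuming one id per line."""
    return {line.strip() for line in lines if line.strip()}


def read_done_ids(path):
    """Read the ids already indexed; no log file means nothing is done."""
    if not os.path.exists(path):
        return set()
    with open(path) as log_file:
        return load_done_ids(log_file)


def pending_web_entities(web_entities, done_ids):
    """Keep only the web entities whose id is not in the log."""
    return [we for we in web_entities if str(we["id"]) not in done_ids]
```

A run without -d would then index only `pending_web_entities(all_web_entities, read_done_ids("logs/we_id_done.log"))`, where `all_web_entities` stands for whatever list the script retrieves from Hyphe.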

logs

Hyphe2solr logs into 3 log directories:

Hyphe2solr outputs the ids of indexed web entities in:

When the -d or --delete_index option is used, the script clears all the logs.