A simple Python/Django search multiplexing backend for use in a bento-style frontend for GW Libraries.
Developed using Python 2.7, Django 1.8, and PostgreSQL 9.3 on Ubuntu 14.04.
The current production Obento endpoint is at http://gwbento-prod.wrlc.org:8080 . This can only be accessed directly from within the GW network.
Query API endpoints are as follows:
/
- returns a results page with six boxes of results (seven if Best Bets contains a result)
/articles_html
- returns an HTML page with only the articles results
/articles_json
- returns article results as JSON
/books_media_html
- returns an HTML page with only the books & media results (wrapper for /launchpad_html)
/books_media_json
- returns books & media results as JSON (wrapper for /launchpad_json)
/databases_solr_html
- returns an HTML page with database results retrieved by querying Solr
/databases_solr_json
- returns database results retrieved by querying Solr, as JSON
/journals_solr_html
- returns an HTML page with journal title results retrieved by querying Solr
/journals_solr_json
- returns journal title results retrieved by querying Solr, as JSON
/launchpad_html
- returns an HTML page with launchpad query results
/launchpad_json
- returns launchpad query results as JSON
/summon_html
- returns an HTML page with Summon query results with a scope of all as defined in settings
/summon_json
- returns Summon query results as JSON, with a scope of all as defined in settings
/summon_books_media_html
- returns an HTML page with Summon query results with a scope of books_media as defined in settings
/summon_books_media_json
- returns Summon query results as JSON, with a scope of books_media as defined in settings
/research_guides_html
- returns an HTML page with Summon query results with a scope of research_guides as defined in settings
/research_guides_json
- returns Summon query results as JSON, with a scope of research_guides as defined in settings
/best_bets_html
- returns an HTML page with Summon query results with a scope of best_bets as defined in settings
/best_bets_json
- returns Summon query results as JSON, with a scope of best_bets as defined in settings
/libsite_html
- returns an HTML page with library website query results
/libsite_json
- returns library website query results as JSON
Request parameters for the Query endpoints are as follows:
q (optional): query string, e.g. http://gwbento-prod.wrlc.org:8080?q=computer+science
count (optional): the number of results to return in each box, e.g. http://gwbento-prod.wrlc.org:8080?q=computer+science&count=4 . This overrides the default value configured in local_settings.py.
ignoresearch (optional): if true, do not record the query in the Searches table, e.g. http://gwbento-prod.wrlc.org:8080?q=computer+science&ignoresearch=true . If this parameter is not specified, the query will be recorded.
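As a rough illustration of how these parameters combine, the sketch below assembles a request URL for any of the query endpoints. The helper name build_query_url is hypothetical, the base URL is simply the one used in the examples above, and the ignoresearch spelling follows the parameter name as listed; treat all three as assumptions rather than part of Obento itself.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7

# Base URL taken from the examples above; substitute your own deployment.
BASE_URL = "http://gwbento-prod.wrlc.org:8080"


def build_query_url(endpoint="/", q=None, count=None, ignoresearch=False,
                    base_url=BASE_URL):
    """Return a full URL for one of the query endpoints (hypothetical helper)."""
    params = []
    if q is not None:
        params.append(("q", q))
    if count is not None:
        params.append(("count", str(count)))
    if ignoresearch:
        # Only sent when requested; omitting it means the query is recorded.
        params.append(("ignoresearch", "true"))
    url = base_url.rstrip("/") + "/" + endpoint.lstrip("/")
    if params:
        url += "?" + urlencode(params)
    return url


print(build_query_url("/articles_json", q="computer science", count=4))
```

Building the query string with urlencode rather than by hand keeps spaces and special characters in the search terms correctly escaped.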
/searches
- returns an HTML page containing:
Request parameters for the Searches endpoint are as follows:
A "-" prefix indicates reverse sort. Sorting can also be accomplished by clicking on the column header. Values may include:
q / -q: sort by query text
date_searched / -date_searched
articles_count / -articles_count
books_count / -books_count
database_count / -database_count
journals_count / -journals_count
researchguides_count / -researchguides_count
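The prefix convention can be sketched in a few lines of Python. This is only an illustration of how a sort value splits into a field name and a direction; the function name parse_sort is hypothetical and this is not taken from Obento's own view code.

```python
# Field names copied from the list above.
SORT_FIELDS = (
    "q", "date_searched", "articles_count", "books_count",
    "database_count", "journals_count", "researchguides_count",
)


def parse_sort(value):
    """Split a sort value into (field, reverse); a leading '-' means reverse."""
    reverse = value.startswith("-")
    field = value[1:] if reverse else value
    if field not in SORT_FIELDS:
        raise ValueError("unknown sort field: %r" % value)
    return field, reverse
```

For example, parse_sort("-date_searched") yields the field date_searched with the reverse flag set, while parse_sort("q") yields q in normal order.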
Install Apache, OpenJDK8 and other dependencies
$ sudo apt-get install apache2 libapache2-mod-wsgi libaio-dev python-dev python-profiler postgresql postgresql-contrib libpq-dev git libxml2-dev libxslt-dev python-setuptools python-virtualenv
$ sudo add-apt-repository ppa:openjdk-r/ppa
$ sudo apt-get update
$ sudo apt-get install openjdk-8-jdk
Install Chrome and ChromeDriver (needed for loading and scraping the libguides databases list)
$ wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
$ sudo dpkg -i google-chrome-stable_current_amd64.deb
If you encounter errors related to missing dependencies, then use:
$ sudo apt-get -f install
Now proceed with ChromeDriver installation:
$ sudo apt-get install unzip
$ wget https://chromedriver.storage.googleapis.com/2.38/chromedriver_linux64.zip
$ unzip chromedriver_linux64.zip
$ sudo mv chromedriver /usr/local/bin
$ sudo chown root:root /usr/local/bin/chromedriver
Prepare Java JVM symlink for Jetty
Create a symlink to the java jvm
$ sudo mkdir /usr/java
$ sudo ln -s /usr/lib/jvm/java-8-openjdk-amd64 /usr/java/default
Download Jetty and unzip.
$ cd /opt
Go to http://www.eclipse.org/jetty/download.html and copy the link to the .tar.gz version of the latest download of Jetty 9. Use this link in the following wget command to download the .tar.gz file (again, the URL may change):
$ sudo wget -O jetty.gz "http://central.maven.org/maven2/org/eclipse/jetty/jetty-distribution/9.4.2.v20170220/jetty-distribution-9.4.2.v20170220.tar.gz"
$ sudo mkdir jetty
$ sudo tar -xvf jetty.gz -C jetty --strip-components=1
Create jetty user and make it the owner of /opt/jetty
$ sudo useradd jetty -U -s /bin/false
$ sudo chown -R jetty:jetty /opt/jetty
Set up jetty to run as a service
$ sudo cp /opt/jetty/bin/jetty.sh /etc/init.d/jetty
Create the jetty settings file
$ sudo vi /etc/default/jetty
Paste the following into the file, and save it:
JAVA=/usr/bin/java
NO_START=0 # Start on boot
JETTY_HOST=0.0.0.0 # Listen to all hosts
JETTY_ARGS=jetty.port=8983
JETTY_USER=jetty # Run as this user
JETTY_HOME=/opt/jetty
In production, jetty should be running on a port that won't be publicly exposed. In development and testing, exposing Solr might be helpful; never expose it in production.
NOTE: In the step above, JAVA is set to /usr/bin/java. When upgrading from an environment that had Java 7 installed, /usr/bin/java may be a symbolic link (...to another symbolic link) which still points to a Java 7 JRE. If that is the case, reconfigure to ensure that either /usr/bin/java resolves to a Java 8 JRE, or point JAVA in the jetty config file to wherever the Java 8 JRE is.
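One way to see where /usr/bin/java finally points is to resolve the whole symlink chain. The sketch below does this with the Python standard library; the function name resolve_executable is hypothetical.

```python
import os


def resolve_executable(path="/usr/bin/java"):
    """Follow every symlink in the chain and return the final target path."""
    return os.path.realpath(path)


print(resolve_executable())
```

If the printed path does not sit under a Java 8 directory such as /usr/lib/jvm/java-8-openjdk-amd64, either reconfigure the link or point JAVA in /etc/default/jetty at the Java 8 binary directly.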
Start jetty
$ sudo service jetty start
This should return something that starts with:
Starting Jetty: OK
A possible cause for a failed Jetty start is that /var/run/jetty and its contents may need to be owned by the jetty user. To set jetty:jetty as the owner, use sudo chown jetty:jetty /var/run/jetty
Verify that MYSERVER:8983 returns a page that is "Powered by Jetty" (even if it is a 404-Not Found page)
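The same check can be scripted instead of done in a browser. The sketch below is a minimal example under stated assumptions: the helper names are hypothetical, and it relies on the page text containing the phrase "Powered by Jetty" as described above. Note that a 404 response still carries a readable body, so the error is caught and read rather than treated as a failure.

```python
try:
    from urllib.request import urlopen  # Python 3
    from urllib.error import HTTPError
except ImportError:
    from urllib2 import urlopen, HTTPError  # Python 2.7


def is_jetty_page(body):
    """Return True if the page text carries the 'Powered by Jetty' marker."""
    return "powered by jetty" in body.lower()


def fetch_body(url, timeout=10):
    """Return the response body as text, even for error statuses such as 404."""
    try:
        response = urlopen(url, timeout=timeout)
    except HTTPError as err:
        # HTTPError is file-like, so the 404 page can still be read.
        response = err
    return response.read().decode("utf-8", "replace")


# Example (MYSERVER is the same placeholder used above):
# print(is_jetty_page(fetch_body("http://MYSERVER:8983/")))
```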
Add jetty to startup
$ sudo update-rc.d jetty defaults
Download and unzip solr
Go to http://archive.apache.org/dist/lucene/solr/4.10.4/ and copy the link to the .tgz version of Solr 4.10.4. Use this link in the following wget command to download the .tgz file (again, the URL may change). Depending on the download site, this may also require the --no-check-certificate option:
$ sudo wget -O solr.gz "http://archive.apache.org/dist/lucene/solr/4.10.4/solr-4.10.4.tgz"
$ sudo tar -xvf solr.gz
Copy solr contents:
$ sudo cp -r solr-4.10.4/example/solr /opt
$ sudo cp -r solr-4.10.4/dist /opt/solr
$ sudo cp -r solr-4.10.4/contrib /opt/solr
Copy ICU Tokenizer jars to /opt/solr/lib
$ sudo mkdir /opt/solr/lib
$ sudo cp /opt/solr/contrib/analysis-extras/lib/icu4j-*.jar /opt/solr/lib
$ sudo cp /opt/solr/contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-* /opt/solr/lib
Copy solr .war and .jar files to jetty
$ sudo cp /opt/solr/dist/solr-4.10.4.war /opt/jetty/webapps/solr.war
$ sudo cp solr-4.10.4/example/lib/ext/* /opt/jetty/lib/ext
Ensure that these are now owned by jetty:
$ sudo chown -R jetty:jetty /opt/jetty
Update jetty settings
$ sudo vi /etc/default/jetty
Append the following line:
JAVA_OPTIONS="-Dsolr.solr.home=/opt/solr $JAVA_OPTIONS"
Change the owner of the solr folder and contents to jetty
$ sudo chown -R jetty:jetty /opt/solr
Change collection1 in solr to obento:
$ cd /opt/solr
$ sudo mv collection1 obento
Replace name=collection1 with name=obento in core.properties:
$ sudo vi obento/core.properties
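If you would rather script this edit than make it by hand, the replacement could look roughly like the sketch below. The function names are hypothetical, and the script needs to run with enough privileges to write to the file.

```python
import re


def rename_core(properties_text, old="collection1", new="obento"):
    """Rewrite the name=<old> line of a core.properties file to name=<new>."""
    pattern = re.compile(
        r"^name[ \t]*=[ \t]*%s[ \t]*$" % re.escape(old), re.MULTILINE)
    return pattern.sub("name=%s" % new, properties_text)


def rename_core_file(path):
    """Apply rename_core to the file at path, in place."""
    with open(path) as handle:
        text = handle.read()
    with open(path, "w") as handle:
        handle.write(rename_core(text))


# Example: rename_core_file("/opt/solr/obento/core.properties")
```

Matching the whole line, rather than replacing the word collection1 wherever it occurs, leaves any other properties and comments in the file untouched.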
Restart jetty
$ sudo service jetty restart
Create a directory for your projects (replace <OBENTO_HOME> with your desired directory path and name: for instance /obento or /home/<username>/obento)
$ mkdir <OBENTO_HOME>
$ cd <OBENTO_HOME>
Pull down the project from github
(GW staff only)
$ git clone git@github.com:gwu-libraries/obento.git
(everyone else)
$ git clone https://github.com/gwu-libraries/obento.git
Create virtual Python environment for the project
$ cd <OBENTO_HOME>/obento
$ virtualenv --no-site-packages ENV
Activate your virtual environment
$ source ENV/bin/activate
Upgrade to the latest pip and install project dependencies
(ENV)$ pip install pip --upgrade
(ENV)$ pip install -r requirements.txt
If the previous step encounters problems installing pytz, then it can be installed as follows
easy_install --upgrade pytz
Create a database user for django (and make a note of the password you create). A name for MYDBUSER might be something like obentouser_m1 (m1 for milestone 1)
$ sudo -u postgres createuser --createdb --no-superuser --no-createrole --pwprompt MYDBUSER
Create a database for the obento application. A name for MYDBNAME might be something like obi_m1
$ sudo -u postgres createdb -O MYDBUSER MYDBNAME
Copy the local settings template to an active file
$ cd <OBENTO_HOME>/obento/obi/obi
$ cp local_settings.py.template local_settings.py
Update the values in the local_settings.py file: for the database, set NAME, USER, and PASSWORD to match the database you created above, and set ENGINE to 'postgresql_psycopg2'; also, set a SECRET_KEY. Ensure that the port number in SOLR_URL matches the jetty.port value (set through JETTY_ARGS) configured earlier in /etc/default/jetty. Provide a LOG_FILE_PATH and create a corresponding logs directory in the obento top directory.
$ vi local_settings.py
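For orientation, the edited values might end up looking something like the fragment below. This is a hedged sketch, not the contents of the shipped template: the exact shape of SOLR_URL and LOG_FILE_PATH is an assumption, every value is a placeholder, and the ENGINE is written with the full django.db.backends prefix because Django 1.8 expects a dotted path. Follow the template wherever it differs.

```python
# Hypothetical local_settings.py values; MYDBNAME, MYDBUSER and the other
# capitalised or bracketed names are placeholders from the steps above.
DATABASES = {
    "default": {
        # Django 1.8 expects the full dotted path to the backend.
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "MYDBNAME",
        "USER": "MYDBUSER",
        "PASSWORD": "MYDBPASSWORD",
        "HOST": "localhost",
        "PORT": "",
    }
}

SECRET_KEY = "replace-with-a-long-random-string"

# Assumed URL shape: host, the jetty.port value (8983), and the obento core.
SOLR_URL = "http://localhost:8983/solr/obento"

# Assumed location: a file inside a logs directory under the obento top directory.
LOG_FILE_PATH = "<OBENTO_HOME>/obento/logs/obento.log"
```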
Copy the WSGI file template to an active file
$ cp wsgi.py.template wsgi.py
Update the wsgi.py file: uncomment the virtualenv settings starting with "import site" and change the value of ENV to your environment path.
$ vi wsgi.py
Initialize database tables. WARNING: Be sure you are still using your virtualenv. DO NOT create a superuser when prompted!
(ENV)$ cd <OBENTO_HOME>/obento/obi
(ENV)$ python manage.py migrate
If you encounter an authentication error with PostgreSQL, edit your local_settings.py file and set HOST = 'localhost'.
If you encounter an error during the above command that ends with:
TypeError: decode() argument 1 must be string, not None
Then you need to add locale values to your profile. Open your .bashrc file in an editor:
$ vim ~/.bashrc
Enter the following values at the end of the file and save.
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
Now, reload your bashrc changes:
$ source ~/.bashrc
Now, rerun the migrate command.
(ENV)$ python manage.py migrate
Copy the Apache virtual host file to the Apache2 directory
$ cd <OBENTO_HOME>/obento
$ sudo cp apache/obento /etc/apache2/sites-available/obento.conf
Configure self-signed SSL certificates. Refer to https://github.com/gwu-libraries/SSL_HowTo
If you choose to run obento in apache (versus django runserver):
Update the values in the Apache virtual host file.
- Edit the host port number
- Edit your server name (base url)
- Edit the many instances of <path to OBENTO_HOME>. Beware: the line for the WSGI Daemon has two references to that path.
$ sudo vi /etc/apache2/sites-available/obento.conf
To change all of the path values at once use the global replace command in vim
:%s/old_value/new_value/g
Enable the Apache headers module; this is required for CORS support.
$ sudo a2enmod headers
Enable the new virtualhost. If you are using port 80, also disable the default host (named 000-default on Ubuntu 14.04's Apache 2.4)
$ sudo a2ensite obento
$ sudo a2dissite 000-default
$ sudo service apache2 restart
Copy the obento solr configuration files to solr
$ sudo cp -r <OBENTO_HOME>/obento/obi/obi/conf /opt/solr/obento/
Restart jetty
$ sudo service jetty restart
To load GW's list of databases from libguides, first configure local_settings.py with a list of libguides page sids.
Then, to load/parse/add databases from these pages to the database:
$ ./manage.py load_databases
To verify that the databases loaded, try querying the html or json view:
http://<OBENTO_URL>/databases_html?q=proquest
http://<OBENTO_URL>/databases_json?q=proquest
To index the list of databases in Solr:
$ ./manage.py index_databases
Test that indexing worked with these paths:
http://<OBENTO_URL>/databases_solr_html?q=proquest
http://<OBENTO_URL>/databases_solr_json?q=proquest
The results should look different from the test above.
To load the Excel-formatted extract of journal titles:
$ ./manage.py load_journals <JOURNALS_EXCEL_FILE>
To verify that the journal titles loaded, try querying the html or json view:
http://<OBENTO_URL>/journals_html?q=science
http://<OBENTO_URL>/journals_json?q=science
To index the list of journals in Solr:
$ ./manage.py index_journals
Test that indexing worked with these paths:
http://<OBENTO_URL>/journals_solr_html?q=science
http://<OBENTO_URL>/journals_solr_json?q=science
The results should look different from the test above.
You may also wish to have a cron job reload (and reindex) the databases list on a regular basis. To accomplish this, you can add a line in your crontab similar to this:
0 2 * * * <PATH TO YOUR APP>/obento/ENV/bin/python <PATH TO YOUR APP>/obento/obi/manage.py load_databases && <PATH TO YOUR APP>/obento/ENV/bin/python <PATH TO YOUR APP>/obento/obi/manage.py index_databases
This would run load_databases, then index_databases every night at 2:00 A.M.
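The "&&" in the crontab line means the index step only runs when the load step succeeds. If you prefer to keep that logic in a script that cron calls, a minimal Python sketch could look like this. The function name run_in_order and the NIGHTLY_COMMANDS list are hypothetical, and manage.py is assumed to be in the working directory.

```python
import subprocess
import sys


def run_in_order(commands):
    """Run each command in turn; stop at the first failure, like '&&'."""
    for argv in commands:
        status = subprocess.call(argv)
        if status != 0:
            return status
    return 0


# Assumes the script is started with the virtualenv's python, from the
# directory that contains manage.py.
NIGHTLY_COMMANDS = [
    [sys.executable, "manage.py", "load_databases"],
    [sys.executable, "manage.py", "index_databases"],
]

# To use from cron: sys.exit(run_in_order(NIGHTLY_COMMANDS))
```

Returning the first non-zero exit status lets cron, or any wrapper around it, tell which run failed.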