An open-source medical search engine
- Download Apache Tomcat and Apache Solr 4.X.
- Copy example/solr to your SOLR_HOME and replace solrconfig.xml, schema.xml, and synonyms.txt in collection1.
- Copy solr_configuration/solr.xml to CATALINA_HOME/conf/Catalina/localhost and edit it to point to your SOLR_HOME and the Solr web app (see the sketch below).
- Copy all JARs from example/lib/ext into your container's main lib directory (CATALINA_HOME/lib).
- Copy solr_configuration/collection2 to SOLR_HOME/solr and use the Core Admin to add the core.
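For reference, a Tomcat context descriptor for Solr 4.x typically looks like the sketch below; the docBase and solr/home paths are placeholders and must match your own installation:

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- CATALINA_HOME/conf/Catalina/localhost/solr.xml -->
<!-- docBase points to the Solr WAR, solr/home to your SOLR_HOME (placeholder paths) -->
<Context docBase="/opt/solr-4.x/example/webapps/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/opt/solr_home" override="true"/>
</Context>
```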
The folder retrieval_and_indexing contains scripts for fetching external content, indexing it in Solr, and creating synonym mappings. The scripts are called from the command line (e.g., php start_crawl.php). Before running them, make sure the PHP cURL and SQLite modules and a Java JRE are installed (on Ubuntu simply run sudo apt-get install php5-curl php5-sqlite openjdk-8-jre).
Define the constants SOLR_URL and SOLR_URL_DIC in config.php to point to the two Solr cores (see the example below).
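A minimal config.php could look as follows; the URLs are placeholders, assuming Tomcat on port 8080 and cores named collection1 and collection2:

```php
<?php
// config.php -- placeholder URLs; adjust host, port, and core names to your setup
define('SOLR_URL',     'http://localhost:8080/solr/collection1'); // main index
define('SOLR_URL_DIC', 'http://localhost:8080/solr/collection2'); // dictionary core
```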
Optional:
- Run pubmed_oa_fetch.php to fetch OA article IDs from the PMC OA Web Service.
- Run pubmed_oa_remove_duplicates.php to remove duplicate IDs.
- Run pubmed_oa_insert_into_db.php to insert all IDs into a SQLite3 DB (see the sketch below).
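As a rough illustration of the last step, inserting de-duplicated IDs into SQLite3 from PHP might look like this; the database file, table, and input file names are assumptions, not the script's actual names:

```php
<?php
// Hypothetical sketch of pubmed_oa_insert_into_db.php's core loop;
// file and table names are illustrative assumptions.
$db = new SQLite3('pubmed_oa.db');
$db->exec('CREATE TABLE IF NOT EXISTS oa_ids (pmcid TEXT PRIMARY KEY)');
$stmt = $db->prepare('INSERT OR IGNORE INTO oa_ids (pmcid) VALUES (:pmcid)');
foreach (file('oa_id_list.txt', FILE_IGNORE_NEW_LINES) as $pmcid) {
    $stmt->bindValue(':pmcid', trim($pmcid), SQLITE3_TEXT);
    $stmt->execute();
    $stmt->reset();
}
$db->close();
```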
- Run pubmed_fetch.php: downloads all PubMed entries that are the result of a specific PubMed query. You can edit the script to change the query; the default query aims to cover all PubMed content relevant for medical decision making. (A request sketch follows below.)
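For orientation, a PubMed query of this kind can be issued through the NCBI E-utilities esearch endpoint; the query term below is a toy example, not the script's default:

```php
<?php
// Hypothetical sketch of an E-utilities search; the real default query in
// pubmed_fetch.php is broader.
$term = urlencode('hasabstract[text] AND "2015"[PDAT]');
$url  = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
      . '?db=pubmed&retmax=10000&term=' . $term;
$xml  = file_get_contents($url);   // XML list of matching PubMed IDs
file_put_contents('pubmed_ids.xml', $xml);
```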
- Run pubmed_xml_to_solr_xml.php: iterates through the PubMed XML files downloaded by pubmed_fetch.php, reads the PubMed entries, and writes the extracted content to the Solr index (see the sketch below).
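The write to the index amounts to a POST against Solr's XML update handler, roughly as sketched below; the field names are assumptions and must match the fields defined in schema.xml:

```php
<?php
// Hypothetical sketch of the Solr update request; field names are assumptions.
require 'config.php';
$doc = '<add><doc>'
     . '<field name="id">PMID-12345678</field>'
     . '<field name="title">Example article title</field>'
     . '<field name="content">Extracted abstract text ...</field>'
     . '</doc></add>';
$ch = curl_init(SOLR_URL . '/update?commit=true');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $doc);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
curl_close($ch);
```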
Optional (get a free API key and store it in the property file config.php):
- If a new list of MeSH Descriptors (preferred terms only) is available at http://www.nlm.nih.gov/mesh/filelist.html, download it to the wikipedia folder and run generate_new_english_terms.sh mshd20XX.txt.
- Run wikipedia_create_list_of_relevant_articles.php.
- Run wikipedia_remove_duplicates.php to remove duplicate IDs.
- Run wikipedia_langlinks_translate_de.php to use the Wikipedia langlinks as the German translations of the article titles.
- Run wikipedia_translate_de.php to translate the remaining article titles to German with Yandex (Powered by Yandex.Translate; see the sketch after the Spanish steps below).
- Run wikipedia_translate_de.php again if the translation limit was reached.
- Run wikipedia_langlinks_translate_es.php to use the Wikipedia langlinks as the Spanish translations of the article titles.
- Run wikipedia_translate_es.php to translate the remaining article titles to Spanish with Yandex (Powered by Yandex.Translate).
- Run wikipedia_translate_es.php again if the translation limit was reached.
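The translation steps boil down to calls against the Yandex.Translate HTTP API, roughly as sketched below; the constant YANDEX_API_KEY and the example title are assumptions, not names actually used by the scripts:

```php
<?php
// Hypothetical sketch of a single Yandex.Translate call as made by
// wikipedia_translate_de.php / wikipedia_translate_es.php.
require 'config.php';                      // assumed to hold the API key
$title = 'Myocardial infarction';          // illustrative article title
$url = 'https://translate.yandex.net/api/v1.5/tr.json/translate'
     . '?key=' . YANDEX_API_KEY            // constant name is an assumption
     . '&lang=en-de'                       // use en-es for Spanish
     . '&text=' . urlencode($title);
$response = json_decode(file_get_contents($url), true);
echo $response['text'][0];                 // e.g. "Herzinfarkt"
```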
- Run wikipedia_fetch.php to download the articles from Wikipedia.
- Run wikipedia_xml_to_solr_xml.php: iterates through the Wikipedia XML files downloaded by wikipedia_fetch.php, reads the Wikipedia entries, and writes the extracted content to the Solr index.
- Run wikipedia_translations_to_solr_xml.php: indexes the translations into the dictionary Solr core.
start_crawl.php: the entry point that starts the crawl (see the example command above).
create_synonyms_from_wikipedia.php: creates a synonym mapping (to improve the quality of search results) based on page redirects in Wikipedia. To do this, it queries the DBpedia server (an open database where content from Wikipedia can be queried). The script writes the synonyms to a file in the ./synonyms subfolder. This file must be placed into the conf directory of the Solr collection containing the index, e.g., as [SOLR_HOME]/collection1/conf/synonyms.txt. (A format example follows below.)
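For reference, Solr's synonym filter expects one comma-separated group of equivalent terms per line in synonyms.txt; the entries below are illustrative, not taken from the generated file:

```
myocardial infarction, heart attack
hypertension, high blood pressure
varicella, chickenpox
```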
clean_index.php: a simple script that removes documents from the index that match a given Solr query. It can be used to clean up unwanted content that slipped through in earlier indexing steps (see the sketch below).
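Such a clean-up is a delete-by-query request against the Solr update handler, roughly as sketched below; the query string is an illustrative example only:

```php
<?php
// Hypothetical sketch of the delete-by-query request clean_index.php performs;
// the query is illustrative only.
require 'config.php';
$ctx = stream_context_create(array('http' => array(
    'method'  => 'POST',
    'header'  => 'Content-Type: text/xml',
    'content' => '<delete><query>title:unwanted</query></delete>',
)));
file_get_contents(SOLR_URL . '/update?commit=true', false, $ctx);
```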
Set the log level and the name of the file to write to in www/logger_config.xml.