mitre / rhapsode


NOTE: Active development of this project has moved to https://github.com/tballison/rhapsode. The namespaces in the new repo and on maven central have been converted from 'org.mitre' to 'org.tallison'. This repository is no longer actively maintained.

Rhapsode

Advanced* desktop search/corpus exploration prototype

News

Initial release 0.3.2-BETA is now available.

Quick Start

Prerequisite: Java >= 8 needs to be installed and callable from the command line

1) Unzip the latest release.
2) Put documents to search in the "input" directory.
3) Run 01_buildIndex.(bat|sh).
4) Once that finishes, close out the command window and run 02_startRhapsodeDesktop.(bat|sh).
5) Leave the searcher command window open, open a browser and navigate to http://localhost:8092/rhapsode/admin/collection
6) Select "collection1" and click "Open".
7) Click on "Search Tools".

Enjoy!

Much more work remains. :)

Background

The vast majority of search -- web, site and intranet -- is focused on helping users find the most relevant document, the best piece of information or the best product to meet their need. Learning to Rank** and other machine learning methods are revolutionizing relevance ranking for intranet and site search.

There are other types of search, which I'll broadly categorize here as exploratory search, that don't appear to be well supported by some of the mainstream search tools. In exploratory search, the goal is to make sense of what is in a document set. For example, while it would be useful for a patent examiner to find the existing patents most relevant to the one under consideration, s/he really does need to go through all existing patents that contain related and relevant terms/concepts. Legal analysts, journalists, linguists, literary scholars and people in many other analytical fields often require tools for this type of search, and I list several good ones below.

Another key differentiator between traditional search and exploratory search is that exploratory search may include making sense of very long documents. While the three best snippets might be useful to determine if a document is relevant, it would be really useful for explorers to be able to see every time their search term appears even in lengthy documents -- with enough context, perhaps they don't even need to open the document.
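To make the "see every hit with context" idea concrete, here is a minimal keyword-in-context (concordance) sketch in Python. It is not Rhapsode's implementation, which is built on Lucene; the function names, the character-based window and the sample text are all assumptions made up for illustration.

```python
import re


def concordance(text, term, width=30):
    """Return one (left, hit, right) row for every occurrence of term.

    Hypothetical helper: matching is whole-word and case-insensitive,
    and the context window is measured in characters, not tokens.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    rows = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows


def format_rows(rows, width=30):
    """Right-align the left context so the hits line up in one column."""
    return [f"{left:>{width}} [{hit}] {right}" for left, hit, right in rows]


# Invented sample text standing in for one long document.
sample = ("The patent describes a valve. A second valve controls flow. "
          "Unlike prior art, the Valve assembly is sealed.")

for line in format_rows(concordance(sample, "valve", width=20), width=20):
    print(line)
```

Every occurrence gets its own row, so a reader can scan all hits in a long document without opening it. A real tool would also want sorting on the left or right context.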

Another differentiator is the user's interest and capability in crafting complex queries. In traditional search, thanks to Google, many intranet searchers don't even want to bother with double-quotes or boolean operators. In exploratory search, users (or knowledge managers behind the scenes) are willing to construct some pretty elaborate queries.

In traditional search, the system should help the user find "the right spelling", because authoritative/desired sites typically spell things correctly. In exploratory search, the user wants to find all variants, even in noisy OCR.
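As a rough illustration of "find all variants", the sketch below collects every token that is close to the query term instead of correcting to a single spelling. It uses the similarity ratio from Python's standard difflib module as a stand-in for whatever fuzzy matching a real engine would use; the function name, the 0.8 threshold and the sample OCR text are assumptions, not anything taken from Rhapsode.

```python
import re
from difflib import SequenceMatcher


def find_variants(text, term, threshold=0.8):
    """Count the tokens in text whose similarity to term is >= threshold.

    Hypothetical helper: similarity is difflib's ratio (0.0 to 1.0),
    compared on lowercased tokens.
    """
    term = term.lower()
    variants = {}
    for token in re.findall(r"\w+", text.lower()):
        score = SequenceMatcher(None, term, token).ratio()
        if score >= threshold:
            variants[token] = variants.get(token, 0) + 1
    return variants


# Invented sample with a misspelling and an OCR-style "rn" for "m" error.
ocr_text = ("The commitee met. The committee voted. "
            "The cornmittee adjourned. A comet passed.")

print(find_variants(ocr_text, "committee"))
```

The threshold is the knob an explorer would tune: lower it and more noise comes back, raise it and the noisier OCR variants drop out.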

Goal of Rhapsode

The goal of Rhapsode and of open-sourcing Rhapsode is not to corner the market for this type of search or even, frankly, to build a community around it.

The goal is to demonstrate the utility of the concordance as well as the results matrix in the hope that these ideas and code (?) might make it into other libraries and other tools.

As a first step, adoption into Apache Lucene/Solr and Elasticsearch would be great.

Other exploratory types of tools might also benefit from adopting some capabilities available in Rhapsode:

and... please help me fill out this list!

Search consultants and developers, such as Lucidworks, Basis Technology, Flax and OpenSourceConnections, might find these capabilities useful for specific (and likely rare) clients.

E-Discovery tools including...?

In short, Rhapsode is not the solution for exploratory search, but rather a prototype to communicate the ideas for others to adopt. Nevertheless, it can be useful on its own as is.

Features

Caveat

For too long, nested SpanQueries have been buggy. See LUCENE-7398 and please help solve that.

Documentation/References

See an initial draft of a User's Guide here.

See our upcoming JASIST article: "Collaborative Exploratory Search for Information Filtering and Large-Scale Information Triage". A free preprint of the article is available here and here. Note that the publisher’s version of the article is here for those with access to Wiley journals, or for those interested in viewing only the publication metadata.

License

Basically, Apache Software License 2.0.

Other

A rhapsode was a bard in ancient Greece, who wove together elements from tradition to tell a new(ish) story.
Exploratory searchers weave together disparate pieces of information to carry out analysis and develop new insights.

Notes

* Advanced -- well, right, no fancy deep learning with blockchain convnets, but some tools that are useful if you're trying to do more with a collection of documents than finding the best one for your need.

** Bloomberg's Learning to Rank module in Apache Solr, Doug Turnbull/OpenSource Connections' Elasticsearch LTR plugin and Lucidworks' Fusion platform, among others.

*** TFIDF of co-occurring terms is a low-cost way of identifying important collocations/co-occurrences. Currently looking into integrating word2vec...who isn't? :)
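The TFIDF note above can be sketched roughly as follows. This is a simplified illustration, not Rhapsode's code: the function names, the token window, the plain log(N/df) weighting and the tiny sample corpus are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def cooccurrence_tfidf(docs, target, window=2):
    """Rank terms that appear within `window` tokens of `target`.

    tf  = number of times a term falls inside a window around the target
    idf = log(N / df), where df is the number of documents containing the term
    """
    target = target.lower()
    tokenized = [tokenize(d) for d in docs]

    # Document frequency over the whole corpus.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Term frequency restricted to the windows around each hit.
    tf = Counter()
    for tokens in tokenized:
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                neighbors = tokens[lo:i] + tokens[i + 1:hi]
                tf.update(t for t in neighbors if t != target)

    n = len(docs)
    scores = {t: count * math.log(n / df[t]) for t, count in tf.items()}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Invented mini-corpus: "pressure" sits next to "valve" in three documents.
docs = [
    "the patent covers a pressure valve",
    "a pressure valve regulates the flow",
    "the pressure valve was sealed",
    "the court reviewed a filing",
    "the journalist read a report",
    "a linguist studied the corpus",
    "the examiner read a claim",
    "a scholar annotated the text",
]

for term, score in cooccurrence_tfidf(docs, "valve"):
    print(f"{term:10s} {score:.3f}")
```

Terms that show up near the target often but are rare across the corpus float to the top, while function words such as "the" sink because they occur in nearly every document.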