Search API - Githubissues

mckaydavis commented 7 years ago

This issue is to track the search API.

Modifications and improvements to search related functionality shall be discussed here.

@rhydomako has been looking into importing the HRS_Index and Supplemental index data into ElasticSearch.

rhydomako commented 7 years ago

First off, loading the HRS json files into elasticsearch:

backend/bin $ docker-compose up elasticsearch1
backend/bin $ ./load_hrs_into_elasticsearch.sh

This will combine all the json files into one large file that is then ingested by the elasticsearch _bulk endpoint. The json documents are indexed on insertion, and I believe the default is to index all fields.
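
For anyone unfamiliar with the _bulk format: the body is newline-delimited JSON, with an action line followed by the document source line, and it must end with a newline. Here is a rough Python sketch of the combining step. This is not the actual contents of load_hrs_into_elasticsearch.sh; the function name `build_bulk_body` is hypothetical, and it assumes each document has `chapter` and `section` fields and that the ID is built as `chapter-section`.

```python
import json


def build_bulk_body(docs, index="hrs", doc_type="statutes"):
    """Return a newline-delimited JSON body for the Elasticsearch _bulk endpoint."""
    lines = []
    for doc in docs:
        # Assumption: the statute reference "chapter-section" is used as the ID.
        doc_id = "{}-{}".format(doc["chapter"], doc["section"])
        # Action line: tells Elasticsearch where to index the next line.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself, serialized on a single line.
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string could be written to one file and posted to the _bulk endpoint with curl's `--data-binary` option, which preserves the newlines.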

Individual documents can be retrieved (I've made their IDs the statute reference):

$ curl -u elastic:changeme localhost:9200/hrs/statutes/10-33?pretty
{
  "_index" : "hrs",
  "_type" : "statutes",
  "_id" : "10-33",
  "_version" : 2,
  "found" : true,
  "_source" : {
    "url" : "http://www.capitol.hawaii.gov/hrscurrent/Vol01_Ch0001-0042F/HRS0010/HRS_0010-0033.htm",
    "year" : "current",
    "division" : "1",
    "division_text" : "Government",
    "volume" : "1",
    "title" : "1",
    "title_text" : "General Provisions",
    "subtitle" : null,
    "subtitle_text" : null,
    "chapter" : "10",
    "chapter_text" : "Office of Hawaiian Affairs",
    "article" : null,
    "article_text" : null,
    "part" : "II",
    "part_text" : "Revenue Bonds",
    "subpart" : null,
    "section" : "33",
    "section_text" : "Powers herein, additional to other powers",
    "text" : [
      "    The powers conferred by this part shall be in addition and supplemental to the powers conferred by any other general, special, or local law.  Insofar as this part is inconsistent with any other general, special, or local law this part shall be controlling. [L 1994, c 283, pt of §2(2)]",
      " "
    ],
    "refs" : [ ]
  }
}

The elasticsearch _search API can then be used on the HRS documents directly:

$ curl -u elastic:changeme localhost:9200/hrs/statutes/_search?pretty -d '
{
 "query": { "match": { "chapter_text": "Hawaii" } },
 "_source": ["division_text", "chapter_text", "section_text"],
 "size": 2
}'

{
  "took" : 15,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1953,
    "max_score" : 2.3118572,
    "hits" : [
      {
        "_index" : "hrs",
        "_type" : "statutes",
        "_id" : "414-332",
        "_score" : 2.3118572,
        "_source" : {
          "section_text" : "Sale of assets other than in regular course of business",
          "division_text" : "Business",
          "chapter_text" : "Hawaii Business Corporation Act"
        }
      },
      {
        "_index" : "hrs",
        "_type" : "statutes",
        "_id" : "206E-8.5",
        "_score" : 2.3118572,
        "_source" : {
          "section_text" : "Developments within special management areas and shoreline setback",
          "division_text" : "Government",
          "chapter_text" : "Hawaii Community Development Authority"
        }
      }
    ]
  }
}
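
If we end up wrapping this in our own code, the request body and the response shape above are simple to handle. A small Python sketch, assuming the response layout shown above; the helper names `build_search_body` and `summarize_hits` are made up for illustration:

```python
def build_search_body(field, text, source_fields, size=10):
    """Build a match query body like the one passed to curl above."""
    return {
        "query": {"match": {field: text}},
        "_source": list(source_fields),
        "size": size,
    }


def summarize_hits(response):
    """Reduce a _search response to (id, score, section_text) tuples."""
    return [
        (hit["_id"], hit["_score"], hit["_source"].get("section_text"))
        for hit in response["hits"]["hits"]
    ]
```

Calling `build_search_body("chapter_text", "Hawaii", ["division_text", "chapter_text", "section_text"], size=2)` gives the same body as the curl example.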

mckaydavis commented 7 years ago

Great start @rhydomako!

A few questions looking forward:

1. Should we expose the ES API directly to the client via an nginx reverse proxy? Or should this be an internal API exposed only to whatever server code is orchestrating everything? With the second option, we would then provide our own search API that is exposed to the client. cc: @thgaskell
2. Does anyone know an optimal (but minimal) amount of RAM to give to ES?
   For hosting, I'm using Digital Ocean. To run on the droplet with 1 GB of RAM, I had to limit the memory used by ElasticSearch to 384m so that other processes also have enough memory. Considering that the HRS .json files already total ~100 MB, I can easily see this low memory limit being an issue for ES. Note that Digital Ocean Droplet pricing scales linearly at $10/GB/mo, so it can get expensive pretty quickly.

rhydomako commented 7 years ago

For 1), as long as we limit it to GET/POST requests to the search endpoint, I don't think there is too much harm in exposing the ES API.
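
To make that restriction concrete, here is a minimal Python sketch of the allowlist check that a proxy layer would need to enforce. It is only an illustration: the `is_allowed` name is hypothetical, and it assumes the only exposed path is `/hrs/statutes/_search`.

```python
import re

# Assumption: only the statutes search endpoint is reachable from outside.
SEARCH_PATH = re.compile(r"/hrs/statutes/_search/?")
ALLOWED_METHODS = {"GET", "POST"}


def is_allowed(method, path):
    """Return True only for GET/POST requests to the search endpoint."""
    # Ignore the query string (e.g. "?pretty") when checking the path.
    path_only = path.split("?", 1)[0]
    return (
        method.upper() in ALLOWED_METHODS
        and SEARCH_PATH.fullmatch(path_only) is not None
    )
```

Anything else (DELETE on the index, the _bulk endpoint, and so on) would be rejected before it reaches ES.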

I don't really know what the optimal amount of RAM is, but I agree that the optimum is the smallest amount we can get away with. Just experimenting with my local docker containers, it seems to run OK with a heap of 128m (only the HRS docs loaded so far).

So I would suggest setting the heap limit even lower than 384m, and adjusting it if we run into problems.

thgaskell commented 7 years ago

I think we've run it on 2 GB of RAM on a small Digital Ocean box.

The raw JSON dump takes up almost 100 MB on local storage, so a 128 MB heap is probably not enough with the current structure.

mckaydavis / hrs.plus

Search API #5