Open mckaydavis opened 7 years ago
First off, loading the HRS json files into elasticsearch:
backend/bin $ docker-compose up elasticsearch1
backend/bin $ ./load_hrs_into_elasticsearch.sh
This will combine all the json files into one large file that is then ingested by the elasticsearch _bulk endpoint. The json documents are indexed on insertion, and I believe the default is to index all fields.
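For anyone who wants to see what the combined file looks like, here is a minimal Python sketch of the newline-delimited format the _bulk endpoint expects. It is not the contents of load_hrs_into_elasticsearch.sh; the function name `build_bulk_payload` is hypothetical, and deriving the ID as `chapter-section` is an assumption based on the `10-33` example in this thread.

```python
import json


def build_bulk_payload(docs, index="hrs", doc_type="statutes"):
    """Serialize documents into the NDJSON body used by the _bulk endpoint.

    Each document becomes two lines: an action line naming the target
    index/type/ID, followed by the document source itself.
    """
    lines = []
    for doc in docs:
        # Assumption: the statute reference is "<chapter>-<section>", e.g. "10-33".
        doc_id = "{}-{}".format(doc["chapter"], doc["section"])
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [{"chapter": "10", "section": "33", "chapter_text": "Office of Hawaiian Affairs"}]
    print(build_bulk_payload(sample), end="")
```

The resulting text could be saved to a file and posted to the _bulk endpoint; the exact curl invocation used by the script is not shown here.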
Individual documents can be retrieved (I've made their IDs the statute reference):
$ curl -u elastic:changeme localhost:9200/hrs/statutes/10-33?pretty
{
"_index" : "hrs",
"_type" : "statutes",
"_id" : "10-33",
"_version" : 2,
"found" : true,
"_source" : {
"url" : "http://www.capitol.hawaii.gov/hrscurrent/Vol01_Ch0001-0042F/HRS0010/HRS_0010-0033.htm",
"year" : "current",
"division" : "1",
"division_text" : "Government",
"volume" : "1",
"title" : "1",
"title_text" : "General Provisions",
"subtitle" : null,
"subtitle_text" : null,
"chapter" : "10",
"chapter_text" : "Office of Hawaiian Affairs",
"article" : null,
"article_text" : null,
"part" : "II",
"part_text" : "Revenue Bonds",
"subpart" : null,
"section" : "33",
"section_text" : "Powers herein, additional to other powers",
"text" : [
" The powers conferred by this part shall be in addition and supplemental to the powers conferred by any other general, special, or local law. Insofar as this part is inconsistent with any other general, special, or local law this part shall be controlling. [L 1994, c 283, pt of §2(2)]",
" "
],
"refs" : [ ]
}
}
The elasticsearch _search API can then be used on the HRS documents directly:
$ curl -u elastic:changeme localhost:9200/hrs/statutes/_search?pretty -d '
{
"query": { "match": { "chapter_text": "Hawaii" } },
"_source": ["division_text", "chapter_text", "section_text"],
"size": 2
}'
{
"took" : 15,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1953,
"max_score" : 2.3118572,
"hits" : [
{
"_index" : "hrs",
"_type" : "statutes",
"_id" : "414-332",
"_score" : 2.3118572,
"_source" : {
"section_text" : "Sale of assets other than in regular course of business",
"division_text" : "Business",
"chapter_text" : "Hawaii Business Corporation Act"
}
},
{
"_index" : "hrs",
"_type" : "statutes",
"_id" : "206E-8.5",
"_score" : 2.3118572,
"_source" : {
"section_text" : "Developments within special management areas and shoreline setback",
"division_text" : "Government",
"chapter_text" : "Hawaii Community Development Authority"
}
}
]
}
}
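If a client ends up building these requests in code, the same query body and a small reader for the response might look like the sketch below. The helper names `build_search_body` and `summarize_hits` are hypothetical, and the sample response is trimmed from the output above; nothing here talks to a live cluster.

```python
def build_search_body(field, text, source_fields, size):
    """Build a match query that returns only the selected _source fields."""
    return {
        "query": {"match": {field: text}},
        "_source": list(source_fields),
        "size": size,
    }


def summarize_hits(response):
    """Return (id, chapter_text, section_text) for each hit in a _search response."""
    return [
        (
            hit["_id"],
            hit["_source"].get("chapter_text"),
            hit["_source"].get("section_text"),
        )
        for hit in response["hits"]["hits"]
    ]


if __name__ == "__main__":
    body = build_search_body(
        "chapter_text", "Hawaii", ["division_text", "chapter_text", "section_text"], 2
    )
    print(body)

    # Trimmed copy of the response shown above, used only as sample input.
    sample_response = {
        "hits": {
            "total": 1953,
            "hits": [
                {
                    "_id": "414-332",
                    "_source": {
                        "section_text": "Sale of assets other than in regular course of business",
                        "chapter_text": "Hawaii Business Corporation Act",
                    },
                }
            ],
        }
    }
    for row in summarize_hits(sample_response):
        print(row)
```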
Great start @rhydomako!
A few questions looking forward:
For 1), as long as we limit it to GET/POST requests to the search endpoint, I don't think there is too much harm in exposing the ES API.
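To make the "GET/POST to the search endpoint only" idea concrete, here is a rough sketch of the check a small proxy in front of ES could apply. Everything in it is an assumption for illustration: the function name `is_allowed`, the exact list of allowed paths, and the idea of doing this in Python rather than in a reverse proxy config.

```python
from urllib.parse import urlsplit

# Assumption: only these two search paths on the hrs index are exposed.
ALLOWED_METHODS = {"GET", "POST"}
ALLOWED_PATHS = {"/hrs/_search", "/hrs/statutes/_search"}


def is_allowed(method, raw_path):
    """Return True only for GET/POST requests to an allowed _search path."""
    # Drop the query string (e.g. "?pretty") before comparing paths.
    path = urlsplit(raw_path).path
    return method.upper() in ALLOWED_METHODS and path in ALLOWED_PATHS


if __name__ == "__main__":
    print(is_allowed("GET", "/hrs/statutes/_search?pretty"))
    print(is_allowed("POST", "/hrs/statutes/_bulk"))
    print(is_allowed("DELETE", "/hrs"))
```

Matching against an exact list of paths, rather than anything ending in `_search`, keeps write endpoints such as _bulk and index deletion out of reach by default.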
I don't really know what the optimal amount of RAM is, but I agree that the optimal is the smallest amount we can get away with. Just experimenting using my local docker containers, it seems to run ok with a heap of 128m (only the HRS docs loaded so far).
So I would suggest setting it even lower, and adjusting that if we run into problems.
I think we've run it on 2GB of RAM on a small DigitalOcean box.
The raw json dump is taking up almost 100MB on local storage, so 128MB heap size is probably not enough with the current structure.
This issue is to track the search API.
Modifications and improvements to search related functionality shall be discussed here.
@rhydomako has been looking into importing the HRS_Index and Supplemental index data into ElasticSearch.