What does this change?
This PR includes dictionary rules in the published artefact ingested by the Checker service.
It also includes some fixes to problems caused by the much larger number of rules now included in the Rule Manager.
Adds a .fetchSize(1000) call to some database methods that select a very large number of rows. According to the ScalikeJDBC docs:
"the PostgreSQL JDBC driver does infinite(!) caching for result sets if fetchSize is set to 0 (the default) and this causes memory problems."
Setting an explicit .fetchSize resolved the out-of-memory errors we encountered in these cases. Testing sizes of 100, 1,000, 10,000 and 100,000 rows showed no significant performance differences.
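As a rough illustration of the fetchSize change, the sketch below shows the shape of such a query in ScalikeJDBC. The object name, table name and column name are hypothetical and are not taken from this PR; only the .fetchSize(1000) call reflects the change described above.

```scala
import scalikejdbc._

object FetchSizeSketch {
  // Batch size used in this PR; other sizes performed similarly in testing.
  val BatchSize = 1000

  // Hypothetical table and column names, for illustration only.
  def allPatterns()(implicit session: DBSession): List[String] =
    sql"select pattern from rules"
      .fetchSize(BatchSize) // explicit fetch size instead of the driver default of 0
      .map(rs => rs.string("pattern"))
      .list
      .apply()
}
```

Note that the PostgreSQL JDBC driver only fetches in batches when autocommit is off, so a call like this may need to run inside a transaction to have the intended effect.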
We also saw an out-of-memory error for ruleJson.toString.getBytes(java.nio.charset.StandardCharsets.UTF_8.name). Avoiding the intermediate toString step by using Json.toBytes(ruleJson) resolved the error. This highlights that previously acceptable inefficiencies are more likely to cause problems now that we are handling a great deal more rules.
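The difference between the two serialisation paths can be sketched as follows, assuming ruleJson is a Play JSON JsValue. The object name and the sample payload are hypothetical; the real rule JSON is far larger, which is where the intermediate String becomes costly.

```scala
import java.nio.charset.StandardCharsets
import play.api.libs.json.{JsValue, Json}

object JsonBytesSketch {
  // Hypothetical, tiny stand-in for the real rule payload.
  val ruleJson: JsValue = Json.obj("id" -> "rule-1", "pattern" -> "teh")

  // Before: materialises the whole document as a String, then copies it into bytes.
  val viaString: Array[Byte] =
    ruleJson.toString.getBytes(StandardCharsets.UTF_8.name)

  // After: serialises straight to a byte array, skipping the intermediate String.
  val direct: Array[Byte] = Json.toBytes(ruleJson)
}
```

Both values should hold the same bytes for a payload like this one; the second path simply avoids holding a full String copy of the document in memory at the same time.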
Separately, we encountered an odd issue where a duplicated word in the dictionary ended up with an empty string as its pattern in our live table. We didn't get to the bottom of the mechanism behind this, but added a words.distinct.filterNot(_ == "") filter to our word list to resolve the problem.
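A minimal sketch of the word list filter, using a hypothetical helper name and made-up sample words, shows what it removes:

```scala
object WordListSketch {
  // Drop duplicate words (keeping the first occurrence) and then any empty strings.
  def cleanWords(words: List[String]): List[String] =
    words.distinct.filterNot(_ == "")
}
```

For example, WordListSketch.cleanWords(List("colour", "", "colour", "flavour")) returns List("colour", "flavour").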
How to test
Run the application locally according to the instructions in the README. Make sure you run the setup script to pull the dictionary XML files locally.
Hit the /api/refreshDictionary endpoint with a POST request (e.g. in Postman, with cookies from a valid browser request).
Check the artefact in your localstack instance, e.g. with these commands run in the Docker localstack_main container CLI to pull, pretty print, and find rows containing 'DictionaryRule':