socrata / odn-backend

Backend for the Open Data Network.

Radix Trees to replace Entity and Question Autosuggest #65

Closed aaasen closed 8 years ago

aaasen commented 8 years ago

One big pain point in adding new data to the ODN is updating the autosuggest datasets. The entities autosuggest is somewhat maintainable, but the questions autosuggest is a real mess. It uses a variety of hacks to get Socrata autosuggest to do things that it simply isn't meant to do.

I decided to experiment with using in-memory radix trees for autosuggestion.

Replacing entity suggestion was pretty simple. I built a radix tree containing the names of all entities and then used prefix queries to get all entities matching a given prefix, ranked in descending order by population.
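A minimal sketch of this kind of prefix lookup, not the actual ODN code. It uses a plain character trie, the uncompressed relative of a radix tree (a radix tree would additionally merge chains of single-child nodes to save memory). The `PrefixTrie` class, the `suggest_entities` helper, and the sample entities with their population figures are all made up for illustration.

```python
class PrefixTrie:
    """Uncompressed character trie used here as a stand-in for a radix tree."""

    def __init__(self):
        self.children = {}  # next character -> child node
        self.items = []     # entities whose name ends exactly at this node

    def insert(self, name, entity):
        node = self
        for ch in name.lower():
            node = node.children.setdefault(ch, PrefixTrie())
        node.items.append(entity)

    def with_prefix(self, prefix):
        # Walk down to the node that represents the prefix.
        node = self
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every entity stored at or below that node.
        found, stack = [], [node]
        while stack:
            current = stack.pop()
            found.extend(current.items)
            stack.extend(current.children.values())
        return found


def suggest_entities(trie, prefix, limit=5):
    """Return up to `limit` matches, largest population first."""
    matches = trie.with_prefix(prefix)
    matches.sort(key=lambda e: e["population"], reverse=True)
    return matches[:limit]


# Illustrative data only; the population values are invented round numbers.
entities = [
    {"name": "Seattle, WA", "population": 650000},
    {"name": "Seaside, CA", "population": 34000},
    {"name": "Sacramento, CA", "population": 480000},
]

trie = PrefixTrie()
for entity in entities:
    trie.insert(entity["name"], entity)

top = suggest_entities(trie, "sea")
```

Here `top` contains the two entities whose names start with "sea", ordered by population. Sorting happens at query time in this sketch; precomputing the ranking per node would be another option.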

The more interesting problem was replacing questions. To do this, I first take the query and find all of the important words:

What is the population of seattle? => ['population', 'seattle']
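One way the word extraction might look is a simple stopword filter. This is a sketch under assumptions: the thread does not say how "important" words are chosen, and both the `STOPWORDS` set and the `important_words` function are hypothetical.

```python
import re

# Hypothetical stopword list; the real filter is not described in the thread.
STOPWORDS = {"what", "is", "the", "of", "in", "a", "an", "for", "and", "how", "many"}


def important_words(query):
    """Lowercase the query, split it into words, and drop the stopwords."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    return [word for word in words if word not in STOPWORDS]


words = important_words("What is the population of seattle?")
```

With this stopword list, `words` matches the example above.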

I have two radix trees: one for entity names and one for variable names. For each word, I run a prefix query against each tree to get a list of variables and a list of entities related to the query. If a word returns too many results, I ignore that word's results so that short words with many completions do not swamp the output.
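The per-word lookup with a cutoff could be sketched roughly as follows. To keep the example short, a sorted list searched with `bisect` stands in for each radix tree; it answers the same prefix queries. The `MAX_MATCHES` threshold, the function names, and the sample names are assumptions, since the thread does not give the actual cutoff.

```python
from bisect import bisect_left

MAX_MATCHES = 3  # hypothetical cutoff; the real threshold is not stated


def prefix_matches(sorted_names, prefix):
    """Return all names starting with `prefix` from a sorted list."""
    start = bisect_left(sorted_names, prefix)
    matches = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix):
            break  # sorted order means no later name can match
        matches.append(name)
    return matches


def candidates(sorted_names, words, max_matches=MAX_MATCHES):
    """Union of per-word prefix matches, skipping overly ambiguous words."""
    found = []
    for word in words:
        matches = prefix_matches(sorted_names, word)
        if len(matches) > max_matches:
            continue  # too many completions: ignore this word entirely
        for match in matches:
            if match not in found:
                found.append(match)
    return found


entity_names = sorted(
    ["seaside", "seattle", "salem", "san diego", "san jose", "santa fe", "spokane"]
)
variable_names = sorted(["population", "population change", "median earnings"])

words = ["s", "population", "seattle"]
entities = candidates(entity_names, words)
variables = candidates(variable_names, words)
```

The word "s" matches all seven entity names, which is over the cutoff, so it contributes nothing; only "seattle" produces an entity, and only "population" produces variables.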

Then, I take the top n entities and the top n variables and find the combinations that we have data for. This is the most time-consuming part of the process because it requires a SoQL query. Finally, I return each variable-entity combination, which the client can phrase as a question.
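The combination step might be sketched like this. The `HAS_DATA` set is a hypothetical in-memory stand-in for the SoQL availability query, and `combinations_with_data` and `as_question` are illustrative names. The question phrasing is shown only to illustrate what a client could do with the returned pairs.

```python
from itertools import product

# Hypothetical availability table standing in for the SoQL lookup.
HAS_DATA = {
    ("population", "seattle"),
    ("median earnings", "seattle"),
    ("population", "seaside"),
}


def combinations_with_data(variables, entities, n=2, has_data=HAS_DATA):
    """Pair the top n variables with the top n entities, keeping pairs with data."""
    top_variables = variables[:n]  # inputs are assumed to be ranked already
    top_entities = entities[:n]
    return [
        (variable, entity)
        for variable, entity in product(top_variables, top_entities)
        if (variable, entity) in has_data
    ]


def as_question(variable, entity):
    """Client-side phrasing of one variable-entity pair."""
    return f"What is the {variable} of {entity}?"


pairs = combinations_with_data(
    ["population", "population change", "median earnings"],
    ["seattle", "seaside", "salem"],
)
questions = [as_question(variable, entity) for variable, entity in pairs]
```

Because only n variables and n entities are considered, at most n * n pairs need to be checked, which keeps the expensive availability lookup bounded.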

Overall, this approach works very well. It has many advantages over the current system.

The only real disadvantage is that it requires storing the entire entity radix tree in memory, which takes about 250MB. I'm going to create a review app to see how this will affect the server and make some tweaks if necessary.

aaasen commented 8 years ago

Deployed this change to staging to check memory usage. Before this change, backend staging was hovering at about 40MB of memory usage. That is now up to 330MB, so this change caused an increase in memory usage of around 300MB. That's a lot, but it is still below the 500MB soft memory quota. Production is hovering at around 100MB of memory usage, and I expect that will jump to 400MB once I deploy this change. There is definitely some low-hanging fruit that I'm going to try to optimize.