Postgres Database Optimization

greenelab / connectivity-search-backend

Django backend for hetnet connectivity search

https://search-api.het.io

BSD 3-Clause "New" or "Revised" License

6 stars 2 forks source link

Postgres Database Optimization #60

Closed dongbohu closed 5 years ago

dongbohu commented 5 years ago

@dhimmel and @vincerubinetti: I added a few indexes in identifier and name fields in Node table on a clone of current backend DB. These indexes are supposed to make prefix search on identifier and substring/trigram searches on name field up to ten times faster (from a few hundred milliseconds to ~10 milliseconds). I am running the optimized DB on test AWS EC2 instance: http://35.175.113.38/v1/nodes/?search=xxx

Please replace xxx with whatever string you want to search and compare its performance with the production server: https://search-api.het.io/v1/nodes/?search=xxx and tell me whether you feel any difference. If you do, I will apply these indexes on the production DB. Thanks.

dhimmel commented 5 years ago

Because the node table is so small ( < 50,000 nodes), there will be very little cost to indexing them right? So if this could even theoretically speed up search, I think we should add it.

dongbohu commented 5 years ago

@dhimmel: Yes, these indexes can be added/deleted/updated in a few seconds.

vincerubinetti commented 5 years ago

I'd agree. Seems like it'd help. As long as it doesn't hurt, I'm for it.

dongbohu commented 5 years ago

Here is a SQL script that adds the extra indexes. It should be run after the database has been populated.

/* Add extra indexes to speed up searching in "node" and "metanode" tables. */
-- Enable "pg_trgm" extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Create GIN index for substring and trigram searches of "name" field in "node" table
CREATE INDEX name_trgm_idx ON dj_hetmech_app_node USING GIN (name gin_trgm_ops);

-- Create GIN index for prefix search of "identifier" field in "Node" table
CREATE INDEX identifier_trgm_idx ON dj_hetmech_app_node USING GIN (identifier gin_trgm_ops);

-- Create default (B-tree) index for exact match of "abbreviation" field in "metanode" table
CREATE INDEX abbr_idx on dj_hetmech_app_metanode (abbreviation);

It can be run by: psql -U <username> -h <hostname> -f <this_script> -d <db_name>

dongbohu commented 5 years ago

I added these indexes on production database.

dhimmel commented 5 years ago

@dongbohu is this not something that can be integrated into models.py? All of the other index specification occurs there. Would be great if this could be integrated into the django workflow somehow, right?

dongbohu commented 5 years ago

It's possible to integrate these indexes in models.py, but usually the indexes should be added after database is populated. See a discussion here: https://dba.stackexchange.com/questions/66182/index-creation-before-or-after-loading-data

dongbohu commented 5 years ago

I had bad experience defining indexes before loading the data in.

dhimmel commented 5 years ago

The main reason to adding the index after the database has been created seems to be speed to create them. However, since the Nodes table is small that doesn't seem to be an issue here?

My worry is not whether the indexes are creating during db population or afterwards. My worry is to have their creation be completely separate from the hetmech-backend codebase. Is there an automated way to build these indexes? Should these be part of the populate_database management command?

dongbohu commented 5 years ago

When index was generated while the table is constantly updated, not only is the loading speed slowed down, but the index bloated and became non-optimal. That's why Postgres also provides reindex command. (Since Node table is small, I confess that maybe I am a little paranoid.)

populate_database management command sounds a good place to add these indexes.