openSNP / snpr

The sources of the openSNP website
http://opensnp.org
MIT License

Implement counter_cache where necessary #103

Closed: philippbayer closed this issue 10 years ago

philippbayer commented 10 years ago

Related to #102

We should use counter_caches wherever we count things, to improve speed; for example, on the genotypes index.
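For the record, the mechanism under discussion is Rails's `belongs_to :snp, :counter_cache => true`, which keeps a `user_snps_count` column on the parent row up to date on create and destroy. A minimal plain-Ruby illustration of the idea (model names taken from this thread; no Rails required):

```ruby
# Plain-Ruby sketch of the counter-cache idea: instead of running
# COUNT(*) over user_snps on every read, each Snp keeps a pre-computed
# user_snps_count that is incremented on create and decremented on destroy.
class Snp
  attr_accessor :user_snps_count

  def initialize
    @user_snps_count = 0
  end
end

class UserSnp
  def initialize(snp)
    @snp = snp
    @snp.user_snps_count += 1  # what counter_cache does after create
  end

  def destroy
    @snp.user_snps_count -= 1  # ...and after destroy
  end
end

snp = Snp.new
first  = UserSnp.new(snp)
second = UserSnp.new(snp)
second.destroy
snp.user_snps_count  # => 1
```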

tsujigiri commented 10 years ago

Most importantly: user_snps. Just for the record. :)

philippbayer commented 10 years ago

Added in 8499c6a912 for user_snps: every SNP now has an extra user_snps_count column, which is updated whenever a user SNP is added or removed. Note that to calculate the total number of user_snps we currently still have to iterate over all SNPs.

Not sure how to do the same for genotypings. counter_cache is only really useful in a one-to-many relationship where the "many" are truly many; since users and genotypings are mostly a one-to-one relationship, using counter_cache there makes little sense.

I haven't run the migration on the server yet, even on my development machine with one full genotyping it took about 10 minutes to finish the migration :P

Edit: Just realized that when SNPs are created from the parsing job, user_snps_count is not incremented. Possibly due to the UserSnp.import trick? Fixed in 8f20d91293
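For context: `UserSnp.import` (activerecord-import) writes rows with bulk INSERTs and skips the ActiveRecord callbacks that would normally maintain the counter cache. A hedged sketch of one way to repair the counters after such an import, using Rails's built-in `reset_counters` (`affected_snp_ids` is a hypothetical collection of the SNPs touched by the import, not something from this codebase):

```ruby
# Hypothetical repair step after a bulk import: reset_counters re-counts
# the user_snps association in SQL and writes the result back into
# snps.user_snps_count for each affected row.
affected_snp_ids.each do |snp_id|
  Snp.reset_counters(snp_id, :user_snps)
end
```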

tsujigiri commented 10 years ago

> I haven't run the migration on the server yet, even on my development machine with one full genotyping it took about 10 minutes to finish the migration :P

We could first load a hash of { snp_name => count }, which would only require one query (even though it might be a long one), and update the counter caches manually. Maybe we should stop Sidekiq for the duration, so that the counts don't change while we're counting and the DB doesn't get overloaded.
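The grouped-count idea, sketched in plain Ruby (the array below stands in for the `user_snps` table; in ActiveRecord the single query would be something along the lines of `UserSnp.group(:snp_name).count`, with the resulting hash driving manual updates of the cached column):

```ruby
# Build { snp_name => count } in one pass, mirroring
# SELECT snp_name, COUNT(*) FROM user_snps GROUP BY snp_name;
# the hash can then drive manual updates of user_snps_count.
user_snps = [
  { :snp_name => 'rs123' },
  { :snp_name => 'rs123' },
  { :snp_name => 'rs456' },
]

counts = Hash.new(0)
user_snps.each { |row| counts[row[:snp_name]] += 1 }

counts  # => {"rs123"=>2, "rs456"=>1}
```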

For the genotypings we would either have to implement something ourselves or find something that already does this, I guess.

tsujigiri commented 10 years ago
```sql
EXPLAIN SELECT snp_name, COUNT(*) FROM user_snps GROUP BY snp_name;
```

tells me that this would be an index-only scan, meaning it might not be that slow after all.

tsujigiri commented 10 years ago

I'm also not sure whether we should really use Snp.find_each here, or at least we should increase the batch size. With the default batch size of 1000 this would be 2000 queries.
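The batch arithmetic, as a plain-Ruby sketch (in ActiveRecord the knob would be `Snp.find_each(:batch_size => 5000)`; the counts below assume 10,000 rows purely for illustration):

```ruby
# find_each fetches rows in batches, one SELECT per batch; a larger
# batch size means fewer queries at the cost of more memory per batch.
records = (1..10_000).to_a

default_batches = records.each_slice(1000).count  # => 10
bigger_batches  = records.each_slice(5000).count  # => 2
```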

tsujigiri commented 10 years ago

On the other hand, we're saving two million records individually... So, who cares about 2000 queries more or less. :D

philippbayer commented 10 years ago

Can't currently run the migrations: openSNP is growing by roughly 3 genotypings per day, and Sidekiq just decided to die (no workers, had to restart). I don't think we can wait for an empty queue; the parse queue currently has 200 entries, and I don't think that will slow down.

tsujigiri commented 10 years ago

Apparently, adding a column without a default value doesn't take very long on Postgres, as it doesn't have to rewrite the whole table. So if we remove the `:default => 0` here, we should be fine. 0 is a wrong value anyway.
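A hedged sketch of what such a migration could look like (class and column names assumed from this thread). On the PostgreSQL versions of this era, adding a nullable column with no default is a quick catalog-only change, whereas adding one with a default forces a rewrite of the whole table:

```ruby
# Hypothetical migration: add the counter column without :default => 0
# so Postgres does not rewrite the table; counts are backfilled later.
class AddUserSnpsCountToSnps < ActiveRecord::Migration
  def up
    add_column :snps, :user_snps_count, :integer
  end

  def down
    remove_column :snps, :user_snps_count
  end
end
```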

philippbayer commented 10 years ago

Done in 254f52515d82529f0a7940b7e0fc0c64eb406f44

philippbayer commented 10 years ago

All that's missing is the deployment.