Column drops with Rails

pdmholden commented 6 years ago

What is the suggested way of handling column drops with gh-ost, when running a Rails application? We've run into problems with this, because of the Rails schema cache. After the cut-over, Rails gets confused because it still has the old table schema in its cache and as a result starts throwing exceptions.

Since GitHub is also using Rails, I thought this would be a good place to ask. 😛

Two simple ways of dealing with the problem are (1) restart the Rails server after the cut-over or (2) explicitly invalidate the schema cache after the cut-over. Neither of these is desirable, because they couple the gh-ost migration to the Rails application, and de-coupling these is one of the reasons we started using gh-ost.

Another suggestion we have is to remove eager loading on the relevant queries, before the cut-over. Personally I'm not too familiar with eager loading, but it would fit with our process, which requires making the application code capable of working with or without the column that is being dropped, before the cut-over happens.

pdmholden commented 6 years ago

PS: I don't appear to have permission to add the "question" label to this issue.

zmoazeni commented 6 years ago

@pdmholden I can share how my team handles it (not GitHub).

Before we run the migration(s). We deploy code to production that looks like this:

class MyModel < ActiveRecord::Base
  self.ignored_columns = %w(column_to_be_dropped)
...

At that point, the Rails code shouldn't be using that column anyway. That deploy can go to production right away and from the Rails perspective, the column doesn't exist (even though it still lives in the MySQL table).

We also prepare a second PR that removes that ignored_columns line and creates a Rails migration, but we don't deploy that yet.

Then we gh-ost the migration. After the migration is done via gh-ost, we update the schema_migrations with the id of a migration we have prepared.

Finally, we deploy the prepared PR. The Rails migration noops because it's in the schema_migrations table, but it's handy for staging and other developer's local environments.

We do use the cached schema, but that's to avoid the table definition lookups. We noticed we were causing a thundering herd on deploy/restarts (every Rails process would query the same thing). I have noticed that rolling restarts exacerbated the cached schema / rails conflicting issue. But this strategy has worked very well for us the past couple years.

We use this strategy for adding new columns too. We were having a different-but-same kind of errors with the two processes detecting different versions of the same table across a restart. That may have gone away with the cached schema, I haven't verified that yet.

I've considered blogging this, but it felt a little too specific to our use-case. Hope this helps you though.

zmoazeni commented 6 years ago

Oh one other thing I'll mention on the schema caching. We have our deploy processes generate that at deploy time before our unicorn processes restart.

That makes a little bit of a difference than if it was committed to source control. But caching or not, this 3-step process has worked great for us.

tomkrouper commented 6 years ago

@ikewalker does that sound like what we essentially do at GitHub? It does to me, but I think you spend more time in that area.

ikewalker commented 6 years ago

@tomkrouper the process we follow for dropping columns at GitHub is pretty much the same as what @zmoazeni is doing, but since we're not on Rails 5 yet we can't use ignored_columns (and thus have to use super.reject instead).

(We also verify that there are no SQL queries running outside of ActiveRecord that reference the column before we drop it)

So our process is basically:

🚢 a PR to ignore the column
Verify that no queries are referencing the column
Drop the column using gh-ost
🚢 another PR to remove the super.reject code from the model
:tada:

zmoazeni commented 6 years ago

Verify that no queries are referencing the column

Out of curiosity, how are you all verifying this? ProxySQL or temporarily have long_query_time=0?

That sounds smart and it may be something we check going forward too.

ikewalker commented 6 years ago

@zmoazeni we use VividCortex so we can search all of the queries running in production.

zmoazeni commented 6 years ago

@ikewalker Sorry for the followup. Do you know how the VividCortex query agent detects running queries? Is it keeping tabs on show full processlist? The only other ways I know would be to funnel all queries through a SQL-aware proxy (e.g. ProxySQL) or log them all as slow queries.

I'm wondering if there's another option I'm oblivious to that my team can take advantage of.

ikewalker commented 6 years ago

@zmoazeni it is neither using show full processlist nor a proxy, it's using TCP packet inspection

pdmholden commented 6 years ago

@zmoazeni: Your process sounds the same as ours. Our Rails 5 upgrade is in progress, so we can't use .ignored_columns but @ikewalker's suggestion of using super.reject makes sense as an interim alternative.

Thank you all for your responses. This helps us integrate gh-ost into our process.

zmoazeni commented 6 years ago

@ikewalker Amazing. Thanks!

github / gh-ost

Column drops with Rails #579