ilyakatz / data-migrate

Migrate and update data alongside your database structure.
MIT License
1.42k stars 194 forks source link

Does not work with Rails multiple databases #181

Open morgoth opened 3 years ago

morgoth commented 3 years ago

When I configure my Rails 6.1 app to use multiple databases (via built in functionality), data-migrate doesn't work.

bin/rake data:schema:load
my-project/db/secondary_db_data_schema.rb doesn't exist yet. Run `rake data:migrate` to create it, then try again.

but running rake data:migrate does not create it.

Is it a known limitation? Do you plan to support multiple databases?

wili3 commented 3 years ago

I am facing the same issue, did you find any workaround to make this work?

rosanelli commented 3 years ago

me too

ilyakatz commented 3 years ago

@lewhit, i forgot, did you have experience with multi-databases in data-migrate? do you have any ideas on what's happening here?

rosanelli commented 3 years ago

@ilyakatz the gem is trying to update the data_migrations table of the secondary database. the data migration is for the primary database. so the gem seems to be unaware that multiple databases can exist.

kroehre commented 3 years ago

We have this same issue. Has anyone found a workaround?

edit: We were able to get this working with an initializer:

require "data_migrate"

module DataMigrate::OnlyPrimaryDatabase
  def each_current_configuration(environment, name = nil)
    # Ignore secondary database(s)
    super(environment, name || "primary")
  end
end

DataMigrate::DatabaseTasks.extend DataMigrate::OnlyPrimaryDatabase

DataMigrate.configure do |config|
  config.db_configuration = Rails.configuration.database_configuration[Rails.env]["primary"]
end
chaunce commented 1 year ago

I have a branch that appears to properly run migrations across all configured connections, in the same way that schema migrations work. I don't have a lot of time right now to test and fix specs, but if anyone else wants to contribute we could get a pull request submitted once this is cleaned up https://github.com/chaunce/data-migrate/tree/multiple_connection_support

chaunce commented 1 year ago

I have fixed the specs and created a pull request. Any testing and feedback anyone can offer would be helpful https://github.com/ilyakatz/data-migrate/pull/249

martingregoire commented 1 year ago

Thanks for working on this issue! I'm testing the version 10.0.0 of the data_migrate gem, and have some questions.

I created a very simple Rails app for demonstration: https://github.com/martingregoire/data-migrate-multi-db

It has a primary database with a Post model, and a secondary database with a Comment model.

Now I'd like to add a data migration that is run on the Post model, and another one on the Comment model. So I do this:

$ rails g data_migration DoSomethingWithPosts --database primary
create  db/data/20230608080311_do_something_with_posts.rb

$ rails g data_migration DoSomethingWithComments --database secondary
create  db/data/20230608080320_do_something_with_comments.rb

Then to run the data migrations:

$ rails db:migrate:with_data
== Data =======================================================================
== 20230608080311 DoSomethingWithPosts: migrating =============================
== 20230608080311 DoSomethingWithPosts: migrated (0.0024s) ====================

== Data =======================================================================
== 20230608080320 DoSomethingWithComments: migrating ==========================
== 20230608080320 DoSomethingWithComments: migrated (0.0018s) =================

== Schema =====================================================================
== 20230608074207 CreatePosts: migrating ======================================
-- create_table(:posts)
   -> 0.0010s
== 20230608074207 CreatePosts: migrated (0.0010s) =============================

== Data =======================================================================
== 20230608080311 DoSomethingWithPosts: migrating =============================
== 20230608080311 DoSomethingWithPosts: migrated (0.0004s) ====================

== Data =======================================================================
== 20230608080320 DoSomethingWithComments: migrating ==========================
== 20230608080320 DoSomethingWithComments: migrated (0.0004s) =================

What happens is that

The schema migration creates a posts table in the secondary database:

$ sqlite3 db/primary.sqlite3 .tables
ar_internal_metadata
data_migrations
schema_migrations
posts

$ sqlite3 db/secondary.sqlite3 .tables
ar_internal_metadata
data_migrations
schema_migrations
comments
posts

In db/data_schema.rb we see this: DataMigrate::Data.define(version: 20230608080320) In db/secondary_data_schema.rb this: DataMigrate::Data.define(version: 20230608080320)

Am I doing something wrong? Do I have to use a different directory for the data migrations for the secondary database?

The creation of the posts table on the secondary bug looks like a bug to me. And I would expect that for the primary database only the DoSomethingWithPosts data migration was run (setting the version to 20230608080311 in db/data_schema.rb), and that for the secondary database only the DoSomethingWithComments data migration was run.

ilyakatz commented 1 year ago

hey folks, there was a recent by @chaunce update to support multiple databases https://github.com/ilyakatz/data-migrate/pull/249, does this address your concerns?

chaunce commented 1 year ago

It looks like this was tested on version 10.0.0, so it should include the new support for multiple databases.

Regarding your two issues: each data migration is run twice

The test cases added were configured on a mock application using horizontal sharding [https://guides.rubyonrails.org/active_record_multiple_databases.html#horizontal-sharding]. In this case, the data migrations may need to run on all connections. For example, if the User class has records across connections :shard_0, :shard_1, and :shard_2, the data migration will need to run on each of these. On the other hand, if you have role-based connections [https://guides.rubyonrails.org/active_record_multiple_databases.html#activating-automatic-role-switching] you may not want the data migrations to run on all connections. For example, if the User class has a :read connection and a :write connection, you may only want to run the data migration on the :write connection. But connection roles are specified at the AR level, and can even differ from model to model, so it's not inherently clear if or when some connections should be skipped for a data migration.

Since data migrations piggy-back off the scheme migration tasks/code, the implementation I added uses the same configuration, and a data migration that is not built for a specific connection will run on all connections. In the data migration itself, you should be able to control what data and how the data on the current connection is migrated. I didn't test connection-specific data migrations myself (those created using a --database db_name flag), we may need some additional code to support this.

a schema migration is run, but on the wrong database This may be a bug, the quickest way to determine is probably to drop the db, remove the data-migrate gem, then run the schema migration again. If the behavior is the same, this is what is expected. If the behavior is different with data-migrate installed then a bug needs to be fixed.

I can look into these issues sometime, but I don't know exactly when..

martingregoire commented 1 year ago

Thanks @chaunce!

each data migration is run twice

Our use case is neither sharding nor role-based connections; we simply use multiple databases for separation of the storage of records. Imagine the main, important data in the primary database, and some not so important data in the secondary database, which has lower performance or availability.

Do you happen to have an example of how to check for the current connection inside a data migration file?

a schema migration is run, but on the wrong database

Without the data-migrate gem this does not happen: the schema migrations are only run on the correct database.

I created these schema migrations using the --database option:

rails g model Post title:string --database primary
rails g model Comment body:string --database secondary

Running rails db:migrate will create a posts table in the primary DB and a comments table in the secondary DB.

After adding the gem, running rails db:migrate:with_data will create a posts table in the secondary DB.

ngan commented 1 year ago

Hey everyone, at Gusto, we rely heavily on this gem. As of version 10, things just stopped working completely for us. @wildmaples and I have been debugging and have realized the implementation for multiple databases is...not quite right. We think that you actually don't need each database to maintain it's own data migration versions at all. We just need the single primary db to keep track since data migrations can make changes to all databases already.

All the issues that folks are seeing (eg see below), can probably be addressed in other ways.

bin/rake data:schema:load
my-project/db/secondary_db_data_schema.rb doesn't exist yet. Run `rake data:migrate` to create it, then try again.

At this point we're not sure what the intention of creating a data_migrations table in every single database is for since there isn't a dedicated data migrations directory for each database. This is why folks are seeing data migrations running multiple times -- it's once for every database you have.

We have two paths forward:

  1. Give each database its own data migration directory. But I think this concept is overkill. You don't need to segregate your data migrations because it's just a ruby file that executes arbitrary ruby code. And as such, you can make modifications to any database.
  2. Revert the changes in v10 that gives each database its own data_migrations table and go back to only the primary database keeping track of data migrations. We'll just have to fix the original problem that this issue was point out (which should be easy).

We've reached out to @ilyakatz to schedule a Zoom call to figure out what the intention was so we know how to resolve this.

chaunce commented 1 year ago

Having a data_migrations table on every database is necessary for the same reason a schema_migrations table is on every database. If all your databases match you might not need these, but if they don't you probably will. There are some bugs that I plan to look into, but if we remove data_migrations from all but one table then this gem won't work for anyone that does not have identical horizontal shading and only ever wants to run data migrations AR records stored in a database where the connection is configured with a :write role.

ilyakatz commented 1 year ago

Hi folks, I just re-read the comments and I sounds like people use multiple databases for different usecases

  1. sharding - all databases have the same tables with same structure but data is loaded across different databases
  2. data separation - databases have different tables with unrelated data

Are there other usecases?

From what I can tell, @chaunce solved for sharding usecase, correct?

I didn't test connection-specific data migrations myself (those created using a --database db_name flag), we may need some additional code to support this.

I can see why this could be a problem as people are expecting to use it the same way as it's used in Rails, but technically it's not a requirement that it this gem has to operate the same way although not ideal.

After adding the gem, running rails db:migrate:with_data will create a posts table in the secondary DB.

this sounds like a big problem

So where do we go from here?

ilyakatz commented 1 year ago

cc @opti I know you were active more recently, see discussion above

chaunce commented 1 year ago

Some feedback on the previous comment:

it sounds like people use multiple databases for different usecases .. Are there other usecases? The only other major strategies I'm aware of is the separation of Reads and Writes, and non-standard schemas.

However, there's also the combination of multiple strategies. I can use horizontal shading for client data (all the same tables across all databases), but also have Application Specific data only one connection, and separate connections for Reads and Writes. In Rails we just name the connections, but there's nothing inherently different about a Read Only connection and a Write connection (at least currently https://edgeguides.rubyonrails.org/active_record_multiple_databases.html#load-balancing-replicas) except that you configure it that way. Connections can be configured at the application level (ApplicationRecord), or for each Active Record model individually, so each model can have its own connection configuration.

For example:

Using the following database.yml

production:
  primary:
    database: primary
    username: root
  primary_replica:
    database: primary
    username: root_readonly
    replica: true
  extended:
    database: extended
    username: root
    migrations_paths: db/extended
  client_a:
    database: client_a
    username: client_a
    migrations_paths:
        - "db/migrate"
        - "db/clients"
        - "db/client_a"
  client_a_replica:
    database: client_a
    username: client_a_readonly
    replica: true
  client_b:
    database: client_a
    username: client_a
    migrations_paths:
        - "db/migrate"
        - "db/clients"
        - "db/client_b"
  client_b_replica:
    database: client_a
    username: client_a_readonly
    replica: true

And [example - maybe non-functional] application code:

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  connects_to database: { writing: :primary, reading: :primary_replica } # role specific connections
end

class Clients < ApplicationRecord
end

class ClientDetails
  connects_to database: { writing: : extended, reading: : extended }
  belongs_to :client
end

class Widgets < ApplicationRecord
  belongs_to :client

  scope for_client(client) do
    ActiveRecord::Base.connected_to(role: :reading, shard: :"client_#{client.name}") do
      all
    end
  end
end

So all clients all exist on : primary, they are all replicated to : primary_replica, but : client_a, : client_a_replica, :client_b, and :client_b_replica also have the clients table and may have this data replicated to all these as well, all by way of migrations in db/migrate [default]. The client_details table only exists on :extended by way of db/extended. The widgets table exists on : client_a, : client_a_replica, :client_b, and :client_b_replica by way of db/clients. And migrations in db/client_a and db/client_b add custom columns to tables on :client_a and, separately with different custom columns, :client_b.

Last, if db/migrate has 30 schema migrations, these can each be run on each individual connection at different times. So db/client_a can be at v26 and db/client_b at v30, meaning maybe both different tables and different columns.

This is all fully supported in Rails currently. This might be an uncommon configuration now, but I imagine will become more common in the future.


I can see why this could be a problem as people are expecting to use it the same way as it's used in Rails, but technically it's not a requirement that it this gem has to operate the same way although not ideal.

The way I see it there are two options:

  1. we follow the patterns of Rails and allow data migrations to run on specific connections, thereby preserving the ability to run data and schema migrations in sequence in anyone's rails application
  2. we don't, and data migrations run as a separate task

If I want to be able to run rake db:migrate:with_data then the data migrations need to follow the same patterns as the schema migrations. This is not the only way to migrate the data, but that's how the gem currently works, and so that's how I applied the changes in my pull request.

If we do continue to follow the rails migration patterns, we'll have to account for the complex configuration shown above, and part of that will be a data_migrations table on each database, among other minor changes.


this sounds like a big problem

I agree, this should be fixed. I can look into it someone soon, but don't have an exact timeline.


@ilyakatz I can join a call sometime, and I'm happy to provide time to updates, assuming in alignment on the end goal. You're the maintainer, so you get to decide where the project goes from here, and I don't want to get in the way of your progress if we're headed in different directions. I'd like to see the gem support migrations the way rails does the schema, but I understand that this may not fit in your vision of how the gem should work. Let me know your thoughts.

ilyakatz commented 1 year ago

@chaunce thanks, could you email me and we'll coordinate with @wildmaples to set up a call

opti commented 1 year ago

Hi everyone,

I tend to agree with @chaunce on following the Rails approach for migrations to run on specific connections. It might sounds like an overkill, but should be more versatile. We all might have different database setups since Rails allows that, but it doesn't mean that data_migrate should pick only one to support. As long as this gem claims that data migrations act like schema migrations, similar implementation seems reasonable.

However, regardless of the approach, I don't mind some breaking changes as long as there is a clear upgrade path documented somewhere.

P.S. @ilyakatz We have situation similar to Gusto (maybe not that heavily using data migrations though) and as of version 10 data migrations stopped working for us completely. Hence my recent activity with some fixes.

ilyakatz commented 1 year ago

Hi folks, just wanted to provide an update, we had a call with @ngan and @wildmaples. They will help on gem development, particularly for multi-database support and/or rails 7.1 support. I've added them as contributors. Since v10 has introduced issues with many folks, we will need to roll it back. Thanks @chaunce for your contribution, we certainly appreciate your work but seems we'd need another approach to keep it usable for wider audience. This was a good learning experience for sure.

I will yank v10.0.x versions to avoid people picking up versions that are not working