stripe-archive / mosql

MongoDB → PostgreSQL streaming replication
MIT License
1.63k stars 225 forks source link

Duplicating records and tailing the oplog? #57

Closed AndreaCrotti closed 10 years ago

AndreaCrotti commented 10 years ago
  1. Why if I run the command in this way

    mosql -c comments.yaml --mongo mongodb://django:django@localhost:27017/ --sql postgres://django:django@localhost/garage -v -v -v --no-drop-tables

    It quits when it's done importing? Should it not run forever reading the oplog?

    Is it not possible to let it run forever tailing the oplog (until stopped)?

  2. If I run the migration with a schema as below multiple times, my records get duplicated.

    This is maybe because I am not specifying an ID so I'm letting Postgres create new records with new ids. But is it not possible to make it only create new record if not there already in another way?

    I just want to make sure that if I have to run it twice I don't get duplicate data..

    alexandria:
      searchable_comment:
        :columns:
        - text: TEXT
        - status: TEXT
        - author_id:
          :source: user.user_id
          :type: INT
        - product_id: INT
        - date: TIMESTAMP

        :meta:
          :table: dp_comments_comment

Thanks

AndreaCrotti commented 10 years ago

Because the problem is that the id in Mongo is a UUID not just a long auto incrementing, I would rather not use that as Primary Key in Postgres.

The alternative I suppose could be that if something fails I do a --reimport, or if use tail-from with the timestamp when it stopped, does it make sense?

AndreaCrotti commented 10 years ago

I noticed now this

 WARN MoSQL: `mongodb://django:django@localhost:27017/' is not a replset.
 WARN MoSQL: Will run the initial import, then stop.
 WARN MoSQL: Pass `--skip-tail' to suppress this warning

So maybe thinking that the two things are correlated.. So if it's not a replset then it will not tail? To test it locally I need to create a replica set then?

nelhage commented 10 years ago

MoSQL needs a node configured as --replSet in order to tail. You don't actually need a multiple-node cluster, but MoSQL relies on reading the MongoDB oplog in order to replicate data, which isn't written out unless the Mongo server is configured as a replset node.

MoSQL requires that your collection map include a mapping for the _id field in order to properly replicate changes and to de-duplicate identical records. Since _id is the primary key on the MongoDB side, we can't tell whether a record is a duplicate or a new record unless we also preserve the _id field onto the Postgres side

AndreaCrotti commented 10 years ago

Ok thanks now I understood how it works, for production we have a replica set I just have to configure one also locally to test it..

About the _id yes fair enough makes sense, but the question is, can I map the MongoDB id to a field that is not the primary key in PostgreSQL? Would it still be able to de-duplicate identical records?

Because I would really like to avoid having a primary key an UUID, Django doesn't like so much if you mess around with the primary keys..

AndreaCrotti commented 10 years ago

I did some experiment myself and unfortunately I think the mapping does not work if it's not the primary key.

I just added this mapping

- couch_id:
  :source: _id
  :type: UUID

And couch_id to the Postgres model, but if I import twice it still duplicates the rows. In theory now however it could work, because it has all the information needed to understand that the row has already been inserted right?

AndreaCrotti commented 10 years ago

By the way I found out the trick, the field I map to does not have to be a primary key as long it's unique and not null.

Maybe this would be nice to be documented somewhere since it's not so evident otherwise, thanks!

nelhage commented 10 years ago

Great, glad it's working out! I'll close this for now; I'd probably consider a PR adding documentation, but I won't likely get around to writing any just yet.