thejsj / rethinkdb-init

Create all RethinkDB databases, tables and indexes automatically through a schema object.
https://www.npmjs.com/package/rethinkdb-init

Duplicate tables and databases #15

Open artyomxx opened 8 years ago

artyomxx commented 8 years ago

Sometimes something strange happens and I find duplicate tables or even duplicate databases on our RethinkDB server. I think this could happen when the connection to RDB is interrupted, due to an RDB restart or some other reason. Could your module try to create a database or a table in this situation?

thejsj commented 8 years ago

Hmmm... I mean, every time you call r.init it will actually try to create the database and tables and then just handle the error if they already exist. But there should never be any duplicate tables/databases since they're all supposed to be unique.
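
The create-then-handle-the-error approach described above might look roughly like the following. This is a sketch, not rethinkdb-init's actual source: `isAlreadyExistsError` and `createIgnoringExists` are hypothetical helpers, and matching on the text "already exists" is an assumption about the wording of the driver's error message.

```javascript
// Hypothetical helper: treat an error as benign when its message says the
// database or table is already there (assumed message wording).
function isAlreadyExistsError(err) {
  return Boolean(err) && /already exists/i.test(String(err.message));
}

// Hypothetical helper: run any create operation (passed in as a function that
// returns a promise) and swallow only the "already exists" failure.
async function createIgnoringExists(create) {
  try {
    return await create();
  } catch (err) {
    if (isAlreadyExistsError(err)) return null; // already there: nothing to do
    throw err; // anything else is a real failure
  }
}
```

With the official driver this could be called as `createIgnoringExists(() => r.dbCreate('app').run(conn))`, where `r` is the driver module and `conn` an open connection.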

Can you post more details about this? Maybe post the results of running r.db('rethinkdb').table('table_config') and r.db('rethinkdb').table('db_config') to see the duplicate ones? Do you have multiple nodes running?

If this is actually a problem that's somehow caused by rethinkdb-init, then a possible solution might be to first check the existence of a db/table before creating it, but really, the database should not allow this to happen in the first place.

artyomxx commented 8 years ago

database should not allow this to happen in the first place

I thought so. But here it is. So assuming that I may not know something, I decided to first start an issue here, and then maybe go to rethinkdb issues.

Yes, my environment consists of at least three nodes, each connected through its own rethinkdb proxy to the main rethinkdb server. Two of them use a shared database (for sessions). Also, sometimes my dev team launches their own nodes, connecting to the same server and using the same databases for debugging needs.

a possible solution might be to first check the existence of a db/table before creating

IMHO, this looks better from the engineering perspective. ;)

thejsj commented 8 years ago

Can you post the results of r.db('rethinkdb').table('table_config') and r.db('rethinkdb').table('db_config') filtered by the table/db name to see the duplicate ones? Also, what happens when you pick a database that has two entries? Which one does it pick? Is it random?
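
A sketch of how the duplicates could be picked out of those results. `findDuplicateTables` is a hypothetical helper; it assumes each `table_config` row carries `id`, `db` and `name` fields, and the sample rows are made up for illustration.

```javascript
// Group table_config-style rows by "db.name" and keep only the groups with
// more than one id, i.e. names that point at several physical tables.
function findDuplicateTables(rows) {
  const idsByName = new Map();
  for (const { id, db, name } of rows) {
    const key = `${db}.${name}`;
    if (!idsByName.has(key)) idsByName.set(key, []);
    idsByName.get(key).push(id);
  }
  return [...idsByName]
    .filter(([, ids]) => ids.length > 1)
    .map(([table, ids]) => ({ table, ids }));
}

// Made-up sample rows, not real output from the server in this thread.
const sampleRows = [
  { id: 'a1', db: 'app', name: 'sessions' },
  { id: 'b2', db: 'app', name: 'users' },
  { id: 'c3', db: 'app', name: 'sessions' },
];
const duplicates = findDuplicateTables(sampleRows);
```

The rows themselves would come from running `r.db('rethinkdb').table('table_config')` and collecting the cursor into an array.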

IMHO, this looks better from the engineering perspective. ;)

The problem with that is the following: if it was able to create the database twice, who says it's going to know about that database in the first place? Why would dbList/tableList know about a database while dbCreate/tableCreate would not? Even if, for whatever reason, dbList/tableList had accurate information, you still have a race condition where a database/table could be inserted between the list query and the create query (even if it's in the same query!). Granted, this second point is still pretty minor, but something to consider.
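
The list-then-create race can be illustrated with a small in-memory simulation. This is not RethinkDB code: `makeCatalog` and `checkThenCreate` are hypothetical stand-ins, and the catalog deliberately does not enforce unique names so that the effect of a stale check is visible.

```javascript
// In-memory stand-in for a catalog that does not enforce unique names itself.
function makeCatalog() {
  const tables = [];
  return {
    list: () => tables.map(t => t.name),
    create: name => { tables.push({ id: tables.length + 1, name }); },
  };
}

// Check-then-create split into two steps so that the interleaving is explicit:
// the check runs now, the create runs later using the (possibly stale) answer.
function checkThenCreate(catalog, name) {
  const exists = catalog.list().includes(name); // step 1: list
  return () => { if (!exists) catalog.create(name); }; // step 2: create
}

const catalog = makeCatalog();
const clientA = checkThenCreate(catalog, 'sessions'); // A sees no table
const clientB = checkThenCreate(catalog, 'sessions'); // B sees no table either
clientA(); // A creates it
clientB(); // B creates it again, because its check is out of date
const raceResult = catalog.list(); // ['sessions', 'sessions']
```

Both clients performed the existence check, yet the name ends up registered twice, which is why a check before the create does not by itself rule out duplicates.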

thejsj commented 8 years ago

I'm going to try to reproduce this and see if I can get it to happen too.

More questions to help reproduce this:

  1. Are these all in the same datacenter?
  2. Do the proxy nodes run locally? On a separate server? On the same host as the DB? Same datacenter?
  3. Do you know if one of your nodes was down?
  4. What are the database/table names? Does it only happen with certain databases and not with others?

Anything else that could help in reproducing this?

artyomxx commented 8 years ago

Can you post the results of r.db('rethinkdb').table('table_config') and r.db('rethinkdb').table('db_config') filtered by the table/db name to see the duplicate ones?

Sorry, the last time it happened I fixed it as soon as I found the problem (by renaming and then deleting the clones from the RDB web console), because it prevents new nodes from working properly: the RethinkDB driver would not connect, displaying messages like ambiguous table/database name ....

  1. Yes
  2. Yes, all proxies run locally and connect to the RethinkDB server running on the same virtual machine. When someone on my dev team runs a local node, he uses a connection established through an ssh-tunnel (dev os > node > rdb-proxy > ssh-tunnel (with port 28015) > rdb server). So, it works like a server's local node.
  3. Not sure.
  4. No, random tables and databases.

artyomxx commented 8 years ago

The problem with that is the following, if it was able to create the database twice, who says it's going to know about that database in the first place?

Yeah, I understand.

thejsj commented 8 years ago

OK, a couple more questions:

  1. How did you delete it then? Through the id in the rethinkdb.db_config/rethinkdb.table_config tables?

Yes, all proxies run locally and connect to the RethinkDB server running on the same virtual machine. When someone on my dev team runs a local node, he uses a connection established through an ssh-tunnel (dev os > node > rdb-proxy > ssh-tunnel (with port 28015) > rdb server). So, it works like a server's local node.

  2. So this happens locally too? Even when you have only 1 non-proxy RethinkDB node in the cluster? If so, what are you using for your virtual machine (not sure it matters that much, but just checking)?

I think this might be enough to at least try to replicate it.

artyomxx commented 8 years ago

  1. RethinkDB web-console provides a way to rename it in the 'issues' section. I guess it's the only option, since all other ways will lead you to ambiguous table/db name.
  2. No. This happens with the RethinkDB server running on the virtual machine that we connect to through ssh-tunnels. So, when we run our nodes locally we are actually working with the remote database on the virtual machine (a Digital Ocean droplet). The ssh-tunnel provides a connection to the main RDB server process, so the dev's local rethinkdb proxy process connects to RethinkDB on the remote server. (I mistyped the port: it is 29015, the default cluster port, of course.) At the same time, there are a few production-ready nodes running on this server, working with their own rethinkdb proxies and using the same RethinkDB server, which is local for them.
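
The rename step from point 1 could in principle also be planned in code, addressing each clone by id rather than by its ambiguous name. The following is only a sketch: `planRenames` and the `_duplicate_N` naming scheme are hypothetical, and which id to keep under the original name is an arbitrary choice here.

```javascript
// Hypothetical helper: given the ids that share one table name, keep the first
// and propose a unique new name for each of the others.
function planRenames(name, ids) {
  return ids.slice(1).map((id, i) => ({ id, newName: `${name}_duplicate_${i + 1}` }));
}

// Made-up ids for illustration.
const renamePlan = planRenames('sessions', ['a1', 'c3', 'd4']);
// Each entry could then be applied by id, which sidesteps the ambiguous name:
//   r.db('rethinkdb').table('table_config').get(id).update({ name: newName })
```

Before deleting anything it would be worth checking which of the renamed tables actually holds the data.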

Oh. Hope it's possible to understand all this. :)

But I guess that all these things with ssh-tunnels and local-node-remote-rdb are not important because, if I understand correctly, this happened a few times when no one was using them. So, there were two nodes using one shared database, each through its own rethinkdb proxy connecting to localhost:29015.

thejsj commented 8 years ago

Hey @lolwhoami, @tjmehta was able to repro this and get these two guys, so this is definitely a thing. Seems to be a RethinkDB error for sure though (TJ doesn't use this module), but it'd be interesting to find a good way to repro and see if anything can be done about it.

[Two screenshots, taken 2016-06-30 at 4:25:12 pm and 4:27:05 pm]

artyomxx commented 7 years ago

Well, if I understand correctly, this happens when the rethinkdb proxy is started and the node is connected, but the proxy is not yet fully connected to the server, so it doesn't know that some tables or databases exist. So could we somehow make sure that the proxy is fully connected? Maybe requesting the dbs/tables list will do the trick?

artyomxx commented 7 years ago

Just to let you know and maybe close the issue.

I've been running some tests, creating dbs and tables through rethinkdb-proxy, and found out that checking db(...).config() and table(...).status() before creating dbs and tables doesn't help at all. It looks like at some point the rethinkdb driver thinks that there are no such tables, so it's OK to create them. Also, sometimes status() returns a "does not exist" error, but the following tableCreate() fails with an "already exists" error. Just to mention, database creation seems to be fine: I couldn't make dbCreate() create a duplicate database, and it looks like db(...).config() never fails to find an existing database. So, the problem is with tables only. Also, r.db(...).wait() and r.db(...).table(...).wait() don't help either; they just resolve immediately, before the duplicate tables are created.

So, looks like the only solution here is to use some separate file or db to store information about actual db structure, just as they stated in the issue you've mentioned. But it's so weird! Oh...
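
The idea of keeping the known structure outside the database could be sketched as follows. This is not the code of any existing module: `parseState`, `missingTables` and `recordCreated` are hypothetical, the state shape `{ tables: [...] }` is an assumption, and a plain string stands in for the contents of the JSON file.

```javascript
// Parse the recorded structure; no recorded text means nothing was created yet.
function parseState(jsonText) {
  return jsonText ? JSON.parse(jsonText) : { tables: [] };
}

// Tables in the wanted schema that the recorded state does not list yet.
function missingTables(wanted, state) {
  return wanted.filter(name => !state.tables.includes(name));
}

// Serialize the state with the newly created tables added, ready to be saved.
function recordCreated(state, created) {
  return JSON.stringify({ tables: [...state.tables, ...created] });
}

let saved = ''; // stands in for the contents of the JSON state file
const firstRun = missingTables(['users', 'sessions'], parseState(saved));
saved = recordCreated(parseState(saved), firstRun); // after creating them
const secondRun = missingTables(['users', 'sessions'], parseState(saved));
```

On the first run both tables are reported as missing; once they are recorded, the second run reports none, so nothing would be created again. This only helps if every node consults the same recorded state.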

artyomxx commented 7 years ago

I've built my own version of rethinkdb-init with db state saved in json file. It's here: https://github.com/lolwhoami/rdb-init

tayler-king commented 6 years ago

Hey, I've created my own handler for initiating new databases, similar to this project, for an application I'm working on. Unfortunately there isn't much of a way around the race conditions, but I did write this little gem which disgustingly solves the problem:

  // Removes every table involved in a name collision except the first id listed.
  // Deleting rows from table_config removes those tables.
  _RemoveRethinkCollisions() {
      const Rethink = this._Database.RethinkDB;
      return Rethink.db('rethinkdb').table('current_issues').filter({
          type: 'table_name_collision'
      }).run().then(Issues => {
          if (!Issues.length)
              return Logger.Info('No issues found');
          // For each collision keep the first id and collect the rest
          const TablesToRemove = Issues
              .map(Issue => Issue.info.ids.slice(1))
              .reduce((All, Ids) => All.concat(Ids), []);
          return Rethink.db('rethinkdb').table('table_config')
              .getAll(...TablesToRemove).delete().run()
              .then(Result => console.log(Result));
      });
  }

This might not be appropriate to your use case and this issue is over 7 months old but I wanted to provide some insight in case it is helpful.

Cheers.