popbr / data-integration

Apache License 2.0
1 stars 4 forks source link

Dropping tables with Foreign Keys #12

Closed MNSleeper closed 1 year ago

MNSleeper commented 1 year ago

The progress block right now is foreign keys (from linkage) blocking tables from being dropped and created (tables from last execution not being able to be dropped) on new executions. Solutions I've found are:

  1. Finding a way to destroy all FKs at the end of the program (before the system exit)
  2. Using "SET FOREIGN_KEY_CHECKS = 0;", dropping a table, "SET FOREIGN_KEY_CHECKS = 1;" at every instance of a DropTable to drop it. a. My thought line is, if we do this at every drop table, there will be no FKs left to mess with droppage. But, this leaves tables in the DB if a table is created in one execution, but not in another -- essentially, it gets created, but never dropped on subsequent executions of the program, so it stays there permanently. This creates clutter, right?
  3. Dropping the Database. (Last choice)

@aubertc , what do you suggest? I'm leaning towards 2, but know there might be unintended effects I don't know about that such a method would cause.

aubertc commented 1 year ago

This may be a XY problem: why are you trying to drop the tables in the first place? Your program should be able to update the tables, rather than dropping everything and starting from scratch.

Now, for debugging purposes, it could be useful indeed to "start fresh", but in that case, all the databases should be dropped, and not just the foreign key constraint.

MNSleeper commented 1 year ago

Right. I just worry that, by overwriting/updating tables, there will be leftover entries from other executions. Is that a legitimate fear?

aubertc commented 1 year ago

Right. I just worry that, by overwriting/updating tables, there will be leftover entries from other executions. Is that a legitimate fear?

This is a more complex discussion that we should maybe have in-person. What is this software exactly trying to achieve?


One-time process

Is it a "one-time process" that takes one (or multiple) databases, and output one spreadsheet with the all data linked and tidied?

E.g. it takes databases from the NSF and the NIH and output one single spreadsheet containing the combined information?

Spread-sheet calculator

Is it a "spread-sheet calculator" that process data inside its database and output one spreadsheet with some parameters fixed?

E.g., it loads data from the NSF, the NIH, etc. once and for all, asks for user-input, and output the spreadsheet containing the information relevant to the user-input.


I believe that, ideally, it should be a software

Then, if a new database is added or if a database is updated, the process should be:

In all of those cases, I don't think that dropping anything is needed, but may be wrong.

aubertc commented 1 year ago

if it is, observe the differences,

This is a complicated problem, but we can find ways. One possible process is (here, I am supposing that dataA.csv had an older version loaded in the table DataA in sql):

aubertc commented 1 year ago

Oh, and there are probably tools that can do that for us, I can investigate if you want.

MNSleeper commented 1 year ago

After reading this, I agree that the internal databases should be kept over multiple execution, and that this is very complicated.

For the purposes of building/testing/improving the matching system, I want be able to start over until we are confident that the matching methods/algorithm are solid. Once its at a good spot, making it consistent across executions sounds like a good way of going about it.

For now, I can just drop the database to get around the FK constraint errors, right?

aubertc commented 1 year ago

For now, I can just drop the database to get around the FK constraint errors, right?

Yes, absolutely, you can always "start fresh" and dump everything at the end of each execution.

MNSleeper commented 1 year ago

Then I will start on this.

aubertc commented 1 year ago

Ok, that's probably a good first approximation: make it work without the "updating" capacities. It will be slower, but it will be easier to debug / test.

MNSleeper commented 1 year ago

Added method to drop the db and create it before doing anything else in SQL. Now I can access the matching function.

aubertc commented 1 year ago

I guess we can close this issue, now, and potentially open another one to discuss the "let's update instead of dumping everything" issues.

MNSleeper commented 1 year ago

That sounds like a plan.