MNSleeper commented 1 year ago

The progress block right now is foreign keys (from linkage) blocking tables from being dropped and created (tables from last execution not being able to be dropped) on new executions. Solutions I've found are:

Finding a way to destroy all FKs at the end of the program (before the system exit)
Using "SET FOREIGN_KEY_CHECKS = 0;", dropping a table, "SET FOREIGN_KEY_CHECKS = 1;" at every instance of a DropTable to drop it. a. My thought line is, if we do this at every drop table, there will be no FKs left to mess with droppage. But, this leaves tables in the DB if a table is created in one execution, but not in another -- essentially, it gets created, but never dropped on subsequent executions of the program, so it stays there permanently. This creates clutter, right?
Dropping the Database. (Last choice)

@aubertc , what do you suggest? I'm leaning towards 2, but know there might be unintended effects I don't know about that such a method would cause.

aubertc commented 1 year ago

This may be a XY problem: why are you trying to drop the tables in the first place? Your program should be able to update the tables, rather than dropping everything and starting from scratch.

Now, for debugging purposes, it could be useful indeed to "start fresh", but in that case, all the databases should be dropped, and not just the foreign key constraint.

MNSleeper commented 1 year ago

Right. I just worry that, by overwriting/updating tables, there will be leftover entries from other executions. Is that a legitimate fear?

aubertc commented 1 year ago

Right. I just worry that, by overwriting/updating tables, there will be leftover entries from other executions. Is that a legitimate fear?

This is a more complex discussion that we should maybe have in-person. What is this software exactly trying to achieve?

One-time process

Is it a "one-time process" that takes one (or multiple) databases, and output one spreadsheet with the all data linked and tidied?

E.g. it takes databases from the NSF and the NIH and output one single spreadsheet containing the combined information?

Spread-sheet calculator

Is it a "spread-sheet calculator" that process data inside its database and output one spreadsheet with some parameters fixed?

E.g., it loads data from the NSF, the NIH, etc. once and for all, asks for user-input, and output the spreadsheet containing the information relevant to the user-input.

I believe that, ideally, it should be a software

that has an internal database of databases (so, one table for the NSF entries, one table for the NIH entries, etc.), that can be expanded and tweaked,
that connects those databases using additional tables to "link them",
that can output, based on current information, spreadsheets containing all the information, or the information filtered along different criterias.

Then, if a new database is added or if a database is updated, the process should be:

look if this database is already in the set,
if it is, observe the differences,
if new information was added, add it to the program's database,
if old information was deleted, remove it <--- I believe this is what you were asking,
if the new information conflict with what we have, then indeed we need to think about that case (I believe I would simply take the new information).

In all of those cases, I don't think that dropping anything is needed, but may be wrong.

aubertc commented 1 year ago

if it is, observe the differences,

This is a complicated problem, but we can find ways. One possible process is (here, I am supposing that dataA.csv had an older version loaded in the table DataA in sql):

Create a temporary table, DataATemp,
Compare DataA and DataATemp (probably by PK),
If an entry is in DataATemp but not in DataA, insert it,
If an entry is in DataA but not in DataATemp, delete it,
If an entry is in both DataATemp in DataA, then … more work is needed, and we can cross that bridge when we get there.

aubertc commented 1 year ago

Oh, and there are probably tools that can do that for us, I can investigate if you want.

MNSleeper commented 1 year ago

After reading this, I agree that the internal databases should be kept over multiple execution, and that this is very complicated.

For the purposes of building/testing/improving the matching system, I want be able to start over until we are confident that the matching methods/algorithm are solid. Once its at a good spot, making it consistent across executions sounds like a good way of going about it.

For now, I can just drop the database to get around the FK constraint errors, right?

aubertc commented 1 year ago

For now, I can just drop the database to get around the FK constraint errors, right?

Yes, absolutely, you can always "start fresh" and dump everything at the end of each execution.

MNSleeper commented 1 year ago

Then I will start on this.

aubertc commented 1 year ago

Ok, that's probably a good first approximation: make it work without the "updating" capacities. It will be slower, but it will be easier to debug / test.

MNSleeper commented 1 year ago

Added method to drop the db and create it before doing anything else in SQL. Now I can access the matching function.

aubertc commented 1 year ago

I guess we can close this issue, now, and potentially open another one to discuss the "let's update instead of dumping everything" issues.

MNSleeper commented 1 year ago

That sounds like a plan.

popbr / data-integration

Dropping tables with Foreign Keys #12

One-time process

Spread-sheet calculator