popbr / data-integration

Apache License 2.0

Status of matching #9

Open aubertc opened 1 year ago

aubertc commented 1 year ago

Please, comment on the current status of the matching problem: what is implemented, what is left to do, what issues are you facing?

aubertc commented 1 year ago

Please, can you provide an update on that? Having some simple tests would help. I would also be curious to test your work on a single database: can we identify when the same entity is present twice in the same data set, but under different names / spelling? How close can we get to that?

Once I know where you stand, I can provide more precise guidance.

MNSleeper commented 1 year ago

I can't get to matching right now until the foreign key/drop table issue is solved. By that, I mean I can't gauge where the matching alg. is at without solving the issue so I can run it.

aubertc commented 1 year ago

I assume you are referring to https://github.com/popbr/data-integration/issues/12 , ok.

MNSleeper commented 1 year ago

Yes, that is correct.

MNSleeper commented 1 year ago

Matching is now accessible. If you run the method, know that there are System.out.println calls I use for testing that will significantly slow the program. Be wary if you use it.

As it stands now, the program adds an entry to the LinkTable whenever it does not exactly match an existing entry in the LinkTable. For example, "Loma University" and "Loma University " are treated as two separate entities due to the space at the end.
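
A minimal sketch of one way to make names that differ only by whitespace or letter case compare equal before the exact-match check. The class `NameNormalizer` and its method are hypothetical and are not taken from DatabaseIO.java; whether lower-casing is desirable for this data set is an assumption.

```java
import java.util.Locale;

// Hypothetical helper, not part of the repository.
final class NameNormalizer {
    private NameNormalizer() {}

    /**
     * Returns a comparison key for a raw name: leading/trailing whitespace
     * removed, internal whitespace runs collapsed to one space, lower-cased.
     */
    static String normalize(String raw) {
        if (raw == null) {
            return "";
        }
        return raw.trim()
                  .replaceAll("\\s+", " ")
                  .toLowerCase(Locale.ROOT);
    }
}
```

Comparing `NameNormalizer.normalize(a)` with `NameNormalizer.normalize(b)` instead of the raw strings would treat "Loma University" and "Loma University " as the same key, while the original spelling can still be stored unchanged.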

aubertc commented 1 year ago

In addition to addressing my comment in the other issue, could you be more specific as to what is "running the method"?

Ideally, you could even have a test database that you could import and on which you could test your matching procedure.

MNSleeper commented 1 year ago

"Running the method", in my interpretation of it, meant running the program with the Linkage methods

MNSleeper commented 1 year ago

After some data analysis, I can confirm that the matching program only adds an item to the LinkTable if it isn't already in there. I tested this by examining a list of outputted additions to the table and confirming there are no exact matches among them. I can also confirm that the foreign keys are working: there are 267 foreign keys, all pointing to the db where the entities are stored. Notably, there are some near-duplicates, like "Loma University" and "Loma University " (trailing space), and "Maryland" University and "Marlyland" University, that get picked as unique entities.
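
One common way to catch misspellings such as "Maryland" / "Marlyland" is an edit-distance check. The sketch below is a plain Levenshtein distance; the class `EditDistance`, the method names, and the idea of a fixed distance threshold are assumptions for illustration, not something the project currently implements.

```java
// Hypothetical helper, not part of the repository.
final class EditDistance {
    private EditDistance() {}

    /** Classic Levenshtein distance (insertions, deletions, substitutions). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** True when the two names are within maxDistance edits of each other. */
    static boolean isLikelySame(String a, String b, int maxDistance) {
        return levenshtein(a, b) <= maxDistance;
    }
}
```

A small threshold keeps false positives down, but any threshold will merge some genuinely different short names, so flagged pairs may be better reviewed than merged automatically.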

aubertc commented 1 year ago

"Running the method", in my interpretation of it, meant running the program with the Linkage methods

So, you mean not commenting https://github.com/popbr/data-integration/blob/15c1bf5f44cb40ea08da66ac2a1bc715f7a7533c/Project/Database-IO/src/main/java/popbr/DatabaseIO.java#L122 ?

MNSleeper commented 1 year ago

Yes, running the methods CreateLinkageTable and LinkTable is what I mean.

aubertc commented 1 year ago

Looking at https://github.com/popbr/data-integration/blob/15c1bf5f44cb40ea08da66ac2a1bc715f7a7533c/Project/Database-IO/src/main/java/popbr/DatabaseIO.java#L927 , I feel like there is a disconnection.

By "matching" you mean:

  1. Parse the table(s?) created by importing the databases,
  2. Create one entry in the "LinkTable" per entity (in that case, University's name)
  3. Create a foreign key from the LinkTable to the original table(s).

Am I correct?

By "matching" I meant "during part 2., make sure that the LinkTable doesn't already contain an entry corresponding to that same entity". Of course, we can't look at this problem without doing 1./2./3. first, even partially.
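
The "look up before inserting" behaviour described for part 2 can be sketched in memory as follows. This is only an illustration under assumptions: `InMemoryLinkTable` is a hypothetical stand-in for the SQL LinkTable, and the lookup key (trimmed, whitespace-collapsed, lower-cased) is one possible notion of "same entity", not the project's.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory stand-in for the SQL LinkTable.
final class InMemoryLinkTable {
    // Maps a comparison key to the id of the link entry for that entity.
    private final Map<String, Integer> idByKey = new LinkedHashMap<>();

    /** Returns the id of the matching entry, creating one only if none exists. */
    int findOrAdd(String rawName) {
        String key = rawName.trim()
                            .replaceAll("\\s+", " ")
                            .toLowerCase(Locale.ROOT);
        Integer existing = idByKey.get(key);
        if (existing != null) {
            return existing;          // entity already known: reuse its id
        }
        int id = idByKey.size() + 1;  // ids start at 1 and grow by one
        idByKey.put(key, id);
        return id;
    }

    /** Number of distinct entities recorded so far. */
    int size() {
        return idByKey.size();
    }
}
```

The id returned by `findOrAdd` is what each original row would reference, whether the entry was just created or already existed.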

MNSleeper commented 1 year ago

You are correct about points 1-3. As to your next point (unique entries only), I can confirm that there are only unique entries. I confirmed this by comparing the number of entries in the example DB's SQL table (267 entries) with the number in the SQL table for the LinkTable (231 entries), and I confirmed that the 36-entry deficit consisted entirely of repeats of prior entries in the database.

MNSleeper commented 1 year ago

Perhaps it's a misunderstanding on my part, but when I was looking up how to pull all entries linked by a foreign key in SQL, I read that people use inner joins to do so, and I was wondering why we don't use a large set of inner join commands over foreign keys?

Also, @aubertc, if you know a way to select all entities linked by a single foreign key, I would like to know it, or know where to look to get an idea. It would be in the vein of "Select an entity and all other entities linked to it by FKs".

aubertc commented 1 year ago

large set of inner join commands over foreign keys?

We probably could if we were not trying to identify some of those entities (i.e., trying to match "Augusta Univ." with "Augusta University"). I don't think an inner join can do this sort of "approximate matching", but I may be wrong.
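
As an illustration of the kind of "approximate matching" meant here, the sketch below treats a token ending in "." as an abbreviation and accepts it when its stem is a prefix of the corresponding token in the other name. The class `AbbreviationMatcher` and this rule are assumptions made for the example; real data would likely need more rules (reordered words, acronyms, and so on).

```java
import java.util.Locale;

// Hypothetical helper, not part of the repository.
final class AbbreviationMatcher {
    private AbbreviationMatcher() {}

    /** True if abbr ends with '.' and its non-empty stem is a prefix of full. */
    private static boolean isAbbreviationOf(String abbr, String full) {
        if (!abbr.endsWith(".")) {
            return false;
        }
        String stem = abbr.substring(0, abbr.length() - 1);
        return !stem.isEmpty() && full.startsWith(stem);
    }

    /** Compares two names token by token, tolerating dotted abbreviations. */
    static boolean sameEntity(String a, String b) {
        String[] ta = a.trim().toLowerCase(Locale.ROOT).split("\\s+");
        String[] tb = b.trim().toLowerCase(Locale.ROOT).split("\\s+");
        if (ta.length != tb.length) {
            return false;
        }
        for (int i = 0; i < ta.length; i++) {
            boolean ok = ta[i].equals(tb[i])
                    || isAbbreviationOf(ta[i], tb[i])
                    || isAbbreviationOf(tb[i], ta[i]);
            if (!ok) {
                return false;
            }
        }
        return true;
    }
}
```

With this rule, `AbbreviationMatcher.sameEntity("Augusta Univ.", "Augusta University")` holds, while names with a different number of words are never matched.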

Also, @aubertc, if you know a way to select all entities linked by a single foreign key, I would like to know it, or know where to look to get an idea. It would be in the vein of "Select an entity and all other entities linked to it by FKs".

Try to present the problem in a more minimal way. Write an SQL script that creates the databases the way you want them to be, and try to write the command. I'll be happy to help if you want.

aubertc commented 1 year ago

Create a foreign key from the LinkTable to the original table(s).

Hold on, that's not right. The idea is that an entry in the LinkTable may contain an entity present in multiple entries in the original data, so the foreign key should be the other way around. Or you should have a "cross-reference" table.

Does that make sense? A live discussion may be overdue on this.
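
To make the direction of the reference concrete, here is a small in-memory sketch of one possible reading of "the other way around": each original row carries the id of its LinkTable entry, so one link entry can be shared by many rows. The classes `LinkModel` and `SourceRow` and their fields are hypothetical; they model the idea only and do not reflect the actual schema.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the tables, not the project's schema.
final class LinkModel {
    private LinkModel() {}

    /** One row of an imported table; linkId plays the role of the foreign key. */
    static final class SourceRow {
        final int rowId;
        final String rawName;
        final int linkId; // references the LinkTable entry for this entity

        SourceRow(int rowId, String rawName, int linkId) {
            this.rowId = rowId;
            this.rawName = rawName;
            this.linkId = linkId;
        }
    }

    /** Selects every original row that refers to the given link entry. */
    static List<SourceRow> rowsForEntity(List<SourceRow> rows, int linkId) {
        List<SourceRow> result = new ArrayList<>();
        for (SourceRow row : rows) {
            if (row.linkId == linkId) {
                result.add(row);
            }
        }
        return result;
    }
}
```

In this layout, "select an entity and everything linked to it" becomes a filter on the referencing column. A separate cross-reference table holding (link id, row id) pairs would answer the same question without adding a column to the original tables.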

MNSleeper commented 1 year ago

I am somewhat confused, but I do understand the base idea. A live discussion would be great.