Open aubertc opened 1 year ago
Please, can you provide an update on that? Having some simple tests would help. I would also be curious to test your work on a single database: can we identify when the same entity is present twice in the same data set, but under different names / spelling? How close can we get to that?
Once I know where you stand, I can provide more precise guidance.
I can't get to matching until the foreign key/drop table issue is solved. By that, I mean I can't gauge where the matching algorithm stands until that issue is fixed and I can actually run it.
I assume you are referring to https://github.com/popbr/data-integration/issues/12 , ok.
Yes, that is correct.
Matching is now accessible. If you run the method, know that there are `System.out.println` calls I use for testing that will significantly slow the program. Be wary if you use it.
As it stands now, the program puts an entry into the LinkTable if it does not exactly match another entry already in the LinkTable. For example, "Loma University" and "Loma University " are two separate entities due to the space at the end.
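A minimal sketch of how the trailing-space case could be caught by normalizing names before the exact comparison. `NameNormalizer` and its methods are hypothetical and are not part of `DatabaseIO.java`; the normalization rules (trim, collapse whitespace, lowercase) are an assumption about what should count as "the same" name.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not taken from the repository.
final class NameNormalizer {

    // Trim the ends, collapse runs of whitespace to one space, and lowercase.
    static String normalize(String raw) {
        return raw.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Keep the first spelling seen for each normalized name, in input order.
    static List<String> uniqueEntities(List<String> names) {
        Set<String> seenKeys = new LinkedHashSet<>();
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (seenKeys.add(normalize(name))) {
                kept.add(name);
            }
        }
        return kept;
    }
}
```

With this, "Loma University" and "Loma University " share one key, so only the first would be added. It does nothing for genuine misspellings.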
In addition to addressing my comment in the other issue, could you be more specific about what "running the method" means?
Ideally, you could even have a test database that you could import and on which you could test your matching procedure.
"Running the method", in my interpretation of it, meant running the program with the Linkage methods
After some data analysis, I can confirm that the matching program only adds an item to the LinkTable if it isn't already in there. I tested this by examining the list of additions the program outputs and confirming that none of them match each other. I can also confirm that the foreign keys are working, as there are 267 foreign keys, all pointing to the db where the entities are stored. Notably, there are some near-duplicates like "Loma University" and "Loma University", and "Maryland" university and "Marlyland" University that get picked up as unique entities.
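One possible way to flag misspellings such as "Maryland" / "Marlyland" is an edit-distance check. The sketch below uses the standard Levenshtein distance; the class name, the `isLikelySameEntity` helper, and the `maxEdits` threshold are assumptions for illustration, not something the project currently implements.

```java
// Hypothetical helper, not taken from the repository.
final class EditDistance {

    // Classic Levenshtein distance: minimum number of single-character
    // insertions, deletions, or substitutions turning a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // maxEdits is an assumed, tunable threshold.
    static boolean isLikelySameEntity(String a, String b, int maxEdits) {
        return levenshtein(a, b) <= maxEdits;
    }
}
```

A small threshold (1 or 2) would catch this typo, but it could also merge short names that are genuinely different, so any hits should probably be reviewed rather than merged automatically.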
> "Running the method", in my interpretation of it, meant running the program with the Linkage methods
So, you mean not commenting out https://github.com/popbr/data-integration/blob/15c1bf5f44cb40ea08da66ac2a1bc715f7a7533c/Project/Database-IO/src/main/java/popbr/DatabaseIO.java#L122 ?
Yes, running the methods CreateLinkageTable and LinkTable is what I mean.
Looking at https://github.com/popbr/data-integration/blob/15c1bf5f44cb40ea08da66ac2a1bc715f7a7533c/Project/Database-IO/src/main/java/popbr/DatabaseIO.java#L927 , I feel like there is a disconnect.
By "matching" you mean:
Am I correct?
By "matching" I meant "during part 2., make sure that the LinkTable does not already contain an entry that corresponds to that same entity". Of course, we can't look at this problem without doing 1./2./3. first, even partially.
You are correct about points 1-3. As for your next point, unique entries only: I can confirm that there are only unique entries. I confirmed this by comparing the length of the example DB's SQL table (267 entries) with the length of the SQL table for the LinkTable (231 entries), and I confirmed that the 36 missing entries were all repeats of earlier entries in the database.
Perhaps it's a misunderstanding on my part, but when I was looking up how to pull all entries linked by a foreign key in SQL, I read that people use inner joins to do so, and was wondering why we don't use a large set of inner join commands over foreign keys?
Also, @aubertc , if you know a way to select all entities linked by a single foreign key, I would like to know it/know where to look to get an idea. It would be in the vein of "Select an entity and all other entities linked to it by FKs".
> large set of inner join commands over foreign keys?
We probably could, if we were not also trying to identify some of those entities with one another (e.g., trying to match "Augusta Univ." with "Augusta University"). I don't think an inner join can do this sort of "approximate matching", but I may be wrong.
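To make the "approximate matching" idea concrete, here is a rough sketch of one heuristic that would treat "Augusta Univ." and "Augusta University" as the same: compare the names word by word and accept a pair when each word is a prefix of its counterpart. `AbbreviationMatcher` is hypothetical, and the prefix rule is only an assumption about how abbreviations are written in the data.

```java
import java.util.Locale;

// Hypothetical helper, not taken from the repository.
final class AbbreviationMatcher {

    // True when both names have the same number of words and, position by
    // position, one word is a prefix of the other (periods are ignored).
    static boolean matches(String a, String b) {
        String[] wordsA = tokens(a);
        String[] wordsB = tokens(b);
        if (wordsA.length != wordsB.length) {
            return false;
        }
        for (int i = 0; i < wordsA.length; i++) {
            boolean prefix = wordsA[i].startsWith(wordsB[i])
                          || wordsB[i].startsWith(wordsA[i]);
            if (!prefix) {
                return false;
            }
        }
        return true;
    }

    // Lowercase, drop periods, and split on whitespace.
    private static String[] tokens(String s) {
        return s.trim().toLowerCase(Locale.ROOT).replace(".", "").split("\\s+");
    }
}
```

This is deliberately naive: it would also accept "A. University" for "Augusta University", so it is better suited to proposing candidate pairs than to deciding matches on its own.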
> Also, @aubertc , if you know a way to select all entities linked by a single foreign key, I would like to know it/know where to look to get an idea. It would be in the vein of "Select an entity and all other entities linked to it by FKs"

Try to present the problem in a more minimal way. Write an SQL script that creates the databases the way you want them to be, and try to write the command. I'll be happy to help if you want.
> Create a foreign key from the LinkTable to the original table(s).
Hold on, that's not right. The idea is that an entry in the LinkTable may contain an entity present in multiple entries in the original data, so the foreign key should be the other way around. Or you should have a "cross-reference" table.
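To illustrate the direction I have in mind, here is a small in-memory sketch in which each original row carries the key of its LinkTable entry, so one entity can be referenced by many rows. All names (`LinkageSketch`, `SourceRow`, `linkId`) are made up for the example and do not reflect the actual schema.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model; names are illustrative only.
final class LinkageSketch {

    // One row of an original table. linkId plays the role of the foreign key
    // pointing from the original data to a LinkTable entry.
    static final class SourceRow {
        final int id;
        final String rawName;
        final int linkId;

        SourceRow(int id, String rawName, int linkId) {
            this.id = id;
            this.rawName = rawName;
            this.linkId = linkId;
        }
    }

    // All original rows that refer to the same LinkTable entity. In SQL this
    // would roughly be a WHERE clause (or a join) on the foreign key column.
    static List<SourceRow> rowsForEntity(List<SourceRow> rows, int linkId) {
        List<SourceRow> result = new ArrayList<>();
        for (SourceRow row : rows) {
            if (row.linkId == linkId) {
                result.add(row);
            }
        }
        return result;
    }
}
```

With the key on the original rows, "Augusta Univ." and "Augusta University" can both point to the same LinkTable entry, which is not possible if the LinkTable entry holds a single key back to one original row.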
Does that make sense? A live discussion may be overdue on this.
I am somewhat confused, but I do understand the base idea. A live discussion would be great.
Please, comment on the current status of the matching problem: what is implemented, what is left to do, what issues are you facing?