The user has access to a path that contains a collection of unstructured tables, and wants to find which of those tables may be joined to a main table (what would be "X").
A table can be joined (is a "candidate table") on a column in X if the Jaccard containment of that column in one of the table's columns is large.
This object takes as parameters the path that contains the tables (may include subfolders), the list of columns that must be checked as possible join keys ("query columns"), and a budget to keep only the top-k candidates; X is passed at fit time.
During fit/transform, all the tables in the path are loaded into memory, all categorical/string columns* are selected, and the Jaccard containment of each query column (from X) is measured against all the selected columns in the loaded tables.
The output of transform is a ranking of candidate joins that includes the query column in the main table, the path to the candidate table, the key in the candidate table, and the containment.
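To make the main points concrete, here is a minimal sketch of the containment measure and the resulting ranking. It is not the code in this PR: it uses plain Python sets and dicts instead of dataframes, the names `jaccard_containment` and `rank_candidates` are made up for illustration, and it assumes the budget applies to the whole ranking rather than per query column. Containment is taken to be the fraction of distinct values of the query column that also appear in the candidate column.

```python
from typing import Dict, Iterable, List, Tuple

# A "table" here is just a mapping from column name to a list of values.
Table = Dict[str, list]


def jaccard_containment(query: Iterable, candidate: Iterable) -> float:
    """Fraction of distinct query values that also appear in the candidate."""
    query_set = set(query)
    if not query_set:
        return 0.0
    return len(query_set & set(candidate)) / len(query_set)


def rank_candidates(
    main_table: Table,
    query_columns: List[str],
    candidate_tables: Dict[str, Table],
    budget: int,
) -> List[Tuple[str, str, str, float]]:
    """Return the top `budget` rows of (query column, path, key, containment)."""
    ranking = []
    for query_col in query_columns:
        for path, table in candidate_tables.items():
            for key, values in table.items():
                score = jaccard_containment(main_table[query_col], values)
                ranking.append((query_col, path, key, score))
    # Highest containment first; the sort is stable, so ties keep scan order.
    ranking.sort(key=lambda row: row[3], reverse=True)
    return ranking[:budget]


main = {"city": ["Paris", "Rome", "Oslo", "Rome"]}
tables = {
    "lake/a.parquet": {
        "town": ["Paris", "Rome", "Oslo", "Berlin"],
        "country": ["France", "Italy"],
    },
    "lake/b.csv": {"name": ["Paris", "Lima"]},
}
ranking = rank_candidates(main, ["city"], tables, budget=2)
```

In this toy example `town` contains every distinct city in the main table, so it comes first with containment 1.0, followed by `name`, which contains one city out of three.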
Some of the problems/things that should be considered for later:
*We may not be interested only in categorical columns: columns that hold integers may also work as join keys. Integers may cause trouble, though, because they can lead to false positives (for example, two unrelated ID columns that happen to share the same range of values).
Measuring the containment is an expensive operation that should be optimized well (but it can also be parallelized easily).
Loading all tables into memory may cause memory issues (more optimization needed).
The MultiAggJoiner could be used at the end to combine all the candidates into a big table.
Jaccard containment is only one metric; more profiling tools (maybe target-based) could be included.
We can start by providing a path, but later we might want to include bindings that connect directly to a DBMS.
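One possible direction for the memory concern above is to load one table at a time and keep only the k best scores seen so far. The sketch below is an illustration under assumptions, not the current implementation: `iter_scores`, `top_k_candidates`, and the `load_table` callable are hypothetical names, and tables are again plain dicts of lists. `heapq.nlargest` consumes the generator lazily, so at most one table and k result rows are held at once.

```python
import heapq
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Table = Dict[str, list]
Score = Tuple[float, str, str]  # (containment, path, key)


def iter_scores(
    query_values: Iterable,
    table_paths: Iterable[str],
    load_table: Callable[[str], Table],
) -> Iterator[Score]:
    """Yield one containment score per candidate column, table by table."""
    query_set = set(query_values)
    for path in table_paths:
        table = load_table(path)  # only this table is in memory right now
        for key, values in table.items():
            overlap = len(query_set & set(values))
            score = overlap / len(query_set) if query_set else 0.0
            yield (score, path, key)


def top_k_candidates(
    query_values: Iterable,
    table_paths: Iterable[str],
    load_table: Callable[[str], Table],
    k: int,
) -> List[Score]:
    """Keep only the k highest scores while streaming over the tables."""
    return heapq.nlargest(k, iter_scores(query_values, table_paths, load_table))


# Stand-in for reading files from disk: a dict lookup plays the loader.
fake_lake = {
    "a": {"x": ["p", "q", "r"]},
    "b": {"y": ["p"], "z": ["zzz"]},
    "c": {"w": ["p", "q"]},
}
best = top_k_candidates(["p", "q", "r"], ["a", "b", "c"], fake_lake.__getitem__, k=2)
```

Because each table is scored independently, the same generator could later be split across workers if the containment computation turns out to be the bottleneck.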
Notes on the current implementation:
The code is in _discover.py for now.
All the testing is missing (working on it). How do I set up testing for files that will be on a specific path?
I had to create read_parquet and read_csv dispatched functions, but I am having trouble because the only argument that they need is input_path, and the dispatcher can't use that to decide which library to use. For the time being, I am passing X just to give that information.
For now, the functions used by the Discover object are not in the class itself. Should I move them in? It's probably the better option.
I put TODOs for some of the important points.
The current code version is barebones, but it runs. Mostly, I am having trouble with integrating the code properly.
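On the testing question, one common approach is to build a tiny throwaway data lake inside a temporary directory. The sketch below assumes pytest, whose built-in `tmp_path` fixture hands each test a fresh `pathlib.Path`; `find_table_files` is a hypothetical stand-in for whatever the Discover object uses to walk the path, not a function from this PR.

```python
import csv
from pathlib import Path
from typing import List, Tuple


def find_table_files(
    root: Path, suffixes: Tuple[str, ...] = (".csv", ".parquet")
) -> List[Path]:
    """Hypothetical helper: list table files under root, including subfolders."""
    return sorted(p for p in Path(root).rglob("*") if p.suffix in suffixes)


def test_find_table_files(tmp_path):
    # Build a small directory tree that only exists for this test.
    (tmp_path / "sub").mkdir()
    with open(tmp_path / "sub" / "cities.csv", "w", newline="") as f:
        csv.writer(f).writerows([["town"], ["Paris"], ["Rome"]])
    (tmp_path / "notes.txt").write_text("not a table")

    found = find_table_files(tmp_path)

    # Only the CSV in the subfolder should be picked up.
    assert [p.name for p in found] == ["cities.csv"]
```

Under pytest the fixture is injected automatically and the directory is cleaned up afterwards; the same function can also be called by hand with any empty directory.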
Hello, this will be a long one!
The objective of this PR is adapting part of the pipeline from this paper: https://arxiv.org/abs/2402.06282 (repo https://github.com/rcap107/retrieve-merge-predict).
Main points: