megagonlabs / starmie

Resources for PVLDB 2023 submission

Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning

The overall architecture of Starmie

Requirements

Install requirements:

pip install -r requirements.txt

Datasets

Datasets for table union search:

WDC web tables:

Viznet: https://github.com/megagonlabs/sato/tree/master/table_data

Running the offline pre-training pipeline:

The main entry point is run_pretrain.py. Example command:

CUDA_VISIBLE_DEVICES=0 python run_pretrain.py \
  --task viznet \
  --batch_size 64 \
  --lr 5e-5 \
  --lm roberta \
  --n_epochs 3 \
  --max_len 128 \
  --size 10000 \
  --projector 768 \
  --save_model \
  --augment_op drop_col \
  --fp16 \
  --sample_meth head \
  --table_order column \
  --run_id 0

Hyperparameters:

Model Inference:

Run extractVectors.py. Example command:

python extractVectors.py \
  --benchmark santos \
  --table_order column \
  --run_id 0

Hyperparameters

Online processing

  1. Linear & Bounds: Run test_naive_search.py. Example scripts are in tus_cmd.sh and run_tus_all.py (for Slurm scheduling). Example command:
python test_naive_search.py \
  --encoder cl \
  --benchmark santos \
  --augment_op drop_col \
  --sample_meth tfidf_entity \
  --matching linear \
  --table_order column \
  --run_id 0 \
  --K 10 \
  --threshold 0.7

Hyperparameters

FOR ERROR ANALYSIS: bucket (bucket number between 0 and 5), analysis (either "col" for number of columns, "row" for number of rows, or "numeric" for percentage of numerical columns)

FOR SCALABILITY EXPERIMENTS: scal (the fraction of the data lake on which to compute the metric scores: 0.2, 0.4, 0.6, 0.8, or 1.0)

  2. LSH: Run test_lsh.py (example script: lsh_cmd.sh). Example command:
python test_lsh.py \
  --encoder cl \
  --benchmark santosLarge \
  --run_id 0 \
  --num_func 8 \
  --num_table 100 \
  --K 60 \
  --scal 1.0

Hyperparameters:

FOR SCALABILITY EXPERIMENTS: scal (the fraction of the data lake on which to compute the metric scores: 0.2, 0.4, 0.6, 0.8, or 1.0)

  3. HNSW: Run test_hnsw_search.py (example script: hnsw_cmd.sh). Example command:
python test_hnsw_search.py \
  --encoder cl \
  --benchmark santosLarge \
  --run_id 0 \
  --K 60 \
  --scal 1.0

Hyperparameters:

FOR SCALABILITY EXPERIMENTS: scal (the fraction of the data lake on which to compute the metric scores: 0.2, 0.4, 0.6, 0.8, or 1.0)
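For the scalability experiments, the listed scal values can be swept with a small driver. The sketch below is not part of the repository: it only builds and prints one command per fraction, reusing the flags from the HNSW example command. The names HNSW_BASE, SCAL_VALUES, and build_commands are hypothetical helpers introduced here for illustration.

```python
import shlex

# Flags copied from the HNSW example command in this README (minus --scal).
HNSW_BASE = [
    "python", "test_hnsw_search.py",
    "--encoder", "cl",
    "--benchmark", "santosLarge",
    "--run_id", "0",
    "--K", "60",
]

# Data-lake fractions listed for the scalability experiments.
SCAL_VALUES = (0.2, 0.4, 0.6, 0.8, 1.0)


def build_commands(base_args=HNSW_BASE, scal_values=SCAL_VALUES):
    """Return one argv list per data-lake fraction."""
    return [list(base_args) + ["--scal", str(s)] for s in scal_values]


if __name__ == "__main__":
    # Dry run: print the commands only. To execute them, pass each argv
    # list to subprocess.run(cmd, check=True) from the repository root.
    for cmd in build_commands():
        print(shlex.join(cmd))
```

The same pattern should carry over to test_naive_search.py and test_lsh.py by swapping in the flags from their example commands, since both list scal as a hyperparameter.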

Data discovery for ML tasks:

Run discovery.py. We assume:

  1. A model checkpoint in results/viznet/model_drop_col_head_column_0.pt
  2. The viznet dataset in data/viznet/
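Before running, it may help to confirm that both prerequisites exist. The sketch below assumes the checkpoint name is composed from the pre-training flags as model_{augment_op}_{sample_meth}_{table_order}_{run_id}.pt. This pattern is inferred from the single example path above and is not a documented contract; checkpoint_path is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path


def checkpoint_path(task, augment_op, sample_meth, table_order, run_id,
                    logdir="results"):
    """Compose a checkpoint path from the pre-training flags.

    Assumption: the naming pattern is inferred from the example path
    results/viznet/model_drop_col_head_column_0.pt in this README.
    """
    name = f"model_{augment_op}_{sample_meth}_{table_order}_{run_id}.pt"
    return Path(logdir) / task / name


if __name__ == "__main__":
    ckpt = checkpoint_path("viznet", "drop_col", "head", "column", 0)
    data_dir = Path("data") / "viznet"
    # Report which of the two assumed prerequisites are present.
    for required in (ckpt, data_dir):
        status = "found" if required.exists() else "MISSING"
        print(f"{status}: {required}")
```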

Run the script with:

python discovery.py

The script prints the MSE for NoJoin, contrastive learning (CL), Jaccard, and Overlap. The joined tables are written to pickle files named none_joined_tables.pkl, cl_joined_tables.pkl, jaccard_joined_tables.pkl, and overlap_joined_tables.pkl.
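To inspect the pickled outputs afterwards, something like the following may be enough. This is a minimal sketch: the README does not describe the structure of the pickled objects, so the code only reports each object's type and length. load_joined_tables and JOINED_TABLE_FILES are hypothetical names introduced here.

```python
import pickle
from pathlib import Path

# Output file names as listed in this README, keyed by method.
JOINED_TABLE_FILES = {
    "NoJoin": "none_joined_tables.pkl",
    "CL": "cl_joined_tables.pkl",
    "Jaccard": "jaccard_joined_tables.pkl",
    "Overlap": "overlap_joined_tables.pkl",
}


def load_joined_tables(directory="."):
    """Return {method: unpickled object} for each output file that exists."""
    loaded = {}
    for method, filename in JOINED_TABLE_FILES.items():
        path = Path(directory) / filename
        if path.is_file():
            with path.open("rb") as f:
                loaded[method] = pickle.load(f)
    return loaded


if __name__ == "__main__":
    # The object layout is unknown here, so only summarize what was loaded.
    for method, obj in load_joined_tables().items():
        size = len(obj) if hasattr(obj, "__len__") else "n/a"
        print(f"{method}: type={type(obj).__name__}, len={size}")
```

Only unpickle files you produced yourself, since pickle.load can execute arbitrary code. Any library whose objects were pickled must also be importable in the environment that loads them.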

Column clustering:

See Line 273 and Line 128 of the file sdd/pretrain.py. To run column clustering, you can run a sequence of commands (remember to check the file paths):

CUDA_VISIBLE_DEVICES=7 python run_pretrain.py \
  --task viznet \
  --batch_size 64 \
  --lr 5e-5 \
  --lm roberta \
  --n_epochs 3 \
  --max_len 128 \
  --size 10000 \
  --projector 768 \
  --save_model \
  --augment_op drop_col \
  --fp16 \
  --sample_meth head \
  --table_order column \
  --run_id 0

Copy the clustering results:

cp *.pkl data/viznet/multi_column

Each run will pre-train the models on 10k viznet tables and cluster all the columns. The clustering results will be stored at data/viznet/multi_column/clusters.pkl and data/viznet/single_column/.

To view the clusters, you can use the jupyter notebook in notebook/offline.ipynb. Running the last cell should print out some clusters like

artist ---- 1. I Don't Give A ...; 2. I'm The Kinda; 3. I U She; 4. Kick It [featuring Iggy Pop]; 5. Operate
artist ---- 1. Spoken Intro; 2. The Court; 3. Maze; 4. Girl Talk; 5. A La Mode
artist ---- 1. Street Fighting Man; 2. Gimme Shelter; 3. (I Can't Get No) Satisfaction; 4. The Last Time; 5. Jumpin' Jack Flash
…
---------------------------------
type ---- Emerson Elementary School; Banneker Elementary School; Silver City Elementary School; New Stanley Elementary School; Frances Willard Elementary School
type ---- Choctawhatchee Senior High School; Fort Walton Beach High School; Ami Kids Emerald Coast; Gulf Coast Christian School; Adolescent Substance Abuse
city ---- Chilton; Stoughton
…
---------------------------------
description ---- Fri Sep 11,2015 3:30 PM (CST); Fri Sep 11,2015 6:00 PM (CST); Sat Sep 12,2015 10:00 AM (CST); Sat Sep 12,2015 12:00 PM (CST); Sat Sep 12,2015 5:30 PM (CST)
day ---- Sept. 1; Sept. 7; Sept. 22; Sept. 29; Oct. 5
description ---- Fri Sep 11,2015 3:30 PM (CST); Fri Sep 11,2015 6:00 PM (CST); Sat Sep 12,2015 10:00 AM (CST); Sat Sep 12,2015 12:00 PM (CST); Sat Sep 12,2015 5:30 PM (CST)
…
address ---- 1721 Papillon St, North Port FL; 4113 Wabasso Ave, North Port FL; 3681 Wayward Ave, North Port FL; 1118 N Salford Blvd, North Port FL; 2057 Bendix Ter, North Port FL
address ---- 5 Brand Rd, Toms River NJ; 40 12th St, Toms River NJ; 75 Sea Breeze Rd, Toms River NJ; 98 Oak Tree Ln, Toms River NJ; 67 16th St, Toms River NJ
address ---- 652 Martha St, Montgomery AL; 3184 Lexington Rd, Montgomery AL; 120 S Lewis St, Montgomery AL; 1812 W 2nd St #OP, Montgomery AL; 3582 Southview Ave, Montgomery AL
---------------------------------

Citation

If you are using the code in this repo, please cite the following in your work:

@article{DBLP:journals/pvldb/FanWLZM23,
  author       = {Grace Fan and
                  Jin Wang and
                  Yuliang Li and
                  Dan Zhang and
                  Ren{\'{e}}e J. Miller},
  title        = {Semantics-aware Dataset Discovery from Data Lakes with Contextualized
                  Column-based Representation Learning},
  journal      = {Proc. {VLDB} Endow.},
  volume       = {16},
  number       = {7},
  pages        = {1726--1739},
  year         = {2023}
}

Disclosure

Embedded in, or bundled with, this product are open source software (OSS) components, datasets and other third party components identified below. The license terms respectively governing the datasets and third-party components continue to govern those portions, and you agree to those license terms, which, when applicable, specifically limit any distribution. You may receive a copy of, distribute and/or modify any open source code for the OSS component under the terms of their respective licenses. In the event of conflicts between Megagon Labs, Inc. Recruit Co., Ltd., license conditions and the Open Source Software license conditions, the Open Source Software conditions shall prevail with respect to the Open Source Software portions of the software. You agree not to, and are not permitted to, distribute actual datasets used with the OSS components listed below. You agree and are limited to distribute only links to datasets from known sources by listing them in the datasets overview table below. You are permitted to distribute derived datasets of data sets from known sources by including links to original dataset source in the datasets overview table below. You agree that any right to modify datasets originating from parties other than Megagon Labs, Inc. are governed by the respective third party’s license conditions. All OSS components and datasets are distributed WITHOUT ANY WARRANTY, without even implied warranty such as for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE, and without any liability to or claim against any Megagon Labs, Inc. entity other than as explicitly documented in this README document. You agree to cease using any part of the provided materials if you do not agree with the terms or the lack of any warranty herein. While Megagon Labs, Inc., makes commercially reasonable efforts to ensure that citations in this document are complete and accurate, errors may occur. 
If you see any error or omission, please help us improve this document by sending information to contact_oss@megagon.ai.

Contact

If you have any questions about the code or the paper, please contact Grace Fan directly (fan.gr@northeastern.edu).