snap-stanford / relbench

RelBench: Relational Deep Learning Benchmark
https://relbench.stanford.edu
MIT License

raw and processed data upload and download logic #13

Closed rishabh-ranjan closed 9 months ago

rishabh-ranjan commented 10 months ago

Two tasks here:

Details for the first part.

Currently, in class Dataset we have def __init__(self, root).

This should become def __init__(self, root, download=False, use_raw=False).

The intended semantics are:

This logic needs to be implemented. The above is only a suggestion; the implementer can make changes.

Details for the second part.

Currently, rtb/datasets/product.py has ProductDataset, which implements the processing of the Books subset of the Amazon reviews dataset. To get a working version of the processed dataset, follow these steps:

  1. Download and gunzip Books.json and meta_Books.json from https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/categoryFiles/Books.json.gz and https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/metaFiles2/meta_Books.json.gz respectively.
  2. Store them as data/rtb-product/raw/Books.json and data/rtb-product/raw/meta_Books.json, then touch data/rtb-product/raw/done.
  3. Load in Python like this:
    from rtb.datasets.product import ProductDataset
    ds = ProductDataset(root="data/") # this will process, save as parquet and load from parquet
    print(ds._db) # this is the full Database object
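
Steps 1 and 2 above could be scripted roughly as follows. This is only a sketch, not the download logic the issue asks for: prepare_raw is a hypothetical helper name, the skip-if-present check and the temporary .part file are my own assumptions, and only the URLs and paths come from the steps above.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

# URLs copied from step 1 above.
RAW_URLS = {
    "Books.json": "https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/categoryFiles/Books.json.gz",
    "meta_Books.json": "https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/metaFiles2/meta_Books.json.gz",
}


def prepare_raw(root, urls=RAW_URLS):
    """Hypothetical helper: download and gunzip each file into
    <root>/rtb-product/raw/, then create the `done` marker file."""
    raw_dir = Path(root) / "rtb-product" / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    for name, url in urls.items():
        target = raw_dir / name
        if target.exists():
            continue  # assumption: an existing file is complete
        # Write to a temporary name first so an interrupted download
        # never leaves a truncated file under the final name.
        tmp = target.with_name(target.name + ".part")
        with urllib.request.urlopen(url) as resp, \
                gzip.GzipFile(fileobj=resp, mode="rb") as gz, \
                open(tmp, "wb") as out:
            shutil.copyfileobj(gz, out)  # decompress while streaming
        tmp.replace(target)
    (raw_dir / "done").touch()  # marker from step 2
    return raw_dir


# Usage (downloads several large files, so it is left commented out):
# prepare_raw("data/")
```

Decompressing while streaming avoids keeping the .gz archives on disk; if the raw archives should be kept, download to a file first and gunzip afterwards.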

The parquet files at data/rtb-product/processed/db/ need to be uploaded and the corresponding download logic implemented.

rishabh-ranjan commented 10 months ago

Hi @rusty1s, can you take care of this? Otherwise, happy to do it myself when I get time. Not too urgent.

rishabh-ranjan commented 9 months ago

added backbone logic for this in #24. still need to test/implement some download/unzip parts

rishabh-ranjan commented 9 months ago

closing in #53