raw and processed data upload and download logic

rishabh-ranjan commented 10 months ago

Two tasks here:

[ ] Implement the raw vs processed download logic in the Dataset class. Optionally add upload logic.
[ ] Upload the raw and processed versions of the Amazon Products dataset, and complete the corresponding download/upload code.

Details for the first part.

Currently, in class Dataset we have def __init__(self, root).

This should become def __init__(self, root, download=False, use_raw=False).

The intended semantics are:

if download=True, the dataset is downloaded even if the download directories already exist locally (this allows refreshing a dataset that may be corrupted).
if download=False, the dataset must already exist locally; otherwise an error is raised.
if use_raw=True, the raw files are downloaded into raw/ (if required or download=True), processed using self.process, and finally loaded.
if use_raw=False, the processed files are downloaded into processed/ (if required or download=True) and loaded.

This logic needs to be implemented. The semantics above are only a suggestion; the implementer can make changes.
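A minimal sketch of how these semantics could fit together, not the actual implementation. It takes the strict reading in which download=False never downloads, and it assumes a done marker file (as used for raw/ later in this issue) signals a complete directory for both raw/ and processed/. The name attribute, the download and load hooks, and the db attribute are hypothetical names introduced only for illustration.

```python
import shutil
from pathlib import Path


class Dataset:
    """Sketch of the proposed download / use_raw semantics (illustrative only)."""

    name = "example"  # hypothetical per-dataset directory name

    def __init__(self, root, download=False, use_raw=False):
        # use_raw selects which variant is fetched and read: raw/ or processed/.
        kind = "raw" if use_raw else "processed"
        self.dir = Path(root) / self.name / kind
        marker = self.dir / "done"  # assumed completeness marker

        if download:
            # Always refresh, even if a (possibly corrupted) local copy exists.
            shutil.rmtree(self.dir, ignore_errors=True)
            self.dir.mkdir(parents=True)
            self.download(kind, self.dir)
            marker.touch()
        elif not marker.exists():
            raise FileNotFoundError(
                f"{self.dir} is missing or incomplete; pass download=True"
            )

        # Raw files go through self.process; processed files are loaded directly.
        self.db = self.process(self.dir) if use_raw else self.load(self.dir)

    # Hypothetical hooks that a concrete dataset would override.
    def download(self, kind, target):
        raise NotImplementedError

    def process(self, raw_dir):
        raise NotImplementedError

    def load(self, processed_dir):
        raise NotImplementedError
```

In this sketch process returns the loaded object directly; saving the processed output (e.g. as parquet) before loading it is left out for brevity.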

Details for the second part.

Currently, rtb/datasets/product.py has ProductDataset, which implements the processing of the Books subset of the Amazon reviews dataset. To get a working version of the processed dataset, follow these steps:

download and gunzip Books.json and meta_Books.json from https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/categoryFiles/Books.json.gz and https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/metaFiles2/meta_Books.json.gz respectively.
store in data/rtb-product/raw/Books.json, data/rtb-product/raw/meta_Books.json and touch data/rtb-product/raw/done.
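The two manual steps above could be scripted roughly as follows. This is a sketch using only the standard library; fetch_raw and RAW_URLS are hypothetical names, not part of rtb. The URLs and target file names are the ones given in the steps.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

BASE = "https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2"
# Target file name -> source URL, as listed in the steps above.
RAW_URLS = {
    "Books.json": f"{BASE}/categoryFiles/Books.json.gz",
    "meta_Books.json": f"{BASE}/metaFiles2/meta_Books.json.gz",
}


def fetch_raw(raw_dir, urls=RAW_URLS):
    """Download each .gz archive, gunzip it into raw_dir, then touch 'done'."""
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    for name, url in urls.items():
        gz_path = raw_dir / (name + ".gz")
        urllib.request.urlretrieve(url, gz_path)
        # Stream-decompress so the whole file is never held in memory.
        with gzip.open(gz_path, "rb") as src, open(raw_dir / name, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()  # keep only the decompressed file
    # Marker file, equivalent to `touch data/rtb-product/raw/done`.
    (raw_dir / "done").touch()
```

Calling fetch_raw("data/rtb-product/raw") would reproduce the layout described above. The function is not invoked at import time, since the archives may be large.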

Load in Python like this:

from rtb.datasets.product import ProductDataset
ds = ProductDataset(root="data/") # this will process, save as parquet and load from parquet
print(ds._db) # this is the full Database object

The parquet files at data/rtb-product/processed/db/ need to be uploaded and the corresponding download logic implemented.

rishabh-ranjan commented 10 months ago

Hi @rusty1s, can you take care of this? Otherwise, happy to do myself too when I get time. Not too urgent.

rishabh-ranjan commented 9 months ago

added backbone logic for this in #24. still need to test/implement some download/unzip parts

rishabh-ranjan commented 9 months ago

closing in #53

snap-stanford / relbench

raw and processed data upload and download logic #13