Closed rishabh-ranjan closed 9 months ago
Hi @rusty1s, can you take care of this? Otherwise, I'm happy to do it myself when I get time. Not too urgent.
added backbone logic for this in #24. still need to test/implement some download/unzip parts
closing in #53
Two tasks here:
`Dataset` class. Optionally add upload logic.
Details for the first part.
Currently, in `class Dataset` we have `def __init__(self, root)`. This should become `def __init__(self, root, download=False, use_raw=False)`. The intended semantics are:
- If `download=True`, then the dataset will be downloaded irrespective of whether the downloaded dirs exist locally (this allows refreshing datasets under possible corruption).
- If `download=False`, then the dataset must exist locally, or an error is thrown.
- If `use_raw=True`, the raw files are to be downloaded into `raw/` (if required or `download=True`), processed using `self.process`, and finally loaded.
- If `use_raw=False`, the processed files are to be downloaded into `processed/` (if required or `download=True`) and loaded.
This logic needs to be implemented. This is only a suggestion; the implementer can make changes.
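A minimal sketch of how the proposed constructor could look, purely as an illustration and not the actual rtb code. It assumes the stricter reading of the semantics (with `download=False`, missing data is an error rather than a trigger for a download). The `_download` and `load` hooks are hypothetical names introduced here; only `process` is mentioned in this issue.

```python
import shutil
from pathlib import Path


class Dataset:
    """Illustrative sketch of the proposed constructor semantics."""

    def __init__(self, root, download=False, use_raw=False):
        self.root = Path(root)
        # use_raw selects which directory is fetched: raw/ or processed/.
        subdir = "raw" if use_raw else "processed"
        target = self.root / subdir

        if download:
            # Always refresh, even if a local copy exists (guards against corruption).
            if target.exists():
                shutil.rmtree(target)
            self._download(subdir, target)
        elif not target.exists():
            # download=False: the data must already be present locally.
            raise FileNotFoundError(
                f"{target} not found; pass download=True to fetch it"
            )

        if use_raw:
            # Raw files are processed first, then loaded.
            self.process()
        self.load()

    def _download(self, subdir, target):
        # Hypothetical hook: a real version would fetch and unzip files into target.
        raise NotImplementedError

    def process(self):
        raise NotImplementedError

    def load(self):
        # Hypothetical hook for reading the processed files into memory.
        raise NotImplementedError
```

Under this sketch, `Dataset(root, download=True, use_raw=True)` would refresh `raw/`, call `self.process`, and then load, while `Dataset(root)` would only load an existing `processed/` directory or raise.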
Details for the second part.
Currently `rtb/datasets/product.py` has `ProductDataset`, which implements the processing of the Book subset of the Amazon reviews dataset. To get a working version of the processed dataset, follow these steps: put the raw files at `data/rtb-product/raw/Books.json` and `data/rtb-product/raw/meta_Books.json`, and `touch data/rtb-product/raw/done`.
The parquet files at `data/rtb-product/processed/db/` need to be uploaded and the corresponding download logic implemented.
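One possible shape for the download/unzip part, as a rough sketch. It assumes the uploaded files are packaged as a single zip archive reachable at some URL; neither the packaging nor the hosting location is decided in this issue, and `download_and_unzip` is a hypothetical helper name.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def download_and_unzip(url, dest_dir):
    """Fetch a zip archive from url and extract it into dest_dir.

    Any existing dest_dir is removed first, so a corrupted copy is replaced.
    """
    dest_dir = Path(dest_dir)
    if dest_dir.exists():
        shutil.rmtree(dest_dir)
    dest_dir.mkdir(parents=True)

    # Download the archive next to its final location, then unpack it.
    archive = dest_dir / "download.zip"
    urllib.request.urlretrieve(url, str(archive))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)

    # The archive itself is not needed once extracted.
    archive.unlink()
    return dest_dir
```

A helper like this could back the download hook of `Dataset`, with the URL chosen per dataset and per `raw/` or `processed/` directory.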