uw-mad-dash / bagpipe

Code for reproducing results for SOSP paper Bagpipe
MIT License

Unable to get `kaggle_criteo_weekly.txt` #1

Open laekov opened 6 months ago

laekov commented 6 months ago

Hi. I read your paper and find your ideas interesting. Thank you for opening your source code.

However, when I try to run the Oracle Cacher, I cannot find any instructions on how to obtain the kaggle_criteo_weekly.txt file that --processed-csv requires. Could you please tell me how to generate that file from the Criteo Kaggle or Terabyte datasets?

Also, I saw that your CSVLoader uses orjson to parse every line, so I am confused about whether the input is actually a CSV file or a JSONL file. (I have both CSV and npz versions of the Terabyte dataset, but neither of them seems to work for that argument.)

iidsample commented 6 months ago

Hey, we use split JSONL files with pre-processing enabled. Each line is one example from the dataset. An example line: {"label":1.0,"dense":[2.5649492740631104,3.044522523880005,1.3862943649291992,1.3862943649291992,1.0986123085021973,1.3862943649291992,2.70805025100708,3.7841897010803223,3.8712010383605957,1.0986123085021973,1.3862943649291992,0.0,1.0986123085021973],"sparse":[20,201,3138,2411,0,1,735,1,0,696,153,3017,145,2,2955,2749,0,1585,0,3,2888,0,1,1581,4,335]}
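For anyone checking their own files against this format, here is a minimal sketch of parsing and validating one line. It uses the standard-library `json` module to stay self-contained; `orjson.loads` should be a drop-in replacement for `json.loads` here. The function name `parse_example` is hypothetical, and the expected field counts (13 dense, 26 sparse) are an assumption read off the example line above, not something taken from the Bagpipe code.

```python
import json

# The example line quoted in the comment above, verbatim.
EXAMPLE_LINE = (
    '{"label":1.0,"dense":[2.5649492740631104,3.044522523880005,'
    '1.3862943649291992,1.3862943649291992,1.0986123085021973,'
    '1.3862943649291992,2.70805025100708,3.7841897010803223,'
    '3.8712010383605957,1.0986123085021973,1.3862943649291992,0.0,'
    '1.0986123085021973],"sparse":[20,201,3138,2411,0,1,735,1,0,696,153,'
    '3017,145,2,2955,2749,0,1585,0,3,2888,0,1,1581,4,335]}'
)

# Assumption: field counts inferred from the example line.
NUM_DENSE = 13
NUM_SPARSE = 26


def parse_example(line):
    """Parse one JSONL line into (label, dense, sparse) and check its shape.

    Hypothetical helper for illustration; not part of Bagpipe.
    """
    record = json.loads(line)
    label = float(record["label"])
    dense = [float(x) for x in record["dense"]]
    sparse = [int(x) for x in record["sparse"]]
    if len(dense) != NUM_DENSE:
        raise ValueError(f"expected {NUM_DENSE} dense values, got {len(dense)}")
    if len(sparse) != NUM_SPARSE:
        raise ValueError(f"expected {NUM_SPARSE} sparse values, got {len(sparse)}")
    return label, dense, sparse


label, dense, sparse = parse_example(EXAMPLE_LINE)
```

Running the same check over the first few lines of a candidate file is a quick way to tell a JSONL file in this layout apart from a plain CSV.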

laekov commented 6 months ago

Should the sparse features be converted from the 32-bit hex IDs to contiguous indices? (Similar to the day_X_processed.npz files for TorchRec.)

iidsample commented 6 months ago

I have forgotten what TorchRec needs. We convert the hex IDs to integers, assigning each unique ID its own unique integer. I am happy to share pre-processed data if it helps you.
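The conversion described here could look roughly like the sketch below. It is not the authors' script, and several details are assumptions: the input is taken to be the raw Criteo tab-separated layout (label, 13 integer features, 26 hex categorical features, with missing values left empty); each sparse column gets its own vocabulary, numbered in first-seen order; and dense values are transformed with log(x + 1), which is only inferred from the example line (2.5649... equals ln 13). The names `HexIdEncoder` and `raw_line_to_jsonl` are hypothetical.

```python
import json
import math

NUM_DENSE = 13   # integer feature columns in raw Criteo data
NUM_SPARSE = 26  # hex-encoded categorical columns in raw Criteo data


class HexIdEncoder:
    """Assigns each distinct hex ID a unique integer, per column.

    Assumption: one vocabulary per sparse column, IDs numbered from 0
    in the order they are first seen. An empty (missing) value is
    treated as just another ID.
    """

    def __init__(self):
        self.vocabs = [dict() for _ in range(NUM_SPARSE)]

    def encode(self, column, hex_id):
        vocab = self.vocabs[column]
        if hex_id not in vocab:
            vocab[hex_id] = len(vocab)
        return vocab[hex_id]


def raw_line_to_jsonl(line, encoder):
    """Convert one raw tab-separated line into a JSONL record string."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 1 + NUM_DENSE + NUM_SPARSE:
        raise ValueError(f"expected 40 fields, got {len(fields)}")
    label = float(fields[0])
    # Assumption: log(x + 1) transform; missing values become 0.0 and
    # negative values are clamped to 0 before the transform.
    dense = [
        math.log1p(max(int(v), 0)) if v else 0.0
        for v in fields[1:1 + NUM_DENSE]
    ]
    sparse = [
        encoder.encode(col, v)
        for col, v in enumerate(fields[1 + NUM_DENSE:])
    ]
    record = {"label": label, "dense": dense, "sparse": sparse}
    return json.dumps(record, separators=(",", ":"))


# Two made-up raw lines that differ only in the first sparse column.
encoder = HexIdEncoder()
dense_part = "\t".join(["12"] + [""] * 12)
line_a = "1\t" + dense_part + "\t" + "\t".join(["68fd1e64"] * 26)
line_b = "0\t" + dense_part + "\t" + "\t".join(["05db9164"] + ["68fd1e64"] * 25)
out_a = raw_line_to_jsonl(line_a, encoder)
out_b = raw_line_to_jsonl(line_b, encoder)
```

Because IDs depend on the order in which values are first seen, the same encoder instance has to be reused across every split of the dataset, otherwise the integer IDs will not agree between files.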

laekov commented 6 months ago

> I am happy to share pre-processed data if it helps you.

Sure. That would be great!

It would also help if you could share the script you use to create the JSONL files from the npz files or the raw dataset.

iidsample commented 6 months ago

Okay, share your email and I can send you a link to download the data.

laekov commented 6 months ago

My email is 'laekov.h@gmail.com'

Thanks

iidsample commented 6 months ago

I have shared the data. Use the folder I shared with you in place of the processed CSV file.

laekov commented 6 months ago

Got it. I will have a look. Thank you for your help.