Open laekov opened 6 months ago
Hey, we use split JSONL files with pre-processing enabled. Each line is one example of the dataset. An example line is: {"label":1.0,"dense":[2.5649492740631104,3.044522523880005,1.3862943649291992,1.3862943649291992,1.0986123085021973,1.3862943649291992,2.70805025100708,3.7841897010803223,3.8712010383605957,1.0986123085021973,1.3862943649291992,0.0,1.0986123085021973],"sparse":[20,201,3138,2411,0,1,735,1,0,696,153,3017,145,2,2955,2749,0,1585,0,3,2888,0,1,1581,4,335]}
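For reference, a line in this format can be parsed with the standard `json` module (the repo's loader reportedly uses `orjson`, but the record layout is the same either way). This is only a sketch of the schema described above, not code from the repository:

```python
import json

# One line of the split JSONL file: a click label, log-transformed
# dense features, and sparse categorical feature indices.
# (Shortened here for readability.)
line = ('{"label":1.0,'
        '"dense":[2.5649492740631104,3.044522523880005],'
        '"sparse":[20,201,3138]}')

example = json.loads(line)
label = example["label"]    # 1.0 means the ad was clicked
dense = example["dense"]    # list of floats (log-scaled counts)
sparse = example["sparse"]  # list of ints (category indices)

print(label, len(dense), len(sparse))
```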
Should the sparse features be converted from the 32-bit hex IDs to contiguous indices (similar to the `day_X_processed.npz` files for TorchRec)?
I have forgotten exactly what TorchRec needs. We convert the hex IDs to integers, where each unique ID is assigned a unique integer. I am happy to share pre-processed data if it helps you.
Sure. That would be great!
It will also help if you can share with me your script to create the JSONL from npz or the raw dataset.
Okay, share your email and I can send you a link to download the data.
My email is 'laekov.h@gmail.com'
Thanks
I have shared the data file. Replace the processed CSV file with the folder I shared with you.
Got it. I will have a look. Thank you for your help!
Hi. I read your paper and found your ideas interesting. Thank you for open-sourcing your code.

However, when I try to run the Oracle Cacher, I cannot find any indication of how to get the `kaggle_criteo_weekly.txt` file that is required by `--processed-csv`. Can you please give me some instructions on how to generate that file from the Criteo Kaggle or Terabyte datasets?

Also, I saw that your `CSVLoader` uses `orjson` to parse every line, so I am confused about whether it is actually a CSV file or a JSONL file. (I have both CSV and npz versions of the Terabyte dataset, but neither of them seems to work for that argument.)