Closed (sdfsfdx closed this issue 2 years ago)
@sdfsfdx Thanks for raising the inconsistency question. Yes, this is indeed a problem with our documentation. Fixing it takes considerable effort, and we are still working on it. We will make an update this week and respond to you then.
BTW, we found that the data splits may not be consistent even when the random seed is fixed. Could you please check the md5sum values of your train.csv, valid.csv, and test.csv? Thanks!
@Tesla-1i Thanks. The data format in this example is '*.csv'; I will follow it shortly.
@zhujiem I executed 'process.py' to split the Avazu dataset on the day I opened this issue, and the md5sum values of the three generated files are as follows:
2fa3064a7b7b9d6d4f0d77aa65d6998f test.csv
00130fdcd6737fdcce778c8000357590 train.csv
75fbb640b9d1d88b64bd55579130f352 valid.csv
@sdfsfdx Your split is indeed different from ours. We will update the preprocessing code this week to ensure consistency. For now, could you follow Solution#2 below to get the split data?
To convert the csv data to h5, you can first run the LR benchmark (LR is configured with csv files). This generates the h5 data, which can then be reused in other model configurations. See https://github.com/openbenchmark/Open-CTR-Benchmark/blob/master/benchmarks/LR/LR_Avazu_x4_001.md
Dataset description
This dataset contains about 10 days of labeled click-through data on mobile advertisements. It has 22 feature fields, including user features and advertisement attributes. Following the same setting as the AutoInt work, we randomly split the data 8:1:1 into the training set, validation set, and test set, respectively. For better reproducibility, we directly reuse the code provided by AutoInt and control the random seed (i.e., seed=2018) for splitting. The preprocessed data are accessible from the BARS benchmark.
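As a rough illustration of what a seeded 8:1:1 random split looks like, here is a minimal sketch in plain Python. It is not the AutoInt code or split_avazu_x4.py, and it will not reproduce the md5sum values quoted in this thread; the function name `split_indices` and its parameters are hypothetical. It also hints at why a fixed seed alone may not be enough: the result depends on the input row order and on the random number generator being used.

```python
import random


def split_indices(n_rows, seed=2018, ratios=(0.8, 0.1, 0.1)):
    """Shuffle row indices with a fixed seed and cut them into three parts.

    Hypothetical helper for illustration only; the real preprocessing
    script may use a different procedure.
    """
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(indices)

    n_train = int(n_rows * ratios[0])
    n_valid = int(n_rows * ratios[1])

    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


if __name__ == "__main__":
    train, valid, test = split_indices(100)
    print(len(train), len(valid), len(test))
```

With the same seed, the same row count, and the same Python random implementation, the three index lists are identical across runs. Changing the row order of the input file changes which rows land in each part, even though the indices are the same.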
How to get the dataset?
$ cd datasets/Avazu/Avazu_x4
$ python split_avazu_x4.py
$ md5sum train.csv valid.csv test.csv
de3a27264cdabf66adf09df82328ccaa train.csv
33232931d84d6452d3f956e936cab2c9 valid.csv
3ebb774a9ca74d05919b84a3d402986d test.csv
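If md5sum is not available (for example on Windows), the same check can be sketched in Python with the standard hashlib module. The reference values below are the ones quoted above; the helper names `md5_of_file` and `check_split` are hypothetical, and the script assumes the three csv files sit in the directory you pass in.

```python
import hashlib
from pathlib import Path

# Reference checksums quoted in this thread for the Avazu_x4 split.
EXPECTED_MD5 = {
    "train.csv": "de3a27264cdabf66adf09df82328ccaa",
    "valid.csv": "33232931d84d6452d3f956e936cab2c9",
    "test.csv": "3ebb774a9ca74d05919b84a3d402986d",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_split(data_dir, expected=EXPECTED_MD5):
    """Return {filename: bool}; a missing file counts as a mismatch."""
    results = {}
    for name, want in expected.items():
        path = Path(data_dir) / name
        results[name] = path.is_file() and md5_of_file(path) == want
    return results


if __name__ == "__main__":
    for name, ok in check_split(".").items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

Reading in chunks keeps memory use flat, which matters because the Avazu csv files are large.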
@zhujiem Thank you for your help and patience! I followed Solution#2 and got the preprocessed dataset with the same md5sum values.
You solved my problem, so I am closing this issue. Thanks!
I followed the instructions on this page: https://github.com/openbenchmark/Open-CTR-Benchmark/blob/master/benchmarks/AFM/AFM_Avazu_x4_001.md, but I found a data format mismatch.
In the data preprocessing script: https://github.com/openbenchmark/Open-CTR-Benchmark/blob/master/datasets/Avazu/Avazu_x4/split_avazu_x4.py, the format is '*.csv'.
But in the running config: https://github.com/openbenchmark/Open-CTR-Benchmark/blob/master/benchmarks/AFM/AFM_avazu_x4_tuner_config_01.yaml, the format is '*.h5'.