Data format mismatch - Githubissues

sdfsfdx commented 2 years ago

I followed the instructions at https://github.com/openbenchmark/Open-CTR-Benchmark/blob/master/benchmarks/AFM/AFM_Avazu_x4_001.md, but I found a data format mismatch.

In the data preprocessing script, https://github.com/openbenchmark/Open-CTR-Benchmark/blob/master/datasets/Avazu/Avazu_x4/split_avazu_x4.py, the output format is '.csv'.

But in the running config, https://github.com/openbenchmark/Open-CTR-Benchmark/blob/master/benchmarks/AFM/AFM_avazu_x4_tuner_config_01.yaml, the expected format is '*.h5'.

Tesla-1i commented 2 years ago

Please follow https://github.com/huawei-noah/benchmark/tree/main/FuxiCTR/README.md

zhujiem commented 2 years ago

@sdfsfdx Thanks for raising this inconsistency. Yes, it is indeed a problem in our documentation. Fixing it takes considerable effort, and we are still working on it. We will make an update this week and respond to you then.

BTW, we found that the data splits may not be consistent even though the random seed has been fixed. Would you please help check the md5sum values of your train.csv, valid.csv, and test.csv? Thanks!

sdfsfdx commented 2 years ago

@Tesla-1i Thanks. The data format in this example is '*.csv'; I will follow it a bit later.

sdfsfdx commented 2 years ago

@zhujiem I executed 'process.py' to split the Avazu dataset on the day I opened this issue, and the md5sum values of the three generated files are as follows:

2fa3064a7b7b9d6d4f0d77aa65d6998f  test.csv
00130fdcd6737fdcce778c8000357590  train.csv
75fbb640b9d1d88b64bd55579130f352  valid.csv

zhujiem commented 2 years ago

@sdfsfdx Your split is indeed different from ours. We will update the preprocessing code this week to ensure consistency. For now, could you follow Solution#2 below to get the split data?

To convert csv to h5 data, you can first run the LR benchmark (LR is configured with csv files), which will generate the h5 data. The h5 data can then be reused in other model configurations. See https://github.com/openbenchmark/Open-CTR-Benchmark/blob/master/benchmarks/LR/LR_Avazu_x4_001.md

Avazu_x4

Dataset description

This dataset contains about 10 days of labeled click-through data on mobile advertisements. It has 22 feature fields, including user features and advertisement attributes. Following the same setting as the AutoInt work, we split the data randomly into 8:1:1 as the training set, validation set, and test set, respectively. For better reproducibility, we directly reuse the code provided by AutoInt and control the random seed (i.e., seed=2018) for splitting. The preprocessed data are accessible from the BARS benchmark.
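As a rough illustration of the seeded 8:1:1 split described above, here is a minimal sketch using only the Python standard library. It is not the AutoInt code or `split_avazu_x4.py`; the function name `split_indices` and its parameters are hypothetical. Note that a fixed seed reproduces the same split only when the random number generator, its version, and the input row order are also the same, which is one plausible reason two runs can still disagree.

```python
import random


def split_indices(n_rows, seed=2018):
    """Shuffle row indices with a fixed seed and cut them 8:1:1.

    Hypothetical helper for illustration only; the real benchmark
    reuses the AutoInt splitting code.
    """
    rng = random.Random(seed)      # private generator, no global state
    indices = list(range(n_rows))
    rng.shuffle(indices)

    # Integer arithmetic avoids floating-point rounding at the cut points.
    n_train = n_rows * 8 // 10
    n_valid = n_rows // 10

    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]   # remainder goes to the test set
    return train, valid, test
```

Calling `split_indices(1000)` twice in the same environment returns identical lists, while a different seed gives a different partition.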

How to get the dataset?

Solution#1: Download the raw dataset, and run the following scripts:
```
$ cd datasets/Avazu/Avazu_x4
$ python split_avazu_x4.py
```
Solution#2: For ease of reuse, the preprocessed data are available for downloading here.

Check the md5sum for consistency.

$ md5sum train.csv valid.csv test.csv
de3a27264cdabf66adf09df82328ccaa  train.csv
33232931d84d6452d3f956e936cab2c9  valid.csv
3ebb774a9ca74d05919b84a3d402986d  test.csv
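Where the `md5sum` command is not available, the same check can be sketched in Python with the standard `hashlib` module. The expected values are the ones listed above; the helper names `file_md5` and `check_files` and the `data_dir` argument are assumptions made for this example.

```python
import hashlib
import os

# Expected checksums, copied from the md5sum listing above.
EXPECTED_MD5 = {
    "train.csv": "de3a27264cdabf66adf09df82328ccaa",
    "valid.csv": "33232931d84d6452d3f956e936cab2c9",
    "test.csv": "3ebb774a9ca74d05919b84a3d402986d",
}


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_files(data_dir, expected=EXPECTED_MD5):
    """Map each expected file name to True if its checksum matches.

    A missing file is reported as False rather than raising an error.
    """
    results = {}
    for name, want in expected.items():
        path = os.path.join(data_dir, name)
        results[name] = os.path.isfile(path) and file_md5(path) == want
    return results
```

For example, `check_files("datasets/Avazu/Avazu_x4")` would return a dictionary with one boolean per file; any `False` entry means that file is missing or differs from the reference split.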

sdfsfdx commented 2 years ago

@zhujiem Thank you for your help and patience! I followed Solution#2 and got the preprocessed dataset with the same md5sum values.

You solved my problem, so I am closing this issue. Thanks!

reczoo / BARS

Data format mismatch #5
