reczoo / BARS

BARS: Towards Open Benchmarking for Recommender Systems https://openbenchmark.github.io/BARS
Apache License 2.0

Data format mismatch #5

sdfsfdx closed this issue 2 years ago

sdfsfdx commented 2 years ago

I followed the instructions at https://github.com/openbenchmark/Open-CTR-Benchmark/blob/master/benchmarks/AFM/AFM_Avazu_x4_001.md, but I found a data format mismatch.

In the data preprocessing script, https://github.com/openbenchmark/Open-CTR-Benchmark/blob/master/datasets/Avazu/Avazu_x4/split_avazu_x4.py, the output format is '*.csv'.

But in the running config, https://github.com/openbenchmark/Open-CTR-Benchmark/blob/master/benchmarks/AFM/AFM_avazu_x4_tuner_config_01.yaml, the expected format is '*.h5'.

Tesla-1i commented 2 years ago

Please follow https://github.com/huawei-noah/benchmark/tree/main/FuxiCTR/README.md

zhujiem commented 2 years ago

@sdfsfdx Thanks for raising this inconsistency. Yes, it is indeed a problem in our documentation. Fixing it takes a lot of effort, and we are still working on it. We will make an update this week and respond to you then.

BTW, we found that the data splits may not be consistent even though the random seed has been fixed. Would you please check the md5sum values of your train.csv, valid.csv, and test.csv? Thanks!

sdfsfdx commented 2 years ago

@Tesla-1i Thanks. The data format in that example is '*.csv'. I will try it a bit later.

sdfsfdx commented 2 years ago

@zhujiem I executed 'process.py' to split the Avazu dataset on the day I opened this issue, and the md5sum values of the three generated files are as follows:

2fa3064a7b7b9d6d4f0d77aa65d6998f  test.csv
00130fdcd6737fdcce778c8000357590  train.csv
75fbb640b9d1d88b64bd55579130f352  valid.csv
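For readers who want to run the same check without the `md5sum` command, here is a minimal Python sketch that prints digests in the same "digest, then file name" layout. The helper names `md5_of_file` and `checksum_report` are made up for illustration and are not part of the benchmark code. The default directory `"."` is an assumption too.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_report(data_dir, names=("test.csv", "train.csv", "valid.csv")):
    """Map each split file that exists in data_dir to its MD5 digest."""
    data_dir = Path(data_dir)
    return {
        name: md5_of_file(data_dir / name)
        for name in names
        if (data_dir / name).is_file()
    }


if __name__ == "__main__":
    # "." is an assumed location; point it at the directory holding the split files.
    for name, digest in checksum_report(".").items():
        print(f"{digest}  {name}")
```

Reading in chunks matters here because the Avazu split files are large. Loading a whole file into memory just to hash it is unnecessary.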

zhujiem commented 2 years ago

@sdfsfdx Your split is indeed different from ours. We will update the preprocessing code this week to ensure consistency. For now, could you follow Solution#2 below to get the split data?

To convert the csv data to h5, you can first reproduce LR (LR is configured with csv files). That run will generate the h5 data, which can then be reused in the other model configurations. See https://github.com/openbenchmark/Open-CTR-Benchmark/blob/master/benchmarks/LR/LR_Avazu_x4_001.md
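Before launching a config that expects h5 input, it can help to confirm that the LR run has already produced the h5 files. The following is only a sketch under stated assumptions: the file names `train`, `valid`, and `test` with `.h5` or `.csv` extensions are guesses, the helper `pick_data_format` is hypothetical, and the actual layout written by the benchmark code may differ.

```python
from pathlib import Path


def pick_data_format(data_dir, splits=("train", "valid", "test")):
    """Return "h5" if every split has an .h5 file, "csv" if every split
    has a .csv file, and None if neither set is complete.

    The split names and extensions are assumptions for illustration.
    """
    data_dir = Path(data_dir)
    # Prefer h5, since that is what the non-LR configs in this thread expect.
    for ext in ("h5", "csv"):
        if all((data_dir / f"{split}.{ext}").is_file() for split in splits):
            return ext
    return None
```

A result of `"csv"` would suggest running the LR benchmark first. A result of `"h5"` would suggest the data is ready for the other model configurations.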

Avazu_x4

sdfsfdx commented 2 years ago

@zhujiem Thank you for your help and patience! I followed Solution#2 and got the preprocessed dataset with the same md5sum values.

You solved my problem, so I am closing this issue. Thanks!