mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0

DLRM: provide a single script for generating a fake dataset #606

Closed psyhtest closed 3 years ago

psyhtest commented 4 years ago

To generate a fake dataset, I've used the following command sequence:

$ cd $WORKSPACE/inference/v0.5/recommendation
$ python -m pip install numpy --user
$ cd tools/
$ ./make_fake_criteo.sh terabyte0875
$ mv fake_criteo/ ../

Here, make_fake_criteo.sh calls quickgen.py, which is also in the tools/ directory, hence the need to change into that directory first.

However, all that make_fake_criteo.sh does is create a new directory if it doesn't already exist, check that its only argument is one of kaggle|terabyte0875|terabyte, and pass that argument to quickgen.py:

python quickgen.py --num-samples=4096 --profile=$QUICKGEN_PROFILE --output-dir=$OUTPUT_DIR

Actually, the script even gets in the way because it hard-codes python rather than python3 (the two differ on, e.g., Ubuntu 16.04, where python is Python 2). For example, I didn't have NumPy installed for Python 2, so I had to install it first.

This functionality can be folded into quickgen.py itself.
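One way the wrapper's behaviour could be folded in is sketched below. This is only an illustration, not the actual quickgen.py: the flag names come from the command quoted above, while the helper names (`parse_args`, `prepare_output_dir`), the defaults of 4096 samples and a `fake_criteo` output directory, and making `--profile` required are assumptions made for the example.

```python
import argparse
import os

# Profile names as listed in this issue.
PROFILES = ("kaggle", "terabyte0875", "terabyte")


def parse_args(argv=None):
    """Parse the options that make_fake_criteo.sh currently forwards (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Generate a fake Criteo-like dataset (sketch)."
    )
    # 4096 is the value the shell wrapper passes; using it as a default is an assumption.
    parser.add_argument("--num-samples", type=int, default=4096)
    # argparse rejects any value outside `choices` and exits with a usage message.
    # Required here (an assumption) to sidestep the question of a default profile.
    parser.add_argument("--profile", choices=PROFILES, required=True)
    # Default directory name is an assumption based on the `mv fake_criteo/ ../` step.
    parser.add_argument("--output-dir", default="fake_criteo")
    return parser.parse_args(argv)


def prepare_output_dir(path):
    """Create the output directory if it does not exist yet (hypothetical helper)."""
    os.makedirs(path, exist_ok=True)  # exist_ok needs Python 3
    return path
```

With something like this in place, the script could be run directly with python3 from any directory, for example with `--profile=terabyte0875`, and the shell wrapper would no longer be needed.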

psyhtest commented 4 years ago

Here's a good spot to raise an error if the profile argument (including its default value) is not one of kaggle|terabyte0875|terabyte.
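The suggested check could look roughly like the following. The location referred to above is not preserved in this thread, so the function name `validate_profile` and the way it would be called are hypothetical; only the three profile names come from the issue. Because the check runs on whatever value is finally used, it also covers a default that was never passed on the command line.

```python
# Profile names as listed in this issue.
VALID_PROFILES = ("kaggle", "terabyte0875", "terabyte")


def validate_profile(profile):
    """Return `profile` unchanged, or raise ValueError if it is not a known profile.

    Hypothetical helper: apply it to the final value, whether it was supplied
    by the user or taken from a default.
    """
    if profile not in VALID_PROFILES:
        raise ValueError(
            "unknown profile {!r}; expected one of: {}".format(
                profile, ", ".join(VALID_PROFILES)
            )
        )
    return profile
```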

christ1ne commented 3 years ago

Closing this for now, @psyhtest; please reach out to @mnaumovfb if needed.