psyhtest closed this issue 4 years ago.
Just a note that generating a fake dataset with only two days' worth of data required a bit of effort. Despite num_days being a command-line parameter to quickgen.py, it is actually hardcoded for the kaggle, terabyte0875, and terabyte profiles.
@mnaumovfb please share your thoughts on this. Taking a day to run accuracy checking seems extreme.
Even though the Criteo Terabyte dataset is ~400 GB, the inference run uses only the 1st half of the last day, i.e. ~8 GB = 400 GB / (24 * 2) half-days. The 2nd half of the last day is used for quantization calibration (I will add the instructions for it shortly).
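As a quick sanity check of that arithmetic, here is a throwaway Python snippet using the approximate 400 GB figure above (the exact dataset size will differ):

# Approximate size of the evaluation slice: half of one day out of 24.
total_gb = 400
eval_slice_gb = total_gb / (24 * 2)
print(round(eval_slice_gb, 1))  # ~8.3 GB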
Having said that, the code does need access to the full dataset so that it can pre-process it (including the last day). Therefore, quickgen will generate all 24 days of data for the --profile=terabyte[0875] option, using random numbers to fill in these days. The quantity of samples per day is controlled by the --num-samples parameter and can be as low as 4096 (default). To summarize, the code needs to see all 24 days, but the number of samples per day can be as small as needed.
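For reference, a hypothetical quickgen invocation along these lines might look like the command below; the script path and exact flag spellings are assumptions, so please check the recommendation benchmark's tools directory and the script's --help output.

python tools/quickgen.py --profile=terabyte0875 --num-samples=4096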
The processing step needs to process all 24 days, which will take a long time to generate the .bin files. But every time inference is run, you only need to run on day 23 for inference testing and accuracy. The inference target accuracy is the test accuracy/AUC, not the validation accuracy. Please try the offline mode. @aaronzhongii
@mnaumovfb please check if this CPU command line is correct: ./run_local.sh pytorch dlrm terabyte cpu --accuracy --scenario Offline --max-ind-range=40000000
Here is the full output of the run. I added time prints before lg.ConstructSUT and after lg.DestroySUT(sut); you can see it took almost 11 days to finish an accuracy run:
(py3) xxxxx@xxxxx:/mnt/ssd/aaronzhong/dlrm/inference/v0.5/recommendation$ ./run_local.sh pytorch dlrm terabyte cpu --accuracy --scenario Offline --max-ind-range=40000000
@mnaumovfb will continue to investigate with larger batch sizes. @aaronzhongii observed that the accuracy script uses only 1 of the available cores. The last comment showed 11 days to complete.
@aaronzhongii Can you please try the following and let me know if it works faster for you?
On line 39 in main.py, set QUERY_LEN_CAP = 1. Then run the command:
./run_local.sh pytorch dlrm terabyte cpu --accuracy --scenario Offline --max-ind-range=40000000 --samples-to-aggregate-fix=4096 --max-batchsize=4096
@dkorchevgithub and I have investigated this issue. It seems that the slowdown in runtime with the --accuracy flag is coming from the logging mechanism in loadgen, which uses a mutex to write into the log file one query sample at a time; see the LogAccuracy function in logging.cc.
Note that the above command effectively exchanges 2048 query samples (default QUERY_LEN_CAP = 2048) for a single query with 4096 samples (which will be written at once). We expect this to be significantly faster than the prior command, while still going through the same dataset and computing the same metrics at the end.
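To make the effect concrete, here is a toy Python sketch of the batching idea. This is not loadgen's actual code; it only mimics the overhead pattern of acquiring a lock and issuing a small write once per sample versus acquiring it once per aggregated query and writing the whole batch.

# Toy illustration of per-sample vs. batched logging under a mutex.
# NOT loadgen's code; it only mimics the overhead pattern described above.
import tempfile
import threading
import time

LOCK = threading.Lock()
NUM_SAMPLES = 200_000
RECORD = b'{"qsl_idx": 0, "data": "..."}\n'  # stand-in for one accuracy record

def log_per_sample(f):
    # One lock acquisition and one small write per query sample.
    for _ in range(NUM_SAMPLES):
        with LOCK:
            f.write(RECORD)

def log_per_query(f, samples_per_query=4096):
    # One lock acquisition and one large write per aggregated query.
    batch = RECORD * samples_per_query
    for _ in range(0, NUM_SAMPLES, samples_per_query):
        with LOCK:
            f.write(batch)

for fn in (log_per_sample, log_per_query):
    with tempfile.TemporaryFile() as f:
        start = time.time()
        fn(f)
        print(f"{fn.__name__}: {time.time() - start:.3f}s")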
by using "top" to check the cpu usage of the python script,
I followed the instructions and the test script is running now. The CPU usage of the Python process is at 600%, compared to 100% previously, so it should be about 6 times faster; I expect the test to finish in two days instead of 11.
I have noticed that using 2048 (rather than 4096) could further speed up the run. Maybe something to try in the future:
--samples-to-aggregate-fix=2048 --max-batchsize=2048
New log with 4096. It looks like it finished in 2 hours:
./run_local.sh pytorch dlrm terabyte cpu --accuracy --scenario Offline --max-ind-range=40000000 --samples-to-aggregate-fix=4096 --max-batchsize=4096
I guess the runtime issue is resolved then. Please let me know if anything else is still missing.
Fix verified. @mnaumovfb will make a small change, in accuracy mode only, to speed up writing the logs.
The following PR https://github.com/mlperf/inference/pull/674 adds a flexible way of sizing the offline query. Now, rather than changing the code ("QUERY_LEN_CAP = 1"), you just need to pass the argument --samples-per-query-offline=1. The README file has been adjusted accordingly.
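For example, the earlier run could then be launched without editing main.py as follows (this combination of flags is an assumption based on the commands above, not a command quoted from the README):

./run_local.sh pytorch dlrm terabyte cpu --accuracy --scenario Offline --max-ind-range=40000000 --samples-per-query-offline=1 --samples-to-aggregate-fix=4096 --max-batchsize=4096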
WG: Believe this is resolved already.
Here is the full output of the run. I added time prints before lg.ConstructSUT and after lg.DestroySUT(sut); you can see it took almost 11 days to finish an accuracy run:
(py3) xxxxx@xxxxx:/mnt/ssd/aaronzhong/dlrm/inference/v0.5/recommendation$ ./run_local.sh pytorch dlrm terabyte cpu --accuracy --scenario Offline --max-ind-range=40000000
- python python/main.py --profile dlrm-terabyte-pytorch --config ../mlperf.conf --model dlrm --model-path /mnt/ssd/aaronzhong/dlrm/inference/v0.5/recommendation/model/dlrm_terabyte.pytorch --dataset terabyte --dataset-path /mnt/ssd/aaronzhong/dataset/criteo --output /mnt/ssd/aaronzhong/dlrm/inference/v0.5/recommendation/output/pytorch-cpu/dlrm --accuracy --scenario Offline --max-ind-range=40000000
INFO:main:Namespace(accuracy=True, backend='pytorch-native', cache=0, config='../mlperf.conf', count_queries=None, count_samples=None, data_sub_sample_rate=0.0, dataset='terabyte', dataset_path='/mnt/ssd/aaronzhong/dataset/criteo', duration=None, find_peak_performance=False, inputs=['continuous and categorical features'], max_batchsize=2048, max_ind_range=40000000, max_latency=None, mlperf_bin_loader=False, model='dlrm', model_path='/mnt/ssd/aaronzhong/dlrm/inference/v0.5/recommendation/model/dlrm_terabyte.pytorch', numpy_rand_seed=123, output='/mnt/ssd/aaronzhong/dlrm/inference/v0.5/recommendation/output/pytorch-cpu/dlrm', outputs=['probability'], profile='dlrm-terabyte-pytorch', samples_per_query=None, samples_to_aggregate_fix=None, samples_to_aggregate_max=None, samples_to_aggregate_min=None, samples_to_aggregate_quantile_file=None, samples_to_aggregate_trace_file='dlrm_trace_of_aggregated_samples.txt', scenario='Offline', target_qps=None, test_num_workers=0, threads=96, use_gpu=False)
Using CPU...
Reading pre-processed data=/mnt/ssd/aaronzhong/dataset/criteo/terabyte_processed.npz
Sparse features= 26, Dense features= 13
Using fixed query size: 1
2020-07-07 11:23:24.804953
INFO:main:starting TestScenario.Offline
TestScenario.Offline qps=68.65, mean=0.1348, time=634.016, acc=96.586%, auc=62.899%, queries=43525, tiles=50.0:0.1260,80.0:0.1324,90.0:0.1384,95.0:0.1474,99.0:0.1937,99.9:0.2253
2020-07-18 06:28:28.515641

I got the same result as you. How did you get the AUC up to 80%?
The processing step needs to process all 24 days, which will take a long time to generate the .bin files. But every time inference is run, you only need to run on day 23 for inference testing and accuracy. The inference target accuracy is the test accuracy/AUC, not the validation accuracy. Please try the offline mode. @aaronzhongii
Did you mean that I should use the full day_23 for pre-processing and then run inference to get the AUC up to 80%? If I use only part of day_23, does the AUC only go up to 62%?
Given the size of the Criteo dataset (343 GB), it's a bit scary to launch a full accuracy run on the real data.
I was hoping that it would be possible to run it on a single day's data only. However, when I generated a fake dataset with only two days' worth of data (see below), the reference implementation failed to run.
So I suspect the same thing will happen if I put one real file into a separate directory and point the benchmark to that.