psyhtest closed this issue 4 years ago.
Just a note that generating a fake dataset with only two days' worth of data required a bit of effort. Despite num_days being a command-line parameter to quickgen.py, it is actually hardcoded for the kaggle, terabyte0875, and terabyte profiles.
@mnaumovfb please share your thoughts on this. Taking a day to run accuracy checking seems extreme.
Even though the Criteo Terabyte dataset is ~400 GB, the inference run uses only the 1st half of the last day, i.e. ~8 GB = 400 GB / (24 * 2) half-days. The 2nd half of the last day is used for quantization calibration (I will add the instructions for it shortly).
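As a quick sanity check of that arithmetic, here is a throwaway Python snippet using the approximate 400 GB figure above (the exact dataset size will differ):

# Approximate size of the evaluation slice: half of one day out of 24.
total_gb = 400
eval_slice_gb = total_gb / (24 * 2)
print(round(eval_slice_gb, 1))  # ~8.3 GB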
Having said that, the code does need access to the full dataset so that it can pre-process it (including the last day). Therefore, quickgen will generate all 24 days of data for the --profile=terabyte[0875] option, using random numbers to fill in these days. The quantity of samples per day is controlled by the --num-samples parameter and can be as low as 4096 (default). To summarize, the code needs to see all 24 days, but the number of samples per day can be as small as needed.
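For reference, a hypothetical quickgen invocation along these lines might look like the command below; the script path and exact flag spellings are assumptions, so please check the recommendation benchmark's tools directory and the script's --help output.

python tools/quickgen.py --profile=terabyte0875 --num-samples=4096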
The processing step needs to process all 24 days, which will take a long time to generate the .bin files. But every time inference is run, you only need to run on day 23 for inference testing and accuracy. The inference target accuracy is the test accuracy/AUC, not the validation accuracy. Please try the offline mode. @aaronzhongii
@mnaumovfb please check if this CPU command line is correct: ./run_local.sh pytorch dlrm terabyte cpu --accuracy --scenario Offline --max-ind-range=40000000
Here is the full output of the run. I added time prints before lg.ConstructSUT and after lg.DestroySUT(sut); you can see it took almost 11 days to finish an accuracy run:
(py3) xxxxx@xxxxx:/mnt/ssd/aaronzhong/dlrm/inference/v0.5/recommendation$ ./run_local.sh pytorch dlrm terabyte cpu --accuracy --scenario Offline --max-ind-range=40000000
@mnaumovfb will continue to investigate with larger batch sizes. @aaronzhongii observed that the accuracy script uses only 1 of the available cores. The last comment showed 11 days to complete.
@aaronzhongii Can you please try the following and let me know if it works faster for you?
On line 39 in main.py, set QUERY_LEN_CAP = 1. Then run the command:
./run_local.sh pytorch dlrm terabyte cpu --accuracy --scenario Offline --max-ind-range=40000000 --samples-to-aggregate-fix=4096 --max-batchsize=4096
@dkorchevgithub and I have investigated this issue. It seems that the slowdown in runtime with the --accuracy flag is coming from the logging mechanism in loadgen, which uses a mutex to write into the log file one query sample at a time; see the LogAccuracy function in logging.cc.
Note that the above command effectively exchanges 2048 query samples (default QUERY_LEN_CAP = 2048) for a single query with 4096 samples (which will be written at once). We expect this to be significantly faster than the prior command, while still going through the same dataset and computing the same metrics at the end.
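To make the effect concrete, here is a toy Python sketch of the batching idea. This is not loadgen's actual code; it only mimics the overhead pattern of acquiring a lock and issuing a small write once per sample versus acquiring it once per aggregated query and writing the whole batch.

# Toy illustration of per-sample vs. batched logging under a mutex.
# NOT loadgen's code; it only mimics the overhead pattern described above.
import tempfile
import threading
import time

LOCK = threading.Lock()
NUM_SAMPLES = 200_000
RECORD = b'{"qsl_idx": 0, "data": "..."}\n'  # stand-in for one accuracy record

def log_per_sample(f):
    # One lock acquisition and one small write per query sample.
    for _ in range(NUM_SAMPLES):
        with LOCK:
            f.write(RECORD)

def log_per_query(f, samples_per_query=4096):
    # One lock acquisition and one large write per aggregated query.
    batch = RECORD * samples_per_query
    for _ in range(0, NUM_SAMPLES, samples_per_query):
        with LOCK:
            f.write(batch)

for fn in (log_per_sample, log_per_query):
    with tempfile.TemporaryFile() as f:
        start = time.time()
        fn(f)
        print(f"{fn.__name__}: {time.time() - start:.3f}s")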
by using "top" to check the cpu usage of the python script,
I followed the instructions and the test script is running now. The CPU usage of the Python process is at 600%, compared to 100% previously, so it should be about 6 times faster; I expect the test to finish in two days instead of 11.
I have noticed that using 2048 (rather than 4096) could further speed up the run. Maybe something to try in the future:
--samples-to-aggregate-fix=2048 --max-batchsize=2048
New log with 4096. It looks like it finished in 2 hours:
./run_local.sh pytorch dlrm terabyte cpu --accuracy --scenario Offline --max-ind-range=40000000 --samples-to-aggregate-fix=4096 --max-batchsize=4096
I guess the runtime issue is resolved then. Please let me know if anything else is still missing.
Fix verified. @mnaumovfb will make a small change, in accuracy mode only, to speed up writing the logs.
The following PR https://github.com/mlperf/inference/pull/674 adds a flexible way of sizing the offline query. Now, rather than changing the code ("QUERY_LEN_CAP = 1"), you just need to pass the argument --samples-per-query-offline=1. The README file has been adjusted accordingly.
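For example, the earlier run could then be launched without editing main.py as follows (this combination of flags is an assumption based on the commands above, not a command quoted from the README):

./run_local.sh pytorch dlrm terabyte cpu --accuracy --scenario Offline --max-ind-range=40000000 --samples-per-query-offline=1 --samples-to-aggregate-fix=4096 --max-batchsize=4096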
WG: Believe this is resolved already.
Here is the full output of the run. I added time prints before lg.ConstructSUT and after lg.DestroySUT(sut); you can see it took almost 11 days to finish an accuracy run:
(py3) xxxxx@xxxxx:/mnt/ssd/aaronzhong/dlrm/inference/v0.5/recommendation$ ./run_local.sh pytorch dlrm terabyte cpu --accuracy --scenario Offline --max-ind-range=40000000
- python python/main.py --profile dlrm-terabyte-pytorch --config ../mlperf.conf --model dlrm --model-path /mnt/ssd/aaronzhong/dlrm/inference/v0.5/recommendation/model/dlrm_terabyte.pytorch --dataset terabyte --dataset-path /mnt/ssd/aaronzhong/dataset/criteo --output /mnt/ssd/aaronzhong/dlrm/inference/v0.5/recommendation/output/pytorch-cpu/dlrm --accuracy --scenario Offline --max-ind-range=40000000
INFO:main:Namespace(accuracy=True, backend='pytorch-native', cache=0, config='../mlperf.conf', count_queries=None, count_samples=None, data_sub_sample_rate=0.0, dataset='terabyte', dataset_path='/mnt/ssd/aaronzhong/dataset/criteo', duration=None, find_peak_performance=False, inputs=['continuous and categorical features'], max_batchsize=2048, max_ind_range=40000000, max_latency=None, mlperf_bin_loader=False, model='dlrm', model_path='/mnt/ssd/aaronzhong/dlrm/inference/v0.5/recommendation/model/dlrm_terabyte.pytorch', numpy_rand_seed=123, output='/mnt/ssd/aaronzhong/dlrm/inference/v0.5/recommendation/output/pytorch-cpu/dlrm', outputs=['probability'], profile='dlrm-terabyte-pytorch', samples_per_query=None, samples_to_aggregate_fix=None, samples_to_aggregate_max=None, samples_to_aggregate_min=None, samples_to_aggregate_quantile_file=None, samples_to_aggregate_trace_file='dlrm_trace_of_aggregated_samples.txt', scenario='Offline', target_qps=None, test_num_workers=0, threads=96, use_gpu=False)
Using CPU...
Reading pre-processed data=/mnt/ssd/aaronzhong/dataset/criteo/terabyte_processed.npz
Sparse features= 26, Dense features= 13
Using fixed query size: 1
2020-07-07 11:23:24.804953
INFO:main:starting TestScenario.Offline
TestScenario.Offline qps=68.65, mean=0.1348, time=634.016, acc=96.586%, auc=62.899%, queries=43525, tiles=50.0:0.1260,80.0:0.1324,90.0:0.1384,95.0:0.1474,99.0:0.1937,99.9:0.2253
2020-07-18 06:28:28.515641

I got the same result as you. How did you get the AUC up to 80%?
The processing step needs to process all 24 days, which will take a long time to generate the .bin files. But every time inference is run, you only need to run on day 23 for inference testing and accuracy. The inference target accuracy is the test accuracy/AUC, not the validation accuracy. Please try the offline mode. @aaronzhongii
Did you mean that I should use the full day_23 for pre-processing and then run inference to get the AUC up to 80%? If I use only part of day_23, does the AUC only go up to 62%?
Given the size of the Criteo dataset (343 GB), it's a bit scary to launch a full accuracy run on the real data.
I was hoping that it would be possible to run it on a single day's data only. However, when I generated a fake dataset with only two days' worth of data (see below), the reference implementation failed to run.
So I suspect the same thing will happen if I put one real file into a separate directory and point the benchmark to that.