openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

Measure inference time #532

Closed PGijsbers closed 1 year ago

PGijsbers commented 1 year ago

help wanted: I am posting this now since it would be nice to have some feedback. I need to merge this very soon in order to run experiments on time.


This PR introduces improved measurements for inference time; there are two major changes:

A few notes:

Current implementation’s result file after running `python runbenchmark.py FRAMEWORK -t iris -f 0` for constantpredictor, tpot, and autogluon:

id,task,framework,constraint,fold,type,result,metric,mode,version,params,app_version,utc,duration,training_duration,predict_duration,models_count,seed,info,acc,balacc,logloss,inference_10000_rows,inference_1000_rows,inference_1_rows,models_ensemble_count
openml.org/t/59,iris,constantpredictor,test,0,multiclass,-1.09861,neg_logloss,local,1.2.2,,"dev [measure_inference_time, 71e7a8b]",2023-06-12T09:58:32,0.2,0.0002,4e-05,1,1757794907,,0.333333,0.333333,1.09861,0.00123646,0.00113094,0.00232852,
openml.org/t/59,iris,TPOT,test,0,multiclass,-0.0131854,neg_logloss,local,0.12.0,,"dev [measure_inference_time, 71e7a8b]",2023-06-12T09:59:55,76.4,65.4,0.0002,368,1156006760,,1.0,1.0,0.0131854,0.0196556,0.00544026,0.0176911,
openml.org/t/59,iris,AutoGluon,test,0,multiclass,-0.746956,neg_logloss,local,0.7.0,,"dev [measure_inference_time, 71e7a8b]",2023-06-12T10:00:30,28.4,18.7,0.002,14,2011490481,,0.933333,0.933333,0.746956,0.0672353,0.0116372,0.00261562,2.0
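For readers skimming the thread, a minimal sketch of how per-batch-size inference timing could be collected (the helper name `measure_inference_times`, the repeat count, and the sampling strategy are illustrative assumptions, not the PR's actual implementation):

```python
import time
import pandas as pd

def measure_inference_times(predict_fn, test_data: pd.DataFrame,
                            batch_sizes=(1, 1000, 10000), repeats=10):
    """Time `predict_fn` on batches sampled (with replacement) from `test_data`.

    Returns a dict mapping batch size to the median duration in seconds over `repeats` calls.
    """
    timings = {}
    for size in batch_sizes:
        batch = test_data.sample(n=size, replace=True, random_state=0)
        durations = []
        for _ in range(repeats):
            start = time.perf_counter()
            predict_fn(batch)
            durations.append(time.perf_counter() - start)
        durations.sort()
        timings[size] = durations[len(durations) // 2]  # median
    return timings
```

With `batch_sizes=(1, 1000, 10000)`, such a helper would correspond to the `inference_1_rows`, `inference_1000_rows`, and `inference_10000_rows` columns in the result file above.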

Notes:

Innixma commented 1 year ago

> Moreover, I was wondering whether you could also retrieve the times for 10 and 100 samples? Or is this something that doesn't appear in real life (maybe @Innixma knows?)?

@mfeurer It does occur, but it is generally an easier scenario than batch 1 and large-batch.

10 and 100 samples will often have near-identical total latency to 1 sample. It is only once you go beyond 100 (and often 1000) that you start to avoid fixed-cost overheads dominating the runtime.
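To make this concrete, a toy latency model (made-up constants, not measured numbers): total latency is roughly a fixed per-call overhead plus a linear per-row cost, so for small batches the fixed part dominates:

```python
def toy_latency(n_rows, fixed_overhead=0.002, per_row_cost=1e-6):
    """Toy model: constant per-call overhead plus a per-row cost, in seconds."""
    return fixed_overhead + n_rows * per_row_cost

for n in (1, 10, 100, 1000, 10000):
    print(f"{n:>5} rows -> {toy_latency(n):.4f} s")
# 1, 10, and 100 rows all land near 0.002 s (overhead-dominated);
# only at 1000+ rows does the per-row cost start to matter.
```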

Given they are passing 100 samples, they probably didn't get those 100 samples all at the same time and instead waited to group them together. The fact that they waited to group them rather than sending them one at a time is an indicator that the scenario isn't very time-sensitive. This isn't always the case (for example, maybe they have to batch or else it's too slow), but that describes a lot of the production deployments that would send small batches of data.

I'd be ok both with and without those 10 & 100 measurements. I think 1 and 10000+ are the most important as they are the most challenging. I'd lean towards including them for the sake of completeness.

PGijsbers commented 1 year ago

status update (and note to self):