openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

Measure inference time #532

Closed PGijsbers closed 1 year ago

PGijsbers commented 1 year ago

help wanted: I am posting this now since it would be nice to have some feedback. I need to merge this very soon in order to run experiments on time.


This PR introduces improved measurements for inference time; there are two major changes:

A few notes:

Current implementation’s result file after running `python runbenchmark.py FRAMEWORK -t iris -f 0` for constantpredictor, tpot, and autogluon:

id,task,framework,constraint,fold,type,result,metric,mode,version,params,app_version,utc,duration,training_duration,predict_duration,models_count,seed,info,acc,balacc,logloss,inference_10000_rows,inference_1000_rows,inference_1_rows,models_ensemble_count
openml.org/t/59,iris,constantpredictor,test,0,multiclass,-1.09861,neg_logloss,local,1.2.2,,"dev [measure_inference_time, 71e7a8b]",2023-06-12T09:58:32,0.2,0.0002,4e-05,1,1757794907,,0.333333,0.333333,1.09861,0.00123646,0.00113094,0.00232852,
openml.org/t/59,iris,TPOT,test,0,multiclass,-0.0131854,neg_logloss,local,0.12.0,,"dev [measure_inference_time, 71e7a8b]",2023-06-12T09:59:55,76.4,65.4,0.0002,368,1156006760,,1.0,1.0,0.0131854,0.0196556,0.00544026,0.0176911,
openml.org/t/59,iris,AutoGluon,test,0,multiclass,-0.746956,neg_logloss,local,0.7.0,,"dev [measure_inference_time, 71e7a8b]",2023-06-12T10:00:30,28.4,18.7,0.002,14,2011490481,,0.933333,0.933333,0.746956,0.0672353,0.0116372,0.00261562,2.0
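For readers skimming the thread, a minimal sketch of how per-batch-size inference timing could be collected (the helper name `measure_inference_times`, the repeat count, and the sampling strategy are illustrative assumptions, not the PR's actual implementation):

```python
import time
import pandas as pd

def measure_inference_times(predict_fn, test_data: pd.DataFrame,
                            batch_sizes=(1, 1000, 10000), repeats=10):
    """Time `predict_fn` on batches sampled (with replacement) from `test_data`.

    Returns a dict mapping batch size to the median duration in seconds over `repeats` calls.
    """
    timings = {}
    for size in batch_sizes:
        batch = test_data.sample(n=size, replace=True, random_state=0)
        durations = []
        for _ in range(repeats):
            start = time.perf_counter()
            predict_fn(batch)
            durations.append(time.perf_counter() - start)
        durations.sort()
        timings[size] = durations[len(durations) // 2]  # median
    return timings
```

With `batch_sizes=(1, 1000, 10000)`, such a helper would correspond to the `inference_1_rows`, `inference_1000_rows`, and `inference_10000_rows` columns in the result file above.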

Notes:

Innixma commented 1 year ago

> Moreover, I was wondering whether you could also retrieve the times for 10 and 100 samples? Or is this something that doesn't appear in real life (maybe @Innixma knows?)?

@mfeurer It does occur, but it is generally an easier scenario than batch 1 and large-batch.

10 and 100 samples will often have near-identical total latency to 1 sample. It is only once you go beyond 100 (and often 1000) that you start to avoid fixed-cost overheads dominating the runtime.
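To make this concrete, a toy latency model (made-up constants, not measured numbers): total latency is roughly a fixed per-call overhead plus a linear per-row cost, so for small batches the fixed part dominates:

```python
def toy_latency(n_rows, fixed_overhead=0.002, per_row_cost=1e-6):
    """Toy model: constant per-call overhead plus a per-row cost, in seconds."""
    return fixed_overhead + n_rows * per_row_cost

for n in (1, 10, 100, 1000, 10000):
    print(f"{n:>5} rows -> {toy_latency(n):.4f} s")
# 1, 10, and 100 rows all land near 0.002 s (overhead-dominated);
# only at 1000+ rows does the per-row cost start to matter.
```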

Given they are passing 100 samples, they probably didn't get those 100 samples all at the same time and instead waited to group them together. The fact that they waited to group them rather than sending them one at a time is an indicator that the scenario isn't very time-sensitive. This isn't always the case (for example, maybe they have to batch or else it's too slow), but that describes a lot of the production deployments that would send small batches of data.

I'd be ok both with and without those 10 & 100 measurements. I think 1 and 10000+ are the most important as they are the most challenging. I'd lean towards including them for the sake of completeness.

PGijsbers commented 1 year ago

status update (and note to self):