intel-cloud / cosbench

a benchmark tool for cloud object storage service

Collect and view results for user-terminated jobs #178

Closed · seagate-bt closed this issue 10 years ago

seagate-bt commented 10 years ago

tl;dr It would be great if you could prematurely complete a job in such a way that the data collected so far is still reported.

Currently in COSBench, after a job has completed, the results from the drivers are collected and reported: Bandwidth, Response-Time, Success-Ratio, etc. If a job is terminated during the "main" phase, the status shows as terminated and no data is collected or reported.

Hypothetical scenario: you have launched a long-running 20-hour job in COSBench to prove out a new Swift cluster. At the 18th hour the network has to be restarted to install critical patches, say for the Heartbleed OpenSSL bug, and the admins will not wait two more hours for the job to complete. From the GUI, I select the job and click a button to prematurely complete it. COSBench gracefully stops the work on the drivers and gathers the data on the controller for viewing, so I can still see the bandwidth, response time, success ratio, etc. for the 18 of the 20 hours that did run. The status does not need to say "Success"; something like "Conditional-Success", "Provisional-Success", or "Terminated-Success" would reflect that the job did not complete as submitted, but was stopped by the operator rather than by an error.
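For concreteness, a long-running job along these lines might be defined roughly as below. This is only a sketch of a COSBench workload config: the auth endpoint, credentials, worker counts, and object sizes are placeholders, and `runtime` is in seconds (20 hours = 72000 s).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a long-running proving job; endpoint, credentials,
     and sizing values below are placeholders, not recommendations. -->
<workload name="swift-20h-soak" description="20-hour Swift proving run">
  <auth type="swauth" config="username=test:tester;password=testing;auth_url=http://127.0.0.1:8080/auth/v1.0"/>
  <storage type="swift" config=""/>
  <workflow>
    <workstage name="init">
      <work type="init" workers="1" config="containers=r(1,4)"/>
    </workstage>
    <workstage name="prepare">
      <work type="prepare" workers="4" config="containers=r(1,4);objects=r(1,1000);sizes=c(4)MB"/>
    </workstage>
    <!-- the long "main" phase; terminating here currently loses all results -->
    <workstage name="main">
      <work name="main" workers="16" runtime="72000">
        <operation type="read" ratio="80" config="containers=u(1,4);objects=u(1,1000)"/>
        <operation type="write" ratio="20" config="containers=u(1,4);objects=u(1001,2000);sizes=c(4)MB"/>
      </work>
    </workstage>
    <workstage name="cleanup">
      <work type="cleanup" workers="4" config="containers=r(1,4);objects=r(1,2000)"/>
    </workstage>
  </workflow>
</workload>
```

Terminating this job at hour 18 means losing 18 hours of measurements, which is exactly the case this request is about.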

I don't know whether it is feasible to communicate with the drivers in this fashion. Mainly, I would like the ability to view the statistical data for jobs that do not complete.

ywang19 commented 10 years ago

It's certainly feasible; actually, the controller has already collected data points before the termination time, and we currently just discard them. We will consider supporting this. BTW, v0.4.0 beta2 will be uploaded this week; it includes a fix to avoid termination during long runs, so you may try it before this issue is resolved.
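To illustrate the idea (a hypothetical sketch only, not COSBench's actual internals; every class and method name here is made up), retaining the partial results on an operator-initiated stop, instead of discarding them, could look like this:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-run snapshot of the metrics the console reports.
class Snapshot {
    final long timestampMs;
    final double bandwidthMBps;
    final double avgResponseTimeMs;
    final double successRatio;

    Snapshot(long timestampMs, double bandwidthMBps,
             double avgResponseTimeMs, double successRatio) {
        this.timestampMs = timestampMs;
        this.bandwidthMBps = bandwidthMBps;
        this.avgResponseTimeMs = avgResponseTimeMs;
        this.successRatio = successRatio;
    }
}

// Hypothetical collector on the controller side.
class RunCollector {
    private final List<Snapshot> snapshots = new ArrayList<>();
    private String state = "PROCESSING";

    void record(Snapshot s) {
        snapshots.add(s);
    }

    // Operator-initiated stop: keep everything gathered so far and mark
    // the run so the console can distinguish it from a clean finish or
    // an error (e.g. the "Terminated-Success" state proposed above).
    List<Snapshot> cancelByOperator() {
        state = "TERMINATED-OK";
        return snapshots; // report partial results instead of discarding them
    }

    String state() {
        return state;
    }
}
```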

ywang19 commented 10 years ago

This is already included in the latest v0.4.0.1 code base.