inaccurate or incomplete error reporting

gonewest818 commented 9 years ago

I have a workload that ran about half of the tasks and then terminated.

One of the "prepare" stages is considered "failed" with a success ratio of 96.51%, but every one of the 32 subordinate missions is marked "accomplished". There's no indication in the dashboard view of which mission contains the failed write operations.

(1) If the stage has failed, it should be clear which of the missions had errors. (2) Whatever the error was, it should be discoverable from the UI and not just by inspecting logs.

ywang19 commented 9 years ago

On the workload page there is an error statistics div that counts the error types. Does that help?

gonewest818 commented 9 years ago

Not really. The workload page shows stages 2 and 8 "failed" and stage 40 was "terminated".

Viewing the error statistics div shows:
stage 2, 100% successful
stage 8, 96.51% successful
stage 40, no statistics reported

If I click into "view details" next to each stage:
stage 2, all missions are "accomplished"
stage 8, all missions are "accomplished"
stage 40, 31 out of 32 driver missions are "aborted". One driver has no mission ID but is "terminated"
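As a rough illustration of the mismatch described above, the three stages can be written down as data and checked for internal consistency. This is only a sketch: `Stage` and `find_inconsistencies` are hypothetical names, not part of COSBench, and the mission count for stage 2 is inferred from the "96 missions" total rather than stated directly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Stage:
    """Hypothetical record of what the UI reports for one stage."""
    name: str
    state: str                      # state shown on the workload page
    success_ratio: Optional[float]  # percent; None when no statistics are reported
    mission_states: List[str]       # states shown under "view details"


def find_inconsistencies(stage: Stage) -> List[str]:
    """Return human-readable descriptions of reporting mismatches."""
    problems = []
    all_ok = all(s == "accomplished" for s in stage.mission_states)
    if stage.state == "failed" and all_ok:
        problems.append("stage failed but every mission is accomplished")
    if stage.state == "failed" and stage.success_ratio == 100.0:
        problems.append("stage failed with a 100% success ratio")
    if stage.success_ratio is None:
        problems.append("no statistics reported")
    return problems


# Values taken from the report above (32 missions for stage 2 is an assumption).
stages = [
    Stage("stage 2", "failed", 100.0, ["accomplished"] * 32),
    Stage("stage 8", "failed", 96.51, ["accomplished"] * 32),
    Stage("stage 40", "terminated", None, ["aborted"] * 31 + ["terminated"]),
]

report = {s.name: find_inconsistencies(s) for s in stages}
for name, problems in report.items():
    print(name, problems)
```

Each of the three stages is flagged for a different reason, which is the inconsistency at issue here.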

In a nutshell, my issue is that in a single test I have three stages whose results are reported completely inconsistently, with no additional information to debug with. What exactly failed in each case? I suppose I need to grep through 96 missions' worth of output log to (maybe) find out. Clicking the "download-log" link results in 650MB of text to sort through, however. I can't easily grep the file because there's not enough identifying information on each log line to tell me which mission or which stage the error occurred in.
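To make the last point concrete, here is a minimal sketch of the kind of log line that would be easy to grep: every record carries a stage and mission identifier. It uses Python's standard `logging` module purely for illustration. It is not COSBench's logging code, and the helper name `make_mission_logger`, the IDs `s8` and `m17`, and the line format are all assumptions.

```python
import io
import logging


def make_mission_logger(stream, stage_id, mission_id):
    """Return a logger adapter that stamps every line with stage and mission IDs.

    Call once per (stage, mission) pair; calling again for the same pair
    would attach a second handler and duplicate each line.
    """
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s [stage=%(stage)s mission=%(mission)s] %(message)s"))
    logger = logging.getLogger(f"demo.{stage_id}.{mission_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    logger.addHandler(handler)
    # LoggerAdapter injects the extra fields into every record it emits.
    return logging.LoggerAdapter(logger, {"stage": stage_id, "mission": mission_id})


buf = io.StringIO()
log = make_mission_logger(buf, "s8", "m17")
log.error("write failed")
print(buf.getvalue(), end="")
```

With lines shaped like this, something like `grep "stage=s8"` over the downloaded log would narrow 650MB down to a single stage.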

Cosbench has great potential, but as you well know, debugging distributed systems is hard, so anything that can be done to ease debugging will promote adoption and use.

thanks-

ywang19 commented 9 years ago

Yes, ease of use is very helpful for troubleshooting, especially for distributed systems. I know error handling and reporting need to be enhanced, but I have been tied up with another project. As this is an open source project, I really hope more people can get involved in contributing, to increase its diversity and usability.


gonewest818 commented 9 years ago

Sure, I understand. If you're willing to keep these tickets around, then perhaps others can pick from the list and send you pull requests.

ywang19 commented 9 years ago

That’s my preference.

Thanks.


intel-cloud / cosbench

inaccurate or incomplete error reporting #280