intel-cloud / cosbench

a benchmark tool for cloud object storage service

Drivers (s3) get stuck when a large number of failures are seen #228

Open ddeflyer opened 9 years ago

ddeflyer commented 9 years ago

In a highly degenerate case where the hardware is overwhelmed and failing many writes, we are seeing one or more drivers get stuck past the time-based termination point for a test. These drivers are not doing any work; they are sitting idle but have not finished.

This is observed in 4.0.0.0 using the s3 connector. Please let me know what information would be useful to track down the issue.

ywang19 commented 9 years ago

It would be helpful if you could paste the workload XML and the corresponding log/system.log and workload.log. Also, if possible, please try the latest v0.4.1.0 beta1.

ddeflyer commented 9 years ago

workload-config.xml:

<?xml version="1.0" encoding="UTF-8"?>

ddeflyer commented 9 years ago

workload-log.txt is 1.3 MB, so I've uploaded it and the XML to a shared Google Drive folder at: https://drive.google.com/folderview?id=0B1exXslkMeyJczRLaTdSSWdXeFU&usp=sharing

ddeflyer commented 9 years ago

I can't run the test again for a few weeks, as the cluster is in use for other purposes. Once the cluster is freed up, I will give it a try.

ywang19 commented 9 years ago

Understood. We will look at your log to see if there are any clues.

From: ddeflyer [mailto:notifications@github.com] Sent: Thursday, December 18, 2014 10:32 AM To: intel-cloud/cosbench Cc: Wang, Yaguang Subject: Re: [cosbench] Drivers (s3) get stuck when a large number of failures are seen (#228)

ddeflyer commented 9 years ago

OK, I have access to the cluster for a few days and am going to try running the test shortly. If there is anything else I should try or any data I should gather, please let me know soon, as I might not have access after that.