Closed jrgruher closed 9 years ago
(removed the braces so the text would show up here)
?xml version="1.0" encoding="UTF-8" ?
workload name="1C-3MO-4MB-384w-read-write" description="1 Container 3M Objects of 4MB each - 384 Workers - Read and Write"
storage type="librados" config="endpoint=10.0.0.2;accesskey=admin;secretkey=AQAeaT1UiAndExAA2J/l7ZarJrsuzr7y0VJccg==" /
workflow
workstage name="prepare" work type="prepare" config="containers=r(1,1);objects=r(1,3000000);sizes=c(4)MB" workers="8"/ /workstage
workstage name="delay" closuredelay="300" work type="delay" workers="1"/ /workstage
Hi Joe,
From the configuration for prepare stage, you will create 3 Million objects with size 4MB, 24 (12*2) TB data will be written (assuming replica=2). may you increase "timeout"? btw, what's the ceph version?
-yaguang
This is with Ceph Firefly 0.80.6. I am having some interesting results. Originally with a larger number of workers (like 96) it would run for a while with less than 1% completion rate and then exit the stage with only about 160GB having been written (this should write about 12TB). The stage would report success and the individual drivers would be a mix of success and fail. I lowered the number of workers to 8 and now I notice it still initially runs with a very low completion rate but after I left it alone for a while (walked away for about 20 minutes) it went up to 100% completion rate. Letting the stage run to completion now, at about 500MB/sec it will take hours. I’ll try some more experiments and see if I can better define the behavior. I may also be able to pull a CSV file from the logs that shows the completion rate over time?
thanks, and waiting for your log files for both cases.
Here is an example of logs from a failing case: http://www.filedropper.com/w6
I'll post the passing case when it completes.
The most recent run, despite reporting 100% completion rate in the 5 second updates for most of the run, eventually still failed and only achieved 56% success ratio. This was with only 8 workers in the init stage, and a well provisioned Ceph cluster, this should not have been much of a load on the cluster. We can see the average response time is quite low so I don't think we're overloading the cluster. The workload parameters must be correct since many of the IOs succeeded. What would cause the rest to fail?
Op-Type Op-Count Byte-Count Avg-ResTime Avg-ProcTime Throughput Bandwidth Succ-Ratio prepare-write 1.68 mops 6.73 TB 63.17 ms 63.17 ms 120.64 op/s 482.57 MB/S 56.08%
Logs are attached here: http://www.filedropper.com/w11
Later, with no changes (except we bumped the init workers from 8 to 16 to speed things up), the full workload ran with 100 success ratio all the way through. Logs: http://www.filedropper.com/w14
The issue went away without any changes, it may be caused by cluster stability. I will tentatively close it, and reopen it if got reproduced someday.
When I run COSBench prepare stage against my Ceph cluster I get a very low success ratio (less than 1% but greater than zero). I do get some successes and see some data added to the Ceph cluster so I assume my authentication credential are correct. Not sure where to look next for the problem. Ceph cluster seems healthy and responds well to other test tools.
I see many error of this form in the workload.log:
================================================== stage: s1-prepare ================================================== ---------------------------------- mission: M52F1F278C4, driver: driver01 ---------------------------------- 2014-10-20 12:53:31,086 [INFO] [Log4jLogManager] - will append log to file /home/ceph/0.4.0.e1/log/mission/M52F1F278C4.log 2014-10-20 12:53:31,383 [ERROR] [AbstractOperator] - fail to perform write operation com.intel.cosbench.api.storage.StorageException: com.ceph.rados.RadosException: Failed writing 4000000 bytes to myobjects1 (-2) at com.intel.cosbench.api.librados.LibradosStorage.createObject(LibradosStorage.java:147) at com.intel.cosbench.driver.operator.Writer.doWrite(Writer.java:96) at com.intel.cosbench.driver.operator.Preparer.operate(Preparer.java:95) at com.intel.cosbench.driver.operator.AbstractOperator.operate(AbstractOperator.java:71) at com.intel.cosbench.driver.agent.WorkAgent.performOperation(WorkAgent.java:179) at com.intel.cosbench.driver.agent.WorkAgent.doWork(WorkAgent.java:162) at com.intel.cosbench.driver.agent.WorkAgent.execute(WorkAgent.java:119) at com.intel.cosbench.driver.agent.AbstractAgent.call(AbstractAgent.java:44) at com.intel.cosbench.driver.agent.AbstractAgent.call(AbstractAgent.java:1) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: com.ceph.rados.RadosException: Failed writing 4000000 bytes to myobjects1 (-2) at com.ceph.rados.IoCTX.write(IoCTX.java:181) at com.intel.cosbench.api.librados.LibradosStorage.createObject(LibradosStorage.java:145) ... 12 more 2014-10-20 12:53:31,384 [ERROR] [AbstractOperator] - fail to perform write operation com.intel.cosbench.api.storage.StorageException: com.ceph.rados.RadosException: Failed writing 4000000 bytes to myobjects15626 (-2) at com.intel.cosbench.api.librados.LibradosStorage.createObject(LibradosStorage.java:147) at com.intel.cosbench.driver.operator.Writer.doWrite(Writer.java:96) at com.intel.cosbench.driver.operator.Preparer.operate(Preparer.java:95) at com.intel.cosbench.driver.operator.AbstractOperator.operate(AbstractOperator.java:71) at com.intel.cosbench.driver.agent.WorkAgent.performOperation(WorkAgent.java:179) at com.intel.cosbench.driver.agent.WorkAgent.doWork(WorkAgent.java:162) at com.intel.cosbench.driver.agent.WorkAgent.execute(WorkAgent.java:119) at com.intel.cosbench.driver.agent.AbstractAgent.call(AbstractAgent.java:44) at com.intel.cosbench.driver.agent.AbstractAgent.call(AbstractAgent.java:1) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: com.ceph.rados.RadosException: Failed writing 4000000 bytes to myobjects15626 (-2) at com.ceph.rados.IoCTX.write(IoCTX.java:181) at com.intel.cosbench.api.librados.LibradosStorage.createObject(LibradosStorage.java:145) ... 12 more
Clearly writes are failing but it is unclear to me why this would be... is there a way to get more detail?
No signs of trouble in system.log:
2014-10-20 12:53:30,814 [INFO] [WorkloadProcessor] - begin to process workload w6 2014-10-20 12:53:30,815 [INFO] [WorkloadProcessor] - begin to run stage s1-prepare 2014-10-20 12:53:30,815 [INFO] [WorkloadProcessor] - ============================================ 2014-10-20 12:53:30,815 [INFO] [WorkloadProcessor] - START WORK: prepare 2014-10-20 12:53:30,835 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver09 is -170 mSec 2014-10-20 12:53:30,835 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver16 is -169 mSec 2014-10-20 12:53:30,835 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver15 is -168 mSec 2014-10-20 12:53:30,836 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver12 is -168 mSec 2014-10-20 12:53:30,837 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver11 is -167 mSec 2014-10-20 12:53:30,837 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver13 is -168 mSec 2014-10-20 12:53:30,837 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver10 is -167 mSec 2014-10-20 12:53:30,837 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver08 is -169 mSec 2014-10-20 12:53:30,837 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver05 is -169 mSec 2014-10-20 12:53:30,838 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver01 is -169 mSec 2014-10-20 12:53:30,838 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver04 is -168 mSec 2014-10-20 12:53:30,840 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver06 is -165 mSec 2014-10-20 12:53:30,840 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver02 is -164 mSec 2014-10-20 12:53:30,840 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver36 is -163 mSec 2014-10-20 12:53:30,840 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver07 is -164 mSec 2014-10-20 12:53:30,840 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver03 is -166 mSec 2014-10-20 12:53:30,840 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver18 is -163 mSec 2014-10-20 12:53:30,841 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver25 is -162 mSec 2014-10-20 12:53:30,841 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver30 is -162 mSec 2014-10-20 12:53:30,841 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver19 is -162 mSec 2014-10-20 12:53:30,842 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver39 is -161 mSec 2014-10-20 12:53:30,842 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver21 is -161 mSec 2014-10-20 12:53:30,843 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver40 is -160 mSec 2014-10-20 12:53:30,844 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver17 is -160 mSec 2014-10-20 12:53:30,845 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver24 is -159 mSec 2014-10-20 12:53:30,845 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver20 is -159 mSec 2014-10-20 12:53:30,845 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver45 is -158 mSec 2014-10-20 12:53:30,845 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver14 is -159 mSec 2014-10-20 12:53:30,845 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver33 is -158 mSec 2014-10-20 12:53:30,846 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver31 is -157 mSec 2014-10-20 12:53:30,846 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver47 is -157 mSec 2014-10-20 12:53:30,846 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver26 is -158 mSec 2014-10-20 12:53:30,846 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver29 is -158 mSec 2014-10-20 12:53:30,846 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver23 is -158 mSec 2014-10-20 12:53:30,846 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver27 is -158 mSec 2014-10-20 12:53:30,847 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver51 is -156 mSec 2014-10-20 12:53:30,847 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver22 is -157 mSec 2014-10-20 12:53:30,848 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver43 is -155 mSec 2014-10-20 12:53:30,848 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver49 is -155 mSec 2014-10-20 12:53:30,848 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver35 is -155 mSec 2014-10-20 12:53:30,848 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver34 is -156 mSec 2014-10-20 12:53:30,849 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver32 is -155 mSec 2014-10-20 12:53:30,849 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver53 is -154 mSec 2014-10-20 12:53:30,849 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver52 is -154 mSec 2014-10-20 12:53:30,850 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver37 is -155 mSec 2014-10-20 12:53:30,850 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver56 is -153 mSec 2014-10-20 12:53:30,850 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver55 is -153 mSec 2014-10-20 12:53:30,850 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver42 is -154 mSec 2014-10-20 12:53:30,851 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver44 is -153 mSec 2014-10-20 12:53:30,851 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver60 is -152 mSec 2014-10-20 12:53:30,851 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver61 is -152 mSec 2014-10-20 12:53:30,852 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver38 is -152 mSec 2014-10-20 12:53:30,852 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver65 is -151 mSec 2014-10-20 12:53:30,852 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver48 is -152 mSec 2014-10-20 12:53:30,852 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver46 is -152 mSec 2014-10-20 12:53:30,853 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver64 is -150 mSec 2014-10-20 12:53:30,853 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver62 is -150 mSec 2014-10-20 12:53:30,853 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver63 is -150 mSec 2014-10-20 12:53:30,854 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver50 is -150 mSec 2014-10-20 12:53:30,854 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver69 is -149 mSec 2014-10-20 12:53:30,854 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver67 is -149 mSec 2014-10-20 12:53:30,854 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver68 is -149 mSec 2014-10-20 12:53:30,855 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver28 is -148 mSec 2014-10-20 12:53:30,855 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver41 is -150 mSec 2014-10-20 12:53:30,855 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver71 is -148 mSec 2014-10-20 12:53:30,856 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver54 is -148 mSec 2014-10-20 12:53:30,856 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver57 is -148 mSec 2014-10-20 12:53:30,858 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver77 is -145 mSec 2014-10-20 12:53:30,859 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver58 is -145 mSec 2014-10-20 12:53:30,859 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver59 is -145 mSec 2014-10-20 12:53:30,861 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver87 is -143 mSec 2014-10-20 12:53:30,861 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver84 is -142 mSec 2014-10-20 12:53:30,861 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver66 is -143 mSec 2014-10-20 12:53:30,861 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver70 is -143 mSec 2014-10-20 12:53:30,861 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver85 is -142 mSec 2014-10-20 12:53:30,863 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver73 is -141 mSec 2014-10-20 12:53:30,863 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver89 is -140 mSec 2014-10-20 12:53:30,863 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver74 is -141 mSec 2014-10-20 12:53:30,863 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver72 is -141 mSec 2014-10-20 12:53:30,864 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver76 is -140 mSec 2014-10-20 12:53:30,864 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver78 is -139 mSec 2014-10-20 12:53:30,864 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver75 is -140 mSec 2014-10-20 12:53:30,865 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver80 is -139 mSec 2014-10-20 12:53:30,865 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver79 is -139 mSec 2014-10-20 12:53:30,865 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver81 is -138 mSec 2014-10-20 12:53:30,866 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver93 is -138 mSec 2014-10-20 12:53:30,866 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver96 is -137 mSec 2014-10-20 12:53:30,868 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver82 is -135 mSec 2014-10-20 12:53:30,868 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver94 is -135 mSec 2014-10-20 12:53:30,868 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver83 is -136 mSec 2014-10-20 12:53:30,868 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver92 is -136 mSec 2014-10-20 12:53:30,869 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver88 is -135 mSec 2014-10-20 12:53:30,869 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver86 is -135 mSec 2014-10-20 12:53:30,870 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver91 is -134 mSec 2014-10-20 12:53:30,870 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver90 is -134 mSec 2014-10-20 12:53:30,872 [INFO] [AbstractCommandTasklet] - time drift between controller and driver-driver95 is -132 mSec 2014-10-20 12:53:30,873 [INFO] [StageRunner] - successfully booted all tasks in stage s1-prepare 2014-10-20 12:53:30,940 [INFO] [StageRunner] - successfully submitted all tasks in stage s1-prepare 2014-10-20 12:53:31,130 [INFO] [StageRunner] - successfully authenticated all tasks in stage s1-prepare 2014-10-20 12:53:31,346 [INFO] [StageRunner] - successfully launched all tasks in stage s1-prepare 2014-10-20 13:13:12,357 [INFO] [StageRunner] - successfully queried all tasks in stage s1-prepare 2014-10-20 13:13:42,605 [INFO] [StageRunner] - successfully closed all tasks in stage s1-prepare 2014-10-20 13:13:45,277 [INFO] [WorkloadProcessor] - END WORK: prepare, Time elapsed: 0:20::14 2014-10-20 13:13:45,277 [INFO] [WorkloadProcessor] - ============================================ 2014-10-20 13:13:45,277 [INFO] [WorkloadProcessor] - 2014-10-20 13:13:45,277 [INFO] [WorkloadProcessor] - successfully ran stage s1-prepare
Here is the start of the workload file:
<?xml version="1.0" encoding="UTF-8" ?>