Open AkihiroSuda opened 8 years ago
Tried to reproduce ZOOKEEPER-2212 with several configs.
All the experiments were done on my local Lenovo PC (Xeon E3-1220 v3 * 4, 8 GB RAM).
EQ Config | #CPU assigned | #Exp | Reproducibility | #Pattern@1000 exp | Notes |
---|---|---|---|---|---|
None | 4 | 5,000 | 0% | 156 | Data is from FOSDEM slide. |
Ether | 4 | 1,000 | 21.8% | 573 | Ditto. With latest EQ + 1 CPU, reproducibility grew to about 50%. |
None | 1 | 1,000 | 0% | N/A | |
None + SCHED_BATCH | 1 | 1,000 | 0% | N/A | |
Proc(`mild{UseBatch:true}`) (SCHED_BATCH + random nice values) | 1 | 5,000 | 0.7% | 634 | 0.08% of experiments failed due to timeout |
Proc(`mild{UseBatch:true}`) | 4 | 5,000 | 0.32% | 548 | No experiment failed due to timeout |
Proc(`mild{UseBatch:false}`) | 1 | 5,000 | 0.26% | 914 | 90% of experiments failed due to timeout |

- `mild{UseBatch:true}` provides better reproducibility than `mild{UseBatch:false}`, but it is not as good as the Ethernet inspector.
- `mild{UseBatch:false}` provides better pattern growth, but it is not useful for ZOOKEEPER-2212 because too many experiments time out.
- `extreme` is likely to cause starvation on a single CPU, so I did not experiment with it.
- `dirichlet` hits the bug mentioned in the README.

Also tested ZOOKEEPER-2137 with the latest ZooKeeper (just 50 times on 4 CPUs):

EQ Config | #CPU assigned | #Exp | Reproducibility | #Pattern@1000 exp | Notes |
---|---|---|---|---|---|
None | 4 | 50 | 2% | N/A | - |
Proc(`mild{UseBatch:true}`) (SCHED_BATCH + random nice values) | 4 | 50 | 16% | N/A | - |
Proc(`mild{UseBatch:true}`) | 1 | 50 | 2% | N/A | - |
This reproducibility is useful enough (on 4 CPUs).
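
To put "useful enough" into numbers, here is a rough sketch that estimates how many runs are needed to hit a bug at least once with a given confidence. It assumes every run is independent and reproduces the bug with the same probability, which is a simplification. The function name `runs_needed` and the 95% default are my own choices for illustration.

```python
import math


def runs_needed(reproducibility, confidence=0.95):
    """Return the number of runs needed to see the bug at least once.

    Assumes independent runs with a fixed per-run reproducibility p, so
    that P(no hit in n runs) = (1 - p) ** n.
    """
    if not 0.0 < reproducibility < 1.0:
        raise ValueError("reproducibility must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - p) ** n <= 1 - confidence for the smallest integer n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - reproducibility))
```

Under these assumptions, a reproducibility of 16% needs 18 runs and a reproducibility of 2% needs 149 runs for a 95% chance of at least one hit.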
The process inspector works well with ZOOKEEPER-2137, although not with ZOOKEEPER-2212.
I guess this is because ZOOKEEPER-2137 runs longer (> 1 min) than ZOOKEEPER-2212, i.e., `sched_setattr()` gets many more chances to take effect.
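
For readers unfamiliar with the "SCHED_BATCH + random nice values" idea, the following is a minimal, Linux-only sketch of it. It is not the process inspector's actual code: it uses Python's `os.sched_setscheduler()` and `os.setpriority()` instead of `sched_setattr()`, and the names `pick_params` and `apply_params` are hypothetical.

```python
import os
import random


def pick_params(pids, rng, use_batch=True):
    """Pick a policy name and a random nice value (0..19) for each PID."""
    policy = "batch" if use_batch else "other"
    return {pid: (policy, rng.randint(0, 19)) for pid in pids}


def apply_params(params):
    """Apply the chosen parameters to the target processes (Linux only)."""
    policies = {"batch": os.SCHED_BATCH, "other": os.SCHED_OTHER}
    for pid, (policy, nice) in params.items():
        # SCHED_BATCH and SCHED_OTHER both require a static priority of 0.
        os.sched_setscheduler(pid, policies[policy], os.sched_param(0))
        # Unprivileged processes can generally only raise nice values,
        # so repeated calls may fail with PermissionError.
        os.setpriority(os.PRIO_PROCESS, pid, nice)
```

Calling `apply_params(pick_params(pids, random.Random()))` periodically would perturb the scheduling of the target processes; a longer-running test gives such a loop more iterations, which matches the guess above.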
I'll keep this issue open for discussion.
PTAL @mitake
Evaluated some YARN (apache/hadoop@4e4b3a8465a8433e78e015cb1ce7e0dc1ebeb523) tests using osrg/earthquake@13aa33b371fc714608061f4671a83dd18d7b25fe (`mild{UseBatch:true}`) on AWS t2.large (2 CPUs assigned).
Each test was executed 100 times with and without Earthquake.
Note that this version of Earthquake does not contain an optimization (#146).
Test | Reproducibility(without EQ) | Reproducibility(with EQ) |
---|---|---|
YARN-4548(RM/TestCapacityScheduler) | 11% | 82% |
YARN-4556(RM/TestFifoScheduler) | 2% | 44% |
YARN-4168(NM/TestLogAggregationService) | 1% | 8% |
YARN-1978(NM/TestLogAggregationService) | 0% | 4% |
YARN-4543(NM/TestNodeStatusUpdater) | 0% | 1% |
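
The measurement procedure behind a table like this can be sketched as a small harness that runs a test command repeatedly and reports the fraction of runs that reproduce the bug. This is only an illustration, not the script used for the numbers above. It assumes that a nonzero exit code means the bug was reproduced, and it counts timeouts separately; the name `measure_reproducibility` is hypothetical.

```python
import subprocess


def measure_reproducibility(cmd, runs, timeout_sec=None):
    """Run cmd `runs` times; return (reproduced_rate, timeout_rate).

    Assumption: a nonzero exit code means the bug was reproduced.
    Runs that exceed timeout_sec are counted as timeouts, not as hits.
    """
    reproduced = 0
    timed_out = 0
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_sec,
            )
        except subprocess.TimeoutExpired:
            timed_out += 1
            continue
        if result.returncode != 0:
            reproduced += 1
    return reproduced / runs, timed_out / runs
```

For example, `measure_reproducibility(["mvn", "test", "-Dtest=TestFifoScheduler"], 100)` would run one test class 100 times; the exact Maven arguments depend on the module and are shown here only as a placeholder.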
I found that it is sometimes better to apply Namazu (formerly named Earthquake) to the `stress` process rather than to the Hadoop `mvn` process.
Testcase: YARN-5043 (RM/TestAMRestart) (apache/hadoop@06413da72efed9a50e49efaf7110c220c88a7f4a) using osrg/namazu@8e4f26836c4affa15a6bb5ade57f21bd9417354e (`mild{UseBatch:true}`) on AWS t2.large (2 CPUs assigned). Run 100 times.
Stress: `stress --cpu 2`
Running stress? | Namazu applied for | Reproducibility |
---|---|---|
N | None | 16% |
Y | None | 12% |
N | mvn | 7% |
Y | stress | 30% |
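
Since each row is based on only 100 runs, it may help to attach an uncertainty range to these percentages. The sketch below computes a Wilson score interval for a reproduction count; it is a standard statistical formula, not something Namazu provides, and it assumes the runs are independent. The function name `wilson_interval` is my own.

```python
import math


def wilson_interval(reproduced, runs, z=1.96):
    """Return the (low, high) Wilson score interval for reproduced/runs.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if runs <= 0 or not 0 <= reproduced <= runs:
        raise ValueError("need runs > 0 and 0 <= reproduced <= runs")
    p = reproduced / runs
    denom = 1.0 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1.0 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to guard against floating-point rounding.
    return max(0.0, center - half), min(1.0, center + half)
```

For 30 reproductions out of 100 runs this gives an interval of roughly 22% to 40%, so differences of a few points between rows should be read with care.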
TODO:
stress
I'd like to report my experiment with etcd issue 5022: https://github.com/coreos/etcd/issues/5022
w/ or w/o Namazu process inspector | Reproducibility |
---|---|
w/o | 0% |
w/ | 2.7% |
Each of the above experiments consists of 1,000 test runs.
Parameters of the explore policy:
explorePolicy = "random"
[explorePolicyParam]
procPolicy = "dirichlet"
We need to quantitatively evaluate the process inspector as well as the Ethernet inspector (FOSDEM presentation slide)