oar-team / batsim

Batsim: Infrastructure simulator for job and I/O scheduling
GNU Lesser General Public License v3.0

Batsim/Batsched output non-determinism on big workloads #27

mpoquet closed this issue 7 years ago

mpoquet commented 7 years ago

It looks like either Batsim or Batsched is non-deterministic.

This happens at least on the KTH_SP2 workload with the opportunistic shutdown algorithm. I am investigating.

mpoquet commented 7 years ago

The non-determinism seems to come from Batsched.

Here are the results of several runs executed with the exact same parameters:

find . -name '*_schedule.csv' | sed 's/\(.*\)/cat \1/' | bash -ex | cut -d',' -f1,2
+ cat ./4b93d2c0/4b93d2c0_schedule.csv
consumed_joules,makespan
237463135243.028046,28763777.000000
+ cat ./b5971c84/b5971c84_schedule.csv
consumed_joules,makespan
237463135763.053040,28763777.000000
+ cat ./28e91144/28e91144_schedule.csv
consumed_joules,makespan
237597411934.027191,28763777.000000
+ cat ./dd11bd65/dd11bd65_schedule.csv
consumed_joules,makespan
237636065102.231934,28763777.000000
+ cat ./e8e9541f/e8e9541f_schedule.csv
consumed_joules,makespan
237597345439.027191,28763777.000000
+ cat ./e0f9b921/e0f9b921_schedule.csv
consumed_joules,makespan
237636454241.189758,28763777.000000
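
The comparison above can also be scripted. The following is a minimal sketch, assuming each run directory contains one `<run>_schedule.csv` file with a header line and a single summary row, as in the output above. The helper names `summarize_runs` and `differing_columns` are hypothetical and are not part of Batsim or Batsched. Values are compared as strings so that no floating-point rounding hides a difference.

```python
import csv
from pathlib import Path


def summarize_runs(root):
    """Map each run directory name to its (consumed_joules, makespan) strings."""
    summary = {}
    for path in sorted(Path(root).rglob("*_schedule.csv")):
        with path.open(newline="") as handle:
            # Assumption: the file holds a header and one summary row.
            row = next(csv.DictReader(handle))
        summary[path.parent.name] = (row["consumed_joules"], row["makespan"])
    return summary


def differing_columns(summary):
    """Return the names of the columns whose values differ between runs."""
    names = ("consumed_joules", "makespan")
    return [
        name
        for index, name in enumerate(names)
        if len({values[index] for values in summary.values()}) > 1
    ]


if __name__ == "__main__":
    runs = summarize_runs(".")
    for run, values in runs.items():
        print(run, *values)
    print("columns that differ:", differing_columns(runs))
```

On the six runs listed above, such a check would report that `consumed_joules` differs while `makespan` is identical in every run.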

Comparing two batsim logs:

diff --color=always 28e91144/batsim.stderr e8e9541f/batsim.stderr | head -n 16
2c2
< [0.000000] [batsim/INFO] Reading configuration file '/home/carni/notebook/os_tidle/out/results/28e91144/batsim.conf'
---
> [0.000000] [batsim/INFO] Reading configuration file '/home/carni/notebook/os_tidle/out/results/e8e9541f/batsim.conf'
11c11
< [0.000000] [batsim/INFO] Batsim's export prefix is '/home/carni/notebook/os_tidle/out/results/28e91144/28e91144'.
---
> [0.000000] [batsim/INFO] Batsim's export prefix is '/home/carni/notebook/os_tidle/out/results/e8e9541f/e8e9541f'.
104215c104215
< [master_host:Scheduler REQ-REP:(106478) 2780669.520000] [network/INFO] Received '{"now":2780669.52,"events":[{"timestamp":2780669.52,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2503","alloc":"87"}},{"timestamp":2780669.52,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2481","alloc":"80-84"}},{"timestamp":2780669.52,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2500","alloc":"85"}},{"timestamp":2780669.52,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2502","alloc":"86"}}]}'
---
> [master_host:Scheduler REQ-REP:(106478) 2780669.520000] [network/INFO] Received '{"now":2780669.52,"events":[{"timestamp":2780669.52,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2500","alloc":"85"}},{"timestamp":2780669.52,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2503","alloc":"87"}},{"timestamp":2780669.52,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2481","alloc":"80-84"}},{"timestamp":2780669.52,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2502","alloc":"86"}}]}'
104641c104641
< [master_host:Scheduler REQ-REP:(106886) 2790636.000000] [network/INFO] Received '{"now":2790636.0,"events":[{"timestamp":2790636.0,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2524","alloc":"73"}},{"timestamp":2790636.0,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2521","alloc":"70"}},{"timestamp":2790636.0,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2522","alloc":"71"}},{"timestamp":2790636.0,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2523","alloc":"72"}}]}'
---
> [master_host:Scheduler REQ-REP:(106886) 2790636.000000] [network/INFO] Received '{"now":2790636.0,"events":[{"timestamp":2790636.0,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2524","alloc":"73"}},{"timestamp":2790636.0,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2522","alloc":"71"}},{"timestamp":2790636.0,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2521","alloc":"70"}},{"timestamp":2790636.0,"type":"EXECUTE_JOB","data":{"job_id":"a513cc!2523","alloc":"72"}}]}'

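Batsched decisions look non-deterministic! :( In both messages the two runs contain the same EXECUTE_JOB events with the same allocations, but in a different order.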
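
To check whether two such log lines differ only in event order, the JSON payloads can be compared both as ordered lists and as order-insensitive collections. This is a minimal sketch, assuming the log line format shown in the diff above (a payload wrapped in `Received '...'` at the end of the line). The function names are hypothetical helpers, not part of Batsim.

```python
import json
import re

# Assumption: the JSON payload is the single-quoted text after "Received".
PAYLOAD = re.compile(r"Received '(.*)'$")


def events_of(log_line):
    """Extract the list of events from one 'Received' log line."""
    match = PAYLOAD.search(log_line.strip())
    if match is None:
        raise ValueError("no JSON payload found in log line")
    return json.loads(match.group(1))["events"]


def canonical(event):
    """Serialize an event with sorted keys so equal events compare equal."""
    return json.dumps(event, sort_keys=True)


def classify(line_a, line_b):
    """Tell whether two messages are identical, reordered, or really different."""
    events_a, events_b = events_of(line_a), events_of(line_b)
    if events_a == events_b:
        return "identical"
    if sorted(map(canonical, events_a)) == sorted(map(canonical, events_b)):
        return "same events, different order"
    return "different events"
```

Calling `classify` on the two sides of a diff hunk separates pure reordering from a real change in scheduling decisions such as a different allocation.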

mpoquet commented 7 years ago

I opened a Batsched issue: https://gitlab.inria.fr/batsim/batsched/issues/1

mpoquet commented 7 years ago

The problem came from the scheduler side, so I am closing this issue.