oar-team / batsim

Batsim: Infrastructure simulator for job and I/O scheduling
GNU Lesser General Public License v3.0

[bug] Multicore: undesired behavior #62

Closed: Mema5 closed this issue 3 years ago

Mema5 commented 3 years ago

I am trying to use the multicore functionality of SimGrid platforms with Batsim, but have not managed to so far. Running a job with 2 parallel tasks on a 2-core machine takes twice as long as running a job with only 1 task. We would expect exactly the same execution time.

Description of the bug

I run this simple workload: one job with 1 parallel task and one job with 2 parallel tasks.

{
    "description": "Test multicore in batsim.",
    "nb_res": 2, 
    "jobs": [
        {"id": "0", "profile": "blast", "res": 1, "subtime": 0},
        {"id": "1", "profile": "blast", "res": 2, "subtime": 0}
    ],
    "profiles": {
        "blast": {"com": 0.0, "cpu": 6.63e14, "type": "parallel_homogeneous"}
    }
}

I have a platform with a 2-core machine:

<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid/simgrid.dtd">
<platform version="4.1">
<config id="General">
        <prop id="contexts/stack-size" value="16"></prop>
        <prop id="contexts/guard-size" value="0"></prop>
</config>

<zone id="toy_g5k"  routing="Full">
    <host id="master_host" speed="100Mf"></host>
    <host id="multicore" core="2" speed="11.77Gf"></host>
</zone>
</platform>

My scheduler is a slight modification of the sequencer scheduler from batsched. It schedules the jobs one after the other on a single machine, using a custom mapping to place all the executors of a job on the first machine. Below is the only modified function, make_decisions():

void MulticoreFiller::make_decisions(
    double date, SortableJobOrder::UpdateInformation *update_info, SortableJobOrder::CompareInformation *compare_info)
{
    // Code base taken from sequencer.cpp
    // This algorithm executes all the jobs, one after the other.
    // All executors of the jobs are scheduled in the first machine,
    // to test the multicore management.
    // At any time, either 0 or 1 job is running on the platform.
    // The order of the sequence depends on the queue order.

    // Up to one job finished since last call.
    PPK_ASSERT_ERROR(_jobs_ended_recently.size() <= 1);
    if (!_jobs_ended_recently.empty())
    {
        PPK_ASSERT_ERROR(_isJobRunning);
        _isJobRunning = false;
    }

    // Add valid jobs into the queue
    for (const std::string &job_id : _jobs_released_recently)
    {
        const Job *job = (*_workload)[job_id];

        if (true)
            // we never reject a job, we always try to schedule it on single machine
            _queue->append_job(job, update_info);
        else
            _decision->add_reject_job(job->id, date);
    }

    // Sort queue if needed
    _queue->sort_queue(update_info, compare_info);

    // Execute the first job on the first machine, thanks to custom mapping
    const Job *job = _queue->first_job_or_nullptr();
    if (job != nullptr && !_isJobRunning)
    {
        vector<int> mapping(job->nb_requested_resources, 0);
        _decision->add_execute_job(job->id, IntervalSet(first_machine), date, mapping);
        _isJobRunning = true;
        _queue->remove_job(job);
    }
}
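To make the custom mapping concrete: the vector built in make_decisions() has one entry per executor, and every entry is 0, which I read as "the first resource of the allocation". The sketch below is not batsched code. mapping_to_json is a hypothetical helper that only renders such a vector in the same shape as the "mapping" field that appears in the EXECUTE_JOB events of the log.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of batsched or Batsim): render an
// executor-to-resource mapping as a JSON object whose keys are executor
// indices and whose values are resource indices, both written as strings.
std::string mapping_to_json(const std::vector<int> &mapping)
{
    std::string json = "{";
    for (std::size_t i = 0; i < mapping.size(); ++i)
    {
        if (i > 0)
            json += ",";
        json += "\"" + std::to_string(i) + "\":\"" + std::to_string(mapping[i]) + "\"";
    }
    return json + "}";
}
```

For the 2-executor job, mapping_to_json(std::vector<int>(2, 0)) gives {"0":"0","1":"0"}, which matches the mapping sent for job w0!1.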

Behavior

The first job has an execution time of 56329.651657 s and the second job takes twice that time (112659.303314 s).

We would expect the same execution time for both jobs, since this is how multicore hosts are supposed to work in SimGrid.

Here is the full batsim log with debug verbosity activated:

+ batsim -p platforms/toy_pform_multicore.xml -w workloads/toy_wload_multicore.json -e ../out/reproduce_guyon/multicore_filler/ --forward-unknown-event -v debug
[0.000000] [batsim/INFO] Workload 'w0' corresponds to workload file '/home/mael/ownCloud/workspace/batsim/exp_batsim/src/workloads/toy_wload_multicore.json'.
[0.000000] [batsim/INFO] Batsim version: 4.0.0
[0.000000] [workload/INFO] Loading JSON workload '/home/mael/ownCloud/workspace/batsim/exp_batsim/src/workloads/toy_wload_multicore.json'...
[0.000000] [jobs/INFO] job 'w0!0' has no 'walltime' field
[0.000000] ../src/jobs.cpp:538: [jobs/DEBUG] Job 'w0!0' Loaded
[0.000000] [jobs/INFO] job 'w0!1' has no 'walltime' field
[0.000000] ../src/jobs.cpp:538: [jobs/DEBUG] Job 'w0!1' Loaded
[0.000000] [workload/INFO] JSON workload parsed sucessfully. Read 2 jobs and 1 profiles.
[0.000000] [workload/INFO] Checking workload validity...
[0.000000] [workload/INFO] Workload seems to be valid.
[0.000000] [workload/INFO] Removing unreferenced profiles from memory...
[0.000000] [xbt_cfg/INFO] Configuration change: Set 'host/model' to 'ptask_L07'
[0.000000] [batsim/INFO] Checking whether SMPI is used or not...
[0.000000] [machines/INFO] Creating the machines from platform file 'platforms/toy_pform_multicore.xml'...
[0.000000] [xbt_cfg/INFO] Configuration change: Set 'contexts/guard-size' to '0'
[0.000000] [xbt_cfg/INFO] Configuration change: Set 'contexts/stack-size' to '16'
[0.000000] [xbt_cfg/INFO] Switching to the L07 model to handle parallel tasks.
[0.000000] [machines/INFO] Looking for master host 'master_host'
[0.000000] [machines/INFO] The machines have been created successfully. There are 1 computing machines.
[0.000000] [batsim/INFO] Batsim's export prefix is '../out/reproduce_guyon/multicore_filler/'.
[0.000000] [batsim/INFO] The process 'workload_submitter_w0' has been created.
[0.000000] [batsim/INFO] The process 'server' has been created.
[master_host:workload_submitter_w0:(1) 0.000000] ../src/ipp.cpp:24: [ipp/DEBUG] message from 'workload_submitter_w0' to 'server' of type 'SUBMITTER_HELLO' with data 0x1893e00
[master_host:Scheduler REQ-REP:(3) 0.000000] ../src/network.cpp:29: [network/DEBUG] Buffer received in REQ-REP: '{"now":0.000000,"events":[{"timestamp":0.000000,"type":"SIMULATION_BEGINS","data":{"nb_resources":1,"nb_compute_resources":1,"nb_storage_resources":0,"allow_compute_sharing":false,"allow_storage_sharing":true,"config":{"redis-enabled":false,"redis-hostname":"127.0.0.1","redis-port":6379,"redis-prefix":"default","profiles-forwarded-on-submission":false,"dynamic-jobs-enabled":false,"dynamic-jobs-acknowledged":false,"profile-reuse-enabled":false,"sched-config":"","forward-unknown-events":false},"compute_resources":[{"id":0,"name":"multicore","state":"idle","properties":{"role":""},"zone_properties":{}}],"storage_resources":[],"workloads":{"w0":"/home/mael/ownCloud/workspace/batsim/exp_batsim/src/workloads/toy_wload_multicore.json"},"profiles":{"w0":{"blast":{"com":0.000000,"cpu":663000000000000.000000,"type":"parallel_homogeneous"}}}}}]}'
[master_host:Scheduler REQ-REP:(3) 0.000000] [network/INFO] Sending '{"now":0.000000,"events":[{"timestamp":0.000000,"type":"SIMULATION_BEGINS","data":{"nb_resources":1,"nb_compute_resources":1,"nb_storage_resources":0,"allow_compute_sharing":false,"allow_storage_sharing":true,"config":{"redis-enabled":false,"redis-hostname":"127.0.0.1","redis-port":6379,"redis-prefix":"default","profiles-forwarded-on-submission":false,"dynamic-jobs-enabled":false,"dynamic-jobs-acknowledged":false,"profile-reuse-enabled":false,"sched-config":"","forward-unknown-events":false},"compute_resources":[{"id":0,"name":"multicore","state":"idle","properties":{"role":""},"zone_properties":{}}],"storage_resources":[],"workloads":{"w0":"/home/mael/ownCloud/workspace/batsim/exp_batsim/src/workloads/toy_wload_multicore.json"},"profiles":{"w0":{"blast":{"com":0.000000,"cpu":663000000000000.000000,"type":"parallel_homogeneous"}}}}}]}'
[master_host:Scheduler REQ-REP:(3) 0.000000] [network/INFO] Received '{"now":0.0,"events":[]}'
[master_host:Scheduler REQ-REP:(3) 0.000000] ../src/ipp.cpp:24: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_READY' with data (nil)
[master_host:workload_submitter_w0:(1) 0.000015] ../src/ipp.cpp:37: [ipp/DEBUG] message from 'workload_submitter_w0' to 'server' of type 'SUBMITTER_HELLO' with data 0x1893e00 done
[master_host:workload_submitter_w0:(1) 0.000015] ../src/ipp.cpp:24: [ipp/DEBUG] message from 'workload_submitter_w0' to 'server' of type 'JOB_SUBMITTED' with data 0x18a43e0
[master_host:server:(2) 0.000015] ../src/server.cpp:95: [server/DEBUG] Server received a message of type SUBMITTER_HELLO:
[master_host:server:(2) 0.000015] ../src/server.cpp:191: [server/DEBUG] New Job submitter said hello. Number of polite Job submitters: 1
[master_host:Scheduler REQ-REP:(3) 0.000030] ../src/ipp.cpp:37: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_READY' with data (nil) done
[master_host:server:(2) 0.000030] ../src/server.cpp:95: [server/DEBUG] Server received a message of type SCHED_READY:
[master_host:workload_submitter_w0:(1) 0.000045] ../src/ipp.cpp:37: [ipp/DEBUG] message from 'workload_submitter_w0' to 'server' of type 'JOB_SUBMITTED' with data 0x18a43e0 done
[master_host:workload_submitter_w0:(1) 0.000045] ../src/ipp.cpp:24: [ipp/DEBUG] message from 'workload_submitter_w0' to 'server' of type 'SUBMITTER_BYE' with data 0x195a190
[master_host:server:(2) 0.000045] ../src/server.cpp:95: [server/DEBUG] Server received a message of type JOB_SUBMITTED:
[master_host:server:(2) 0.000045] ../src/server.cpp:301: [server/DEBUG] Job received: w0!0
[master_host:server:(2) 0.000045] ../src/server.cpp:303: [server/DEBUG] Workloads: w0 
[master_host:server:(2) 0.000045] [server/INFO] Job w0!0 SUBMITTED. 1 jobs submitted so far
[master_host:server:(2) 0.000045] ../src/server.cpp:301: [server/DEBUG] Job received: w0!1
[master_host:server:(2) 0.000045] ../src/server.cpp:303: [server/DEBUG] Workloads: w0 
[master_host:server:(2) 0.000045] [server/INFO] Job w0!1 SUBMITTED. 2 jobs submitted so far
[master_host:Scheduler REQ-REP:(4) 0.000045] ../src/network.cpp:29: [network/DEBUG] Buffer received in REQ-REP: '{"now":0.000045,"events":[{"timestamp":0.000045,"type":"JOB_SUBMITTED","data":{"job_id":"w0!0","job":{"id":"w0!0","profile":"blast","res":1,"subtime":0}}},{"timestamp":0.000045,"type":"JOB_SUBMITTED","data":{"job_id":"w0!1","job":{"id":"w0!1","profile":"blast","res":2,"subtime":0}}}]}'
[master_host:Scheduler REQ-REP:(4) 0.000045] [network/INFO] Sending '{"now":0.000045,"events":[{"timestamp":0.000045,"type":"JOB_SUBMITTED","data":{"job_id":"w0!0","job":{"id":"w0!0","profile":"blast","res":1,"subtime":0}}},{"timestamp":0.000045,"type":"JOB_SUBMITTED","data":{"job_id":"w0!1","job":{"id":"w0!1","profile":"blast","res":2,"subtime":0}}}]}'
[master_host:Scheduler REQ-REP:(4) 0.000045] [network/INFO] Received '{"now":0.000045,"events":[{"timestamp":0.000045,"type":"EXECUTE_JOB","data":{"job_id":"w0!0","alloc":"0","mapping":{"0":"0"}}}]}'
[master_host:Scheduler REQ-REP:(4) 0.000045] ../src/protocol.cpp:755: [protocol/DEBUG] Starting event processing (number: 0, Type: EXECUTE_JOB)
[master_host:Scheduler REQ-REP:(4) 0.000045] ../src/protocol.cpp:1125: [protocol/DEBUG] The optional field 'additional_io_job' was not found
[master_host:Scheduler REQ-REP:(4) 0.000045] ../src/ipp.cpp:24: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_EXECUTE_JOB' with data 0x18ba7f0
[master_host:workload_submitter_w0:(1) 0.000060] ../src/ipp.cpp:37: [ipp/DEBUG] message from 'workload_submitter_w0' to 'server' of type 'SUBMITTER_BYE' with data 0x195a190 done
[master_host:server:(2) 0.000060] ../src/server.cpp:95: [server/DEBUG] Server received a message of type SUBMITTER_BYE:
[master_host:server:(2) 0.000060] ../src/server.cpp:212: [server/DEBUG] Job submitter said goodbye. Number of finished Job submitters: 1
[master_host:Scheduler REQ-REP:(4) 0.000075] ../src/ipp.cpp:37: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_EXECUTE_JOB' with data 0x18ba7f0 done
[master_host:Scheduler REQ-REP:(4) 0.000075] ../src/protocol.cpp:758: [protocol/DEBUG] Finished event processing (number: 0, Type: EXECUTE_JOB)
[master_host:Scheduler REQ-REP:(4) 0.000075] ../src/ipp.cpp:24: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_READY' with data (nil)
[master_host:server:(2) 0.000075] ../src/server.cpp:95: [server/DEBUG] Server received a message of type SCHED_EXECUTE_JOB:
[multicore:job_w0!0:(5) 0.000075] ../src/jobs_execution.cpp:409: [jobs_execution/DEBUG] IO allocation: , size of the allocation: 0
[multicore:job_w0!0:(5) 0.000075] ../src/task_execution.cpp:478: [task_execution/DEBUG] Generating comm/compute matrix for task 'PARALLEL_HOMOGENEOUS_w0!0_blast' with allocation 0
[multicore:job_w0!0:(5) 0.000075] ../src/task_execution.cpp:392: [task_execution/DEBUG] Number of hosts to use: 1
[multicore:job_w0!0:(5) 0.000075] ../src/task_execution.cpp:452: [task_execution/DEBUG] enforcing permission for machine id: 0
[multicore:job_w0!0:(5) 0.000075] ../src/task_execution.cpp:457: [task_execution/DEBUG] found computation: 663000000000000
[multicore:job_w0!0:(5) 0.000075] ../src/task_execution.cpp:638: [task_execution/DEBUG] Creating parallel task 'PARALLEL_HOMOGENEOUS_w0!0_blast' on 1 resources
[multicore:job_w0!0:(5) 0.000075] ../src/task_execution.cpp:651: [task_execution/DEBUG] Executing task 'PARALLEL_HOMOGENEOUS_w0!0_blast' without walltime
[master_host:Scheduler REQ-REP:(4) 0.000090] ../src/ipp.cpp:37: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_READY' with data (nil) done
[master_host:server:(2) 0.000090] ../src/server.cpp:95: [server/DEBUG] Server received a message of type SCHED_READY:
[master_host:Scheduler REQ-REP:(6) 0.000090] ../src/network.cpp:29: [network/DEBUG] Buffer received in REQ-REP: '{"now":0.000090,"events":[{"timestamp":0.000060,"type":"NOTIFY","data":{"type":"no_more_static_job_to_submit"}}]}'
[master_host:Scheduler REQ-REP:(6) 0.000090] [network/INFO] Sending '{"now":0.000090,"events":[{"timestamp":0.000060,"type":"NOTIFY","data":{"type":"no_more_static_job_to_submit"}}]}'
[master_host:Scheduler REQ-REP:(6) 0.000090] [network/INFO] Received '{"now":0.00009,"events":[]}'
[master_host:Scheduler REQ-REP:(6) 0.000090] ../src/ipp.cpp:24: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_READY' with data (nil)
[master_host:Scheduler REQ-REP:(6) 0.000105] ../src/ipp.cpp:37: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_READY' with data (nil) done
[master_host:server:(2) 0.000105] ../src/server.cpp:95: [server/DEBUG] Server received a message of type SCHED_READY:
[multicore:job_w0!0:(5) 56329.651732] ../src/task_execution.cpp:686: [task_execution/DEBUG] Task 'PARALLEL_HOMOGENEOUS_w0!0_blast' finished in 56329.651657
[multicore:job_w0!0:(5) 56329.651732] [jobs_execution/INFO] Job 'w0!0' finished in time (success)
[multicore:job_w0!0:(5) 56329.651732] ../src/ipp.cpp:24: [ipp/DEBUG] message from 'job_w0!0' to 'server' of type 'JOB_COMPLETED' with data 0x18a5db0
[master_host:server:(2) 56329.651732] ../src/server.cpp:95: [server/DEBUG] Server received a message of type JOB_COMPLETED:
[master_host:server:(2) 56329.651732] [server/INFO] Job w0!0 has COMPLETED. 1 jobs completed so far
[multicore:job_w0!0:(5) 56329.651732] ../src/ipp.cpp:37: [ipp/DEBUG] message from 'job_w0!0' to 'server' of type 'JOB_COMPLETED' with data 0x18a5db0 done
[master_host:Scheduler REQ-REP:(7) 56329.651732] ../src/network.cpp:29: [network/DEBUG] Buffer received in REQ-REP: '{"now":56329.651732,"events":[{"timestamp":56329.651732,"type":"JOB_COMPLETED","data":{"job_id":"w0!0","job_state":"COMPLETED_SUCCESSFULLY","return_code":0,"alloc":"0"}}]}'
[master_host:Scheduler REQ-REP:(7) 56329.651732] [network/INFO] Sending '{"now":56329.651732,"events":[{"timestamp":56329.651732,"type":"JOB_COMPLETED","data":{"job_id":"w0!0","job_state":"COMPLETED_SUCCESSFULLY","return_code":0,"alloc":"0"}}]}'
[master_host:Scheduler REQ-REP:(7) 56329.651732] [network/INFO] Received '{"now":56329.651732,"events":[{"timestamp":56329.651732,"type":"EXECUTE_JOB","data":{"job_id":"w0!1","alloc":"0","mapping":{"0":"0","1":"0"}}}]}'
[master_host:Scheduler REQ-REP:(7) 56329.651732] ../src/protocol.cpp:755: [protocol/DEBUG] Starting event processing (number: 0, Type: EXECUTE_JOB)
[master_host:Scheduler REQ-REP:(7) 56329.651732] ../src/protocol.cpp:1125: [protocol/DEBUG] The optional field 'additional_io_job' was not found
[master_host:Scheduler REQ-REP:(7) 56329.651732] ../src/ipp.cpp:24: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_EXECUTE_JOB' with data 0x18a5bd0
[master_host:server:(2) 56329.651747] ../src/server.cpp:95: [server/DEBUG] Server received a message of type SCHED_EXECUTE_JOB:
[master_host:Scheduler REQ-REP:(7) 56329.651747] ../src/ipp.cpp:37: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_EXECUTE_JOB' with data 0x18a5bd0 done
[master_host:Scheduler REQ-REP:(7) 56329.651747] ../src/protocol.cpp:758: [protocol/DEBUG] Finished event processing (number: 0, Type: EXECUTE_JOB)
[master_host:Scheduler REQ-REP:(7) 56329.651747] ../src/ipp.cpp:24: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_READY' with data (nil)
[multicore:job_w0!1:(8) 56329.651747] ../src/jobs_execution.cpp:409: [jobs_execution/DEBUG] IO allocation: , size of the allocation: 0
[multicore:job_w0!1:(8) 56329.651747] ../src/task_execution.cpp:478: [task_execution/DEBUG] Generating comm/compute matrix for task 'PARALLEL_HOMOGENEOUS_w0!1_blast' with allocation 0
[multicore:job_w0!1:(8) 56329.651747] ../src/task_execution.cpp:392: [task_execution/DEBUG] Number of hosts to use: 2
[multicore:job_w0!1:(8) 56329.651747] ../src/task_execution.cpp:452: [task_execution/DEBUG] enforcing permission for machine id: 0
[multicore:job_w0!1:(8) 56329.651747] ../src/task_execution.cpp:457: [task_execution/DEBUG] found computation: 663000000000000
[multicore:job_w0!1:(8) 56329.651747] ../src/task_execution.cpp:638: [task_execution/DEBUG] Creating parallel task 'PARALLEL_HOMOGENEOUS_w0!1_blast' on 2 resources
[multicore:job_w0!1:(8) 56329.651747] ../src/task_execution.cpp:651: [task_execution/DEBUG] Executing task 'PARALLEL_HOMOGENEOUS_w0!1_blast' without walltime
[master_host:Scheduler REQ-REP:(7) 56329.651762] ../src/ipp.cpp:37: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_READY' with data (nil) done
[master_host:server:(2) 56329.651762] ../src/server.cpp:95: [server/DEBUG] Server received a message of type SCHED_READY:
[multicore:job_w0!1:(8) 168988.955061] ../src/task_execution.cpp:686: [task_execution/DEBUG] Task 'PARALLEL_HOMOGENEOUS_w0!1_blast' finished in 112659.303314
[multicore:job_w0!1:(8) 168988.955061] [jobs_execution/INFO] Job 'w0!1' finished in time (success)
[multicore:job_w0!1:(8) 168988.955061] ../src/ipp.cpp:24: [ipp/DEBUG] message from 'job_w0!1' to 'server' of type 'JOB_COMPLETED' with data 0x18a4040
[master_host:server:(2) 168988.955061] ../src/server.cpp:95: [server/DEBUG] Server received a message of type JOB_COMPLETED:
[master_host:server:(2) 168988.955061] [server/INFO] Job w0!1 has COMPLETED. 2 jobs completed so far
[multicore:job_w0!1:(8) 168988.955061] ../src/ipp.cpp:37: [ipp/DEBUG] message from 'job_w0!1' to 'server' of type 'JOB_COMPLETED' with data 0x18a4040 done
[master_host:Scheduler REQ-REP:(9) 168988.955061] ../src/network.cpp:29: [network/DEBUG] Buffer received in REQ-REP: '{"now":168988.955061,"events":[{"timestamp":168988.955061,"type":"JOB_COMPLETED","data":{"job_id":"w0!1","job_state":"COMPLETED_SUCCESSFULLY","return_code":0,"alloc":"0"}}]}'
[master_host:Scheduler REQ-REP:(9) 168988.955061] [network/INFO] Sending '{"now":168988.955061,"events":[{"timestamp":168988.955061,"type":"JOB_COMPLETED","data":{"job_id":"w0!1","job_state":"COMPLETED_SUCCESSFULLY","return_code":0,"alloc":"0"}}]}'
[master_host:Scheduler REQ-REP:(9) 168988.955061] [network/INFO] Received '{"now":168988.955061,"events":[]}'
[master_host:Scheduler REQ-REP:(9) 168988.955061] ../src/ipp.cpp:24: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_READY' with data (nil)
[master_host:server:(2) 168988.955076] ../src/server.cpp:95: [server/DEBUG] Server received a message of type SCHED_READY:
[master_host:server:(2) 168988.955076] [server/INFO] The simulation seems finished.
[master_host:Scheduler REQ-REP:(9) 168988.955076] ../src/ipp.cpp:37: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_READY' with data (nil) done
[master_host:Scheduler REQ-REP:(10) 168988.955076] ../src/network.cpp:29: [network/DEBUG] Buffer received in REQ-REP: '{"now":168988.955076,"events":[{"timestamp":168988.955076,"type":"SIMULATION_ENDS","data":{}}]}'
[master_host:Scheduler REQ-REP:(10) 168988.955076] [network/INFO] Sending '{"now":168988.955076,"events":[{"timestamp":168988.955076,"type":"SIMULATION_ENDS","data":{}}]}'
[master_host:Scheduler REQ-REP:(10) 168988.955076] [network/INFO] Received '{"now":168988.955076,"events":[]}'
[master_host:Scheduler REQ-REP:(10) 168988.955076] ../src/ipp.cpp:24: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_READY' with data (nil)
[master_host:Scheduler REQ-REP:(10) 168988.955091] ../src/ipp.cpp:37: [ipp/DEBUG] message from 'Scheduler REQ-REP' to 'server' of type 'SCHED_READY' with data (nil) done
[master_host:server:(2) 168988.955091] ../src/server.cpp:95: [server/DEBUG] Server received a message of type SCHED_READY:
[master_host:server:(2) 168988.955091] [server/INFO] Simulation is finished!
[168988.955091] [export/INFO] PajeTracer finalized
[168988.955091] [export/INFO] jobs=2, finished=2, success=2, killed=0, success_rate=1.000000
[168988.955091] [export/INFO] makespan=168988.955061, scheduling_time=0.003728, mean_waiting_time=28164.825911, mean_turnaround_time=112659.303396, mean_slowdown=1.250000, max_waiting_time=56329.651747, max_turnaround_time=168988.955061, max_slowdown=1.500000
[168988.955091] [export/INFO] mean_machines_running=168988.954970, max_machines_running=168988.954970

Versions

mpoquet commented 3 years ago

Hi! Ptask + multicore was fixed quite recently in SimGrid, and I think the latest Batsim release does not use a SimGrid version that includes the fix. Can you try with a more recent Batsim instead? You can compile it locally with the following commands:

nix build --arg doCoverage false -f https://framagit.org/batsim/batsim/-/archive/master/batsim-master.tar.gz batsim
./result/bin/batsim --simgrid-version
# should print 3.28.0

(You can also use the batsim-master package defined in NUR-Kapack if you already have a Nix setup)

Mema5 commented 3 years ago

Hi Millian and thank you! Indeed, the SimGrid version used by the latest Batsim release was 3.25.0, and the bug is fixed with the latest SimGrid release.

>batsim --version
commit e663e5e5213bfaae9c8ef432fcb1c3a4db20b30e (built by Nix from master branch)
>batsim --simgrid-version
3.28.0

By the way, it may be good to add this to the documentation: "Warning: the multicore functionality only works with SimGrid >= 3.28".
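Such a warning could also be backed by a check on the string printed by batsim --simgrid-version. This is a minimal sketch, not something Batsim provides: parse_version and supports_multicore_ptasks are hypothetical names, and the 3.28 threshold is simply the version reported as working in this thread.

```cpp
#include <sstream>
#include <string>
#include <tuple>

// Hypothetical helper: parse "major.minor.patch" into a comparable tuple.
// Components that are missing (e.g. "3.28") are left at 0.
std::tuple<int, int, int> parse_version(const std::string &version)
{
    int major = 0, minor = 0, patch = 0;
    char dot = '.';
    std::istringstream stream(version);
    stream >> major >> dot >> minor >> dot >> patch;
    return std::make_tuple(major, minor, patch);
}

// Hypothetical check: true if the SimGrid version is at least 3.28.0,
// the version for which multicore + ptask worked in this thread.
bool supports_multicore_ptasks(const std::string &simgrid_version)
{
    return parse_version(simgrid_version) >= std::make_tuple(3, 28, 0);
}
```

With this sketch, supports_multicore_ptasks("3.28.0") is true while supports_multicore_ptasks("3.25.0") is false, which matches the two versions mentioned above.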