oar-team / batsim

Batsim: Infrastructure simulator for job and I/O scheduling
GNU Lesser General Public License v3.0

Little `*** Error in `./batsim': malloc(): memory corruption (fast): 0x000000000209d610 ***` #32

Closed adfaure closed 7 years ago

adfaure commented 7 years ago

Hello, I found a bug and I might have found its source, but since I am not very familiar with the design of Batsim, I am not sure I can fix it.

Reproducing the bug is very simple. I have a scheduler which does the following:

when a new job is submitted:
    if no job is running:
        launch the new job
    else:
        kill the running job
        launch the new job

This basic algorithm fails with "*** Error in `./batsim': malloc(): memory corruption (fast)" if the job uses a profile that is also used by another job.
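The steps above can be sketched as a small Python function. This is only an illustration, not the actual scheduler (which is written in Rust): the function name `on_job_submitted` and the "always allocate from resource 0" policy are assumptions, while the event shapes (`JOB_SUBMITTED`, `KILL_JOB`, `EXECUTE_JOB`) are copied from the messages in the trace below.

```python
import json


def on_job_submitted(event, running_job_id, now):
    """Return (reply_events, new_running_job_id) for one JOB_SUBMITTED event.

    Hypothetical helper: the name and signature are illustrative only.
    """
    job = event["data"]["job"]
    events = []
    if running_job_id is not None:
        # A job is running: kill it before launching the new one.
        events.append({"type": "KILL_JOB", "timestamp": now,
                       "data": {"job_ids": [running_job_id]}})
    # Assumed allocation policy: the first `res` resources, e.g. "0-2" for res=3.
    alloc = "0-{}".format(job["res"] - 1)
    events.append({"type": "EXECUTE_JOB", "timestamp": now,
                   "data": {"job_id": job["id"], "alloc": alloc}})
    return events, job["id"]


# The two submissions seen in the trace.
first = {"timestamp": 0.0018, "type": "JOB_SUBMITTED",
         "data": {"job_id": "d6911d!0",
                  "job": {"profile": "10.0", "res": 3, "id": "d6911d!0",
                          "subtime": 0.0, "walltime": 11.0}}}
second = {"timestamp": 0.1006, "type": "JOB_SUBMITTED",
          "data": {"job_id": "d6911d!1",
                   "job": {"profile": "5.0", "res": 1, "id": "d6911d!1",
                           "subtime": 0.1, "walltime": 50.0}}}

events, running = on_job_submitted(first, None, 0.0018)
print(json.dumps({"now": 0.0018, "events": events}))
events, running = on_job_submitted(second, running, 0.1006)
print(json.dumps({"now": 0.1006, "events": events}))
```

The second reply contains a `KILL_JOB` for `d6911d!0` followed by an `EXECUTE_JOB` for `d6911d!1`, which is the message after which the crash appears. Handling of job completion (resetting the running job) is left out of this sketch.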

Here is the full trace:

[nix-shell:~/Projects/batsim/build]$ cat log
[0.000000] [batsim/INFO] Workload 'd6911d' corresponds to workload file '/home/adfaure/Projects/batsim/build/../workload_profiles/stupid.json'.
[0.000000] [workload/INFO] Loading JSON workload '/home/adfaure/Projects/batsim/build/../workload_profiles/stupid.json'...
[0.000000] [workload/INFO] JSON workload parsed sucessfully. Read 40 jobs and 3 profiles.
[0.000000] [workload/INFO] Checking workload validity...
[0.000000] [workload/INFO] Workload seems to be valid.
[0.000000] [batsim/INFO] Checking whether SMPI is used or not...
[0.000000] [batsim/INFO] SMPI will NOT be used.
[0.000000] [xbt_cfg/INFO] Switching to the L07 model to handle parallel tasks.
[0.000000] [machines/INFO] Creating the machines from platform file '../platforms/cluster512.xml'...
[0.000000] [machines/INFO] Looking for master host 'master_host0'
[0.000000] [machines/INFO] Looking for parallel file system host 'pfs_host'
[0.000000] /home/adfaure/Projects/batsim/src/machines.cpp:234: [machines/WARNING] Could not find pfs_host 'pfs_host'!
[0.000000] [machines/INFO] The machines have been created successfully. There are 512 computing machines.
[0.000000] [batsim/INFO] Batsim's export prefix is 'out'.
[0.000000] [batsim/INFO] The process 'workload_submitter_d6911d' has been created.
[0.000000] [batsim/INFO] The process 'server' has been created.
[master_host0:workload_submitter_d6911d:(1) 0.000000] [job_submitter/INFO] Nom : d6911d
[master_host0:Scheduler REQ-REP:(3) 0.000000] [network/INFO] Sending '{"now":0.000000,"events":[{"timestamp":0.000000,"type":"SIMULATION_BEGINS","data":{"nb_resources":512,"config":{"redis":{"enabled":false,"hostname":"127.0.0.1","port":6379,"prefix":"default"},"job_submission":{"forward_profiles":false,"from_scheduler":{"enabled":false,"acknowledge":true}}}}}]}'
[master_host0:Scheduler REQ-REP:(3) 0.000000] [network/INFO] Received '{"now":0.0,"events":[]}'
[master_host0:workload_submitter_d6911d:(1) 0.000600] [job_submitter/INFO] taille vecteur : 40
[master_host0:workload_submitter_d6911d:(1) 0.000600] [job_submitter/INFO] IN STATIC JOB SUBMITTER: '{"profile":"10.0","res":3,"id":"d6911d!0","subtime":0.0,"walltime":11.0}'
[master_host0:server:(2) 0.000600] [server/INFO] Server received a message of type SUBMITTER_HELLO:
[master_host0:server:(2) 0.000600] [server/INFO] New submitter said hello. Number of polite submitters: 1
[master_host0:server:(2) 0.001200] [server/INFO] Server received a message of type SCHED_READY:
[master_host0:server:(2) 0.001800] [server/INFO] Server received a message of type JOB_SUBMITTED:
[master_host0:server:(2) 0.001800] [server/INFO] GOT JOB: d6911d 0

[master_host0:server:(2) 0.001800] [server/INFO] Job d6911d!0 SUBMITTED. 1 jobs submitted so far
[master_host0:Scheduler REQ-REP:(4) 0.001800] [network/INFO] Sending '{"now":0.001800,"events":[{"timestamp":0.001800,"type":"JOB_SUBMITTED","data":{"job_id":"d6911d!0","job":{"profile":"10.0","res":3,"id":"d6911d!0","subtime":0.000000,"walltime":11.000000}}}]}'
[master_host0:Scheduler REQ-REP:(4) 0.001800] [network/INFO] Received '{"now":0.0018,"events":[{"type":"EXECUTE_JOB","timestamp":0.0018,"data":{"job_id":"d6911d!0","alloc":"0-2"}}]}'
[master_host0:server:(2) 0.002400] [server/INFO] Server received a message of type SCHED_EXECUTE_JOB:
[a0:job_d6911d!0:(5) 0.002400] [jobs_execution/INFO] Creating task 'phg 0'10.0''
[a0:job_d6911d!0:(5) 0.002400] [jobs_execution/INFO] Executing task 'phg 0'10.0''
[master_host0:server:(2) 0.003000] [server/INFO] Server received a message of type SCHED_READY:
[master_host0:workload_submitter_d6911d:(1) 0.100000] [job_submitter/INFO] IN STATIC JOB SUBMITTER: '{"profile":"5.0","res":1,"id":"d6911d!1","subtime":0.1,"walltime":50.0}'
[master_host0:server:(2) 0.100600] [server/INFO] Server received a message of type JOB_SUBMITTED:
[master_host0:server:(2) 0.100600] [server/INFO] GOT JOB: d6911d 1

[master_host0:server:(2) 0.100600] [server/INFO] Job d6911d!1 SUBMITTED. 2 jobs submitted so far
[master_host0:Scheduler REQ-REP:(6) 0.100600] [network/INFO] Sending '{"now":0.100600,"events":[{"timestamp":0.100600,"type":"JOB_SUBMITTED","data":{"job_id":"d6911d!1","job":{"profile":"5.0","res":1,"id":"d6911d!1","subtime":0.100000,"walltime":50.000000}}}]}'
[master_host0:Scheduler REQ-REP:(6) 0.100600] [network/INFO] Received '{"now":0.1006,"events":[{"type":"KILL_JOB","timestamp":0.1006,"data":{"job_ids":["d6911d!0"]}},{"type":"EXECUTE_JOB","timestamp":0.1006,"data":{"job_id":"d6911d!1","alloc":"0-0"}}]}'
[master_host0:server:(2) 0.101200] [server/INFO] Server received a message of type SCHED_KILL_JOB:
*** Error in `./batsim': malloc(): memory corruption (fast): 0x00000000026d8610 ***
======= Backtrace: =========
/nix/store/63gvnrj4z154kpyjpskl6s0hwmyx9x0w-glibc-2.25/lib/libc.so.6(+0x711b6)[0x7fd6547701b6]
/nix/store/63gvnrj4z154kpyjpskl6s0hwmyx9x0w-glibc-2.25/lib/libc.so.6(+0x77596)[0x7fd654776596]
/nix/store/63gvnrj4z154kpyjpskl6s0hwmyx9x0w-glibc-2.25/lib/libc.so.6(+0x79974)[0x7fd654778974]
/nix/store/63gvnrj4z154kpyjpskl6s0hwmyx9x0w-glibc-2.25/lib/libc.so.6(__libc_malloc+0x54)[0x7fd65477a314]
/nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91(xbt_dynar_three_way_partition+0x37)[0x7fd65717b397]
/nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91(+0x80d3d)[0x7fd656ff0d3d]
/nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91(SIMIX_run+0x405)[0x7fd656ff1c95]
/nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91(MSG_main+0x27)[0x7fd65712f287]
./batsim(main+0x49a)[0x43d80a]
/nix/store/63gvnrj4z154kpyjpskl6s0hwmyx9x0w-glibc-2.25/lib/libc.so.6(__libc_start_main+0xf0)[0x7fd65471f530]
./batsim(_start+0x2a)[0x43e87a]
======= Memory map: ========
00400000-00532000 r-xp 00000000 08:12 29233771                           /home/adfaure/Projects/batsim/build/batsim
00732000-00735000 r--p 00132000 08:12 29233771                           /home/adfaure/Projects/batsim/build/batsim
00735000-00736000 rw-p 00135000 08:12 29233771                           /home/adfaure/Projects/batsim/build/batsim
00736000-00737000 rw-p 00000000 00:00 0
02485000-02832000 rw-p 00000000 00:00 0                                  [heap]
7fd640000000-7fd640021000 rw-p 00000000 00:00 0
7fd640021000-7fd644000000 ---p 00000000 00:00 0

It does not crash under valgrind, but valgrind still detects the error:

[nix-shell:~/Projects/batsim/build]$ valgrind ./batsim -p ../platforms/cluster512.xml -m master_host0   -w ../workload_profiles/stupid.json
==12409== Memcheck, a memory error detector
==12409== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==12409== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==12409== Command: ./batsim -p ../platforms/cluster512.xml -m master_host0 -w ../workload_profiles/stupid.json
==12409==
[0.000000] [batsim/INFO] Workload 'd6911d' corresponds to workload file '/home/adfaure/Projects/batsim/build/../workload_profiles/stupid.json'.
[0.000000] [workload/INFO] Loading JSON workload '/home/adfaure/Projects/batsim/build/../workload_profiles/stupid.json'...
[0.000000] [workload/INFO] JSON workload parsed sucessfully. Read 40 jobs and 3 profiles.
[0.000000] [workload/INFO] Checking workload validity...
[0.000000] [workload/INFO] Workload seems to be valid.
[0.000000] [batsim/INFO] Checking whether SMPI is used or not...
[0.000000] [batsim/INFO] SMPI will NOT be used.
[0.000000] [xbt_cfg/INFO] Switching to the L07 model to handle parallel tasks.
[0.000000] [machines/INFO] Creating the machines from platform file '../platforms/cluster512.xml'...
[0.000000] [machines/INFO] Looking for master host 'master_host0'
[0.000000] [machines/INFO] Looking for parallel file system host 'pfs_host'
[0.000000] /home/adfaure/Projects/batsim/src/machines.cpp:234: [machines/WARNING] Could not find pfs_host 'pfs_host'!
[0.000000] [machines/INFO] The machines have been created successfully. There are 512 computing machines.
[0.000000] [batsim/INFO] Batsim's export prefix is 'out'.
[0.000000] [batsim/INFO] The process 'workload_submitter_d6911d' has been created.
[0.000000] [batsim/INFO] The process 'server' has been created.
==12409== Warning: client switching stacks?  SP change: 0xffeff59c8 --> 0xe5b7f90
==12409==          to suppress, use: --max-stackframe=68461779512 or greater
[master_host0:workload_submitter_d6911d:(1) 0.000000] [job_submitter/INFO] Nom : d6911d
==12409== Warning: client switching stacks?  SP change: 0xe5b7748 --> 0xedbaf90
==12409==          to suppress, use: --max-stackframe=8403016 or greater
==12409== Warning: client switching stacks?  SP change: 0xedba3a8 --> 0xffeff59c8
==12409==          to suppress, use: --max-stackframe=68453381664 or greater
==12409==          further instances of this message will not be shown.
[master_host0:Scheduler REQ-REP:(3) 0.000000] [network/INFO] Sending '{"now":0.000000,"events":[{"timestamp":0.000000,"type":"SIMULATION_BEGINS","data":{"nb_resources":512,"config":{"redis":{"enabled":false,"hostname":"127.0.0.1","port":6379,"prefix":"default"},"job_submission":{"forward_profiles":false,"from_scheduler":{"enabled":false,"acknowledge":true}}}}}]}'
[master_host0:Scheduler REQ-REP:(3) 0.000000] [network/INFO] Received '{"now":0.0,"events":[]}'
[master_host0:workload_submitter_d6911d:(1) 0.000600] [job_submitter/INFO] taille vecteur : 40
[master_host0:workload_submitter_d6911d:(1) 0.000600] [job_submitter/INFO] IN STATIC JOB SUBMITTER: '{"profile":"10.0","res":3,"id":"d6911d!0","subtime":0.0,"walltime":11.0}'
[master_host0:server:(2) 0.000600] [server/INFO] Server received a message of type SUBMITTER_HELLO:
[master_host0:server:(2) 0.000600] [server/INFO] New submitter said hello. Number of polite submitters: 1
[master_host0:server:(2) 0.001200] [server/INFO] Server received a message of type SCHED_READY:
[master_host0:server:(2) 0.001800] [server/INFO] Server received a message of type JOB_SUBMITTED:
[master_host0:server:(2) 0.001800] [server/INFO] GOT JOB: d6911d 0

[master_host0:server:(2) 0.001800] [server/INFO] Job d6911d!0 SUBMITTED. 1 jobs submitted so far
[master_host0:Scheduler REQ-REP:(4) 0.001800] [network/INFO] Sending '{"now":0.001800,"events":[{"timestamp":0.001800,"type":"JOB_SUBMITTED","data":{"job_id":"d6911d!0","job":{"profile":"10.0","res":3,"id":"d6911d!0","subtime":0.000000,"walltime":11.000000}}}]}'
[master_host0:Scheduler REQ-REP:(4) 0.001800] [network/INFO] Received '{"now":0.0018,"events":[{"type":"EXECUTE_JOB","timestamp":0.0018,"data":{"job_id":"d6911d!0","alloc":"0-2"}}]}'
[master_host0:server:(2) 0.002400] [server/INFO] Server received a message of type SCHED_EXECUTE_JOB:
[a0:job_d6911d!0:(5) 0.002400] [jobs_execution/INFO] Creating task 'phg 0'10.0''
[a0:job_d6911d!0:(5) 0.002400] [jobs_execution/INFO] Executing task 'phg 0'10.0''
[master_host0:server:(2) 0.003000] [server/INFO] Server received a message of type SCHED_READY:
[master_host0:workload_submitter_d6911d:(1) 0.100000] [job_submitter/INFO] IN STATIC JOB SUBMITTER: '{"profile":"5.0","res":1,"id":"d6911d!1","subtime":0.1,"walltime":50.0}'
[master_host0:server:(2) 0.100600] [server/INFO] Server received a message of type JOB_SUBMITTED:
[master_host0:server:(2) 0.100600] [server/INFO] GOT JOB: d6911d 1

[master_host0:server:(2) 0.100600] [server/INFO] Job d6911d!1 SUBMITTED. 2 jobs submitted so far
[master_host0:Scheduler REQ-REP:(6) 0.100600] [network/INFO] Sending '{"now":0.100600,"events":[{"timestamp":0.100600,"type":"JOB_SUBMITTED","data":{"job_id":"d6911d!1","job":{"profile":"5.0","res":1,"id":"d6911d!1","subtime":0.100000,"walltime":50.000000}}}]}'
[master_host0:Scheduler REQ-REP:(6) 0.100600] [network/INFO] Received '{"now":0.1006,"events":[{"type":"KILL_JOB","timestamp":0.1006,"data":{"job_ids":["d6911d!0"]}},{"type":"EXECUTE_JOB","timestamp":0.1006,"data":{"job_id":"d6911d!1","alloc":"0-0"}}]}'
[master_host0:server:(2) 0.101200] [server/INFO] Server received a message of type SCHED_KILL_JOB:
==12409== Invalid free() / delete / delete[] / realloc()
==12409==    at 0x4C2BDEB: free (in /nix/store/pqamax9k1vix5mg82j470ppfbilqjyia-valgrind-3.12.0/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12409==    by 0x472E44: execute_profile_cleanup(void*, void*) (jobs_execution.cpp:508)
==12409==    by 0x4EC3F94: SIMIX_process_on_exit_runall (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==    by 0x4EC8B96: SIMIX_process_yield (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==    by 0x4EDD882: simcall_execution_wait (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==    by 0x4FF4D6C: MSG_parallel_task_execute_with_timeout (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==    by 0x473FA0: execute_profile(BatsimContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, SchedulingAllocation const*, CleanExecuteProfileData*, double*) (jobs_execution.cpp:92)
==12409==    by 0x4761CB: execute_job_process(int, char**) (jobs_execution.cpp:411)
==12409==    by 0x4ECE448: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==    by 0x4EBC511: simgrid::kernel::context::RawContext::wrapper(void*) (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==  Address 0xfdc2a60 is 0 bytes inside a block of size 24 free'd
==12409==    at 0x4C2BDEB: free (in /nix/store/pqamax9k1vix5mg82j470ppfbilqjyia-valgrind-3.12.0/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12409==    by 0x5022D60: simgrid::surf::L07Action::unref() (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==    by 0x4EBFEE0: simgrid::kernel::activity::Exec::~Exec() (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==    by 0x4EBFF08: simgrid::kernel::activity::Exec::~Exec() (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==    by 0x4EC1595: simgrid::kernel::activity::ActivityImpl::unref() (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==    by 0x4EC8386: SIMIX_process_kill (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==    by 0x4EC98DF: SIMIX_simcall_handle (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==    by 0x4EB81ED: SIMIX_run.part.76 (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==    by 0x4EB8C94: SIMIX_run (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==    by 0x4FF6286: MSG_main (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==    by 0x43D809: main (batsim.cpp:643)
==12409==  Block was alloc'd at
==12409==    at 0x4C2ABBF: malloc (in /nix/store/pqamax9k1vix5mg82j470ppfbilqjyia-valgrind-3.12.0/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12409==    by 0x473B46: xbt_malloc (sysdep.h:85)
==12409==    by 0x473B46: execute_profile(BatsimContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, SchedulingAllocation const*, CleanExecuteProfileData*, double*) (jobs_execution.cpp:56)
==12409==    by 0x4761CB: execute_job_process(int, char**) (jobs_execution.cpp:411)
==12409==    by 0x4ECE448: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==    by 0x4EBC511: simgrid::kernel::context::RawContext::wrapper(void*) (in /nix/store/x10hgmzbsd4hwjd7qa3vycg1d8amzdsq-simgrid-batsim/lib/libsimgrid.so.3.13.91)
==12409==
[master_host0:server:(2) 0.101800] [server/INFO] Server received a message of type SCHED_EXECUTE_JOB:
[a0:job_d6911d!1:(8) 0.101800] [jobs_execution/INFO] Creating task 'phg 1'5.0''

adfaure commented 7 years ago

In the end, I think I was wrong about the cause: this workload triggers the bug:

{
  "jobs": [
    {
      "profile": "10.0",
      "res": 3,
      "id": 0,
      "subtime": 0.0,
      "walltime": 11.0
    },
    {
      "profile": "5.0",
      "res": 1,
      "id": 1,
      "subtime": 0.1,
      "walltime": 50.0
    }
  ],
  "nb_res": 7,
  "command:": "",
  "profiles": {
    "5.0": {
      "com": 0,
      "type": "msg_par_hg",
      "cpu": 500000000.0
    },
    "10.0": {
      "com": 0,
      "type": "msg_par_hg",
      "cpu": 1000000000.0
    }
  },
  "version": 0,
  "date": "Tue, 11 Mar 2015 9:44:30 +0100",
  "description": "workload with profile file for test"
}

mpoquet commented 7 years ago

It looks like there is indeed a problem when the same profiles are used. Investigating in the issue32 branch.

adfaure commented 7 years ago

I am not sure, because in the workload I posted above there are two jobs that use two different profiles.
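A quick way to check that claim is to count how many jobs reference each profile. The sketch below embeds only the `jobs` and `profiles` parts of the workload as posted; the helper names `jobs_per_profile` and `undefined_profiles` are made up for illustration. Note that the logged run reports 40 jobs and 3 profiles, so the file on disk may differ from the posted excerpt.

```python
import json
from collections import Counter

# Jobs and profiles copied from the workload posted in this thread.
WORKLOAD = """
{
  "jobs": [
    {"profile": "10.0", "res": 3, "id": 0, "subtime": 0.0, "walltime": 11.0},
    {"profile": "5.0", "res": 1, "id": 1, "subtime": 0.1, "walltime": 50.0}
  ],
  "profiles": {
    "5.0": {"com": 0, "type": "msg_par_hg", "cpu": 500000000.0},
    "10.0": {"com": 0, "type": "msg_par_hg", "cpu": 1000000000.0}
  }
}
"""


def jobs_per_profile(workload):
    """Count how many jobs reference each profile name."""
    return Counter(job["profile"] for job in workload["jobs"])


def undefined_profiles(workload):
    """List profile names used by jobs but missing from the 'profiles' section."""
    used = {job["profile"] for job in workload["jobs"]}
    return sorted(used - set(workload["profiles"]))


workload = json.loads(WORKLOAD)
counts = jobs_per_profile(workload)
shared = {name: n for name, n in counts.items() if n > 1}

print("jobs per profile:", dict(counts))
print("profiles shared by several jobs:", shared)
print("undefined profiles:", undefined_profiles(workload))
```

For the posted workload, no profile is shared between jobs, which supports the point that profile sharing is not needed to trigger the crash.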

I wrote my scheduler in Rust, but if you want to test it, it should not be difficult.

Install Rust and Cargo, then the following should do the trick:

mkdir rust ; cd rust

# As I am still working on it, the paths in the project description file are relative, so the projects need to be siblings.
git clone https://gitlab.inria.fr/adfaure/procset.rs
git clone https://gitlab.inria.fr/adfaure/bat-rust rustbatsim
git clone https://gitlab.inria.fr/adfaure/schedulers

cd schedulers; cargo run --bin killsched
# In another window
./batsim -p platforms/cluster512.xml -m master_host0   -w workload_profiles/stupid.json

mpoquet commented 7 years ago

Indeed, the problem I found was unrelated to profiles. Batsim stopped if jobs were killed as soon as they were executed. This problem should be fixed in 9c639df.

I'm off to the beach, I'll try to reproduce your bug later ;)

adfaure commented 7 years ago

Thanks!

mpoquet commented 7 years ago

I can only clone the schedulers project :(. Can you change the configuration of the two other projects?

GitLab is quite annoying about this: setting the project as public is not enough on its own. To check whether the public configuration is okay, I usually visit the project webpage as an anonymous user and check whether the clone URL is displayed.

mpoquet commented 7 years ago

I have an issue with my scheduler and the given workload.

[master_host:server:(2) 0.100000] [server/INFO] Server received a message of type SCHED_KILL_JOB:
*** Error in `batsim': malloc(): memory corruption (fast): 0x00000000022316d0 ***
======= Backtrace: =========
/usr/lib/libc.so.6(+0x722ab)[0x7f98c520c2ab]
/usr/lib/libc.so.6(+0x7890e)[0x7f98c521290e]
/usr/lib/libc.so.6(+0x7ad61)[0x7f98c5214d61]
/usr/lib/libc.so.6(__libc_malloc+0x54)[0x7f98c5216674]
/usr/lib/libgmp.so.10(__gmp_default_allocate+0x9)[0x7f98c77e3899]
/usr/lib/libgmp.so.10(__gmpq_init+0x1e)[0x7f98c77fd31e]
batsim(_ZN5boost14multiprecision8backends12gmp_rationalC2Ev+0x15)[0x547a75]
batsim(_ZN5boost14multiprecision8backends12gmp_rationalaSEe+0x18e)[0x56cdae]
batsim(_ZN5boost14multiprecision6numberINS0_8backends12gmp_rationalELNS0_26expression_template_optionE1EEC2IeEERKT_PNS_11enable_if_cIXaaaaoooosr5boost13is_arithmeticIS7_EE5valuesr7is_sameINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_EE5valuesr14is_convertibleIS7_PKcEE5valuentsr14is_convertibleINS0_6detail9canonicalIS7_S3_E4typeES3_EE5valuentsr6detail24is_restricted_conversionISM_S3_EE5valueEvE4typeE+0x3d)[0x55ff4d]
batsim(_ZN23EnergyConsumptionTracer9add_entryEdc+0x1d1)[0x55dde1]
batsim(_ZN23EnergyConsumptionTracer11add_job_endEdi+0x33)[0x55e543]
batsim(_Z14killer_processiPPc+0x85c)[0x59acec]
/usr/lib/libsimgrid.so.3.13.91(_ZNSt17_Function_handlerIFvvEN7simgrid3xbt12MainFunctionIPFiiPPcEEEE9_M_invokeERKSt9_Any_data+0x49e)[0x7f98c7b8942e]
/usr/lib/libsimgrid.so.3.13.91(_ZN7simgrid6kernel7context10RawContext7wrapperEPv+0x12)[0x7f98c7adc8c2]
======= Memory map: ========
[...]

mpoquet commented 7 years ago

It looks like SimGrid (SG) frees the data associated with killed tasks on its own. Does 2fe7739 fix the problem?
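The valgrind report above shows the same block being freed twice: first inside SimGrid while the process is killed, then again by `execute_profile_cleanup` when the on-exit callbacks run. The toy Python model below reproduces only that ordering. Everything in it (`ToyHeap`, `kill_process`, the `handed_over` flag, both cleanup functions) is hypothetical and says nothing about what commit 2fe7739 actually changes; it just illustrates why a cleanup callback must not free buffers whose ownership has passed to the simulator.

```python
class ToyHeap:
    """Tracks live blocks and rejects a second free of the same block."""

    def __init__(self):
        self._next = 0
        self.live = set()

    def malloc(self):
        self._next += 1
        self.live.add(self._next)
        return self._next

    def free(self, block):
        if block not in self.live:
            raise RuntimeError("invalid free of block {}".format(block))
        self.live.remove(block)


def kill_process(heap, data, on_exit):
    """Toy stand-in for the simulator killing a process."""
    if data["handed_over"]:
        # The simulator releases the buffers it was given...
        for block in data["blocks"]:
            heap.free(block)
    # ...and only then runs the callback registered for process exit.
    on_exit(heap, data)


def buggy_cleanup(heap, data):
    # Frees unconditionally, even if the simulator already did.
    for block in data["blocks"]:
        heap.free(block)


def guarded_cleanup(heap, data):
    # Frees only the buffers that were never handed over to the simulator.
    if not data["handed_over"]:
        for block in data["blocks"]:
            heap.free(block)


def run(cleanup, handed_over=True):
    heap = ToyHeap()
    data = {"blocks": [heap.malloc()], "handed_over": handed_over}
    try:
        kill_process(heap, data, cleanup)
    except RuntimeError as err:
        return "error: {}".format(err)
    return "ok, {} block(s) still live".format(len(heap.live))


print(run(buggy_cleanup))                       # second free is rejected
print(run(guarded_cleanup))                     # simulator freed the block once
print(run(guarded_cleanup, handed_over=False))  # cleanup freed the block once
```

In the real program the second free is not rejected cleanly: it corrupts the allocator state, and a later, unrelated `malloc()` is where the crash surfaces, which matches the two different backtraces in this thread.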

adfaure commented 7 years ago

Fixed! Thank you.

mpoquet commented 7 years ago

Thanks a lot for reporting the issue!