adfaure closed 7 years ago
In the end, I think I got it wrong: this workload triggers the bug:
{
  "jobs": [
    {
      "profile": "10.0",
      "res": 3,
      "id": 0,
      "subtime": 0.0,
      "walltime": 11.0
    },
    {
      "profile": "5.0",
      "res": 1,
      "id": 1,
      "subtime": 0.1,
      "walltime": 50.0
    }
  ],
  "nb_res": 7,
  "command:": "",
  "profiles": {
    "5.0": {
      "com": 0,
      "type": "msg_par_hg",
      "cpu": 500000000.0
    },
    "10.0": {
      "com": 0,
      "type": "msg_par_hg",
      "cpu": 1000000000.0
    }
  },
  "version": 0,
  "date": "Tue, 11 Mar 2015 9:44:30 +0100",
  "description": "workload with profile file for test"
}
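Since the question here is whether shared profiles matter, a quick way to see which profiles are used by more than one job is a small script like the one below. This is only a sketch: `check_workload` and `shared_profiles` are hypothetical helper names, not part of Batsim, and the embedded dict keeps only the fields of the workload above that the checks need.

```python
# Hypothetical helpers (not part of Batsim) for inspecting a workload
# that has the same layout as the JSON above.

# Only the fields relevant to the checks are reproduced here.
WORKLOAD = {
    "jobs": [
        {"profile": "10.0", "res": 3, "id": 0, "subtime": 0.0, "walltime": 11.0},
        {"profile": "5.0", "res": 1, "id": 1, "subtime": 0.1, "walltime": 50.0},
    ],
    "nb_res": 7,
    "profiles": {
        "5.0": {"com": 0, "type": "msg_par_hg", "cpu": 500000000.0},
        "10.0": {"com": 0, "type": "msg_par_hg", "cpu": 1000000000.0},
    },
}


def check_workload(workload):
    """Return a list of human-readable problems found in the workload."""
    problems = []
    profiles = workload.get("profiles", {})
    nb_res = workload.get("nb_res", 0)
    for job in workload.get("jobs", []):
        if job["profile"] not in profiles:
            problems.append(f"job {job['id']}: unknown profile {job['profile']!r}")
        if job["res"] > nb_res:
            problems.append(
                f"job {job['id']}: requests {job['res']} resources, "
                f"only {nb_res} available"
            )
    return problems


def shared_profiles(workload):
    """Map each profile used by more than one job to the ids of those jobs."""
    usage = {}
    for job in workload.get("jobs", []):
        usage.setdefault(job["profile"], []).append(job["id"])
    return {name: ids for name, ids in usage.items() if len(ids) > 1}


print("problems:", check_workload(WORKLOAD))
print("shared profiles:", shared_profiles(WORKLOAD))
```

To run it against a real file, the dict can be replaced by `json.load(open(path))` with the path to the workload.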
It looks like there is indeed a problem when the same profiles are used. Investigating in issue32 branch.
I am not sure, because in the workload I submitted above there are two jobs which use two different profiles.
I wrote my scheduler in Rust, but if you want to test it, it should not be difficult.
Install Rust and Cargo, and this should do the trick:
mkdir rust ; cd rust
# As I am still working on it, the paths in the project description file are relative, so the projects need to be siblings.
git clone https://gitlab.inria.fr/adfaure/procset.rs
git clone https://gitlab.inria.fr/adfaure/bat-rust rustbatsim
git clone https://gitlab.inria.fr/adfaure/schedulers
cd schedulers; cargo run --bin killsched
# In another window
./batsim -p platforms/cluster512.xml -m master_host0 -w workload_profiles/stupid.json
Indeed, the problem I found was unrelated to profiles. Batsim stopped if jobs were killed as soon as they were executed. This problem should be fixed in 9c639df.
I'm off to the beach, I'll try to reproduce your bug later ;)
Thanks!
I can only clone the schedulers project :(. Can you change the configuration of the two other projects?
GitLab is quite annoying about this: setting the project to public is not enough. To check whether the public configuration is okay, I usually visit the project webpage as an anonymous user and check whether the clone URL is displayed.
I have an issue with my scheduler and the given workload.
[master_host:server:(2) 0.100000] [server/INFO] Server received a message of type SCHED_KILL_JOB:
*** Error in `batsim': malloc(): memory corruption (fast): 0x00000000022316d0 ***
======= Backtrace: =========
/usr/lib/libc.so.6(+0x722ab)[0x7f98c520c2ab]
/usr/lib/libc.so.6(+0x7890e)[0x7f98c521290e]
/usr/lib/libc.so.6(+0x7ad61)[0x7f98c5214d61]
/usr/lib/libc.so.6(__libc_malloc+0x54)[0x7f98c5216674]
/usr/lib/libgmp.so.10(__gmp_default_allocate+0x9)[0x7f98c77e3899]
/usr/lib/libgmp.so.10(__gmpq_init+0x1e)[0x7f98c77fd31e]
batsim(_ZN5boost14multiprecision8backends12gmp_rationalC2Ev+0x15)[0x547a75]
batsim(_ZN5boost14multiprecision8backends12gmp_rationalaSEe+0x18e)[0x56cdae]
batsim(_ZN5boost14multiprecision6numberINS0_8backends12gmp_rationalELNS0_26expression_template_optionE1EEC2IeEERKT_PNS_11enable_if_cIXaaaaoooosr5boost13is_arithmeticIS7_EE5valuesr7is_sameINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_EE5valuesr14is_convertibleIS7_PKcEE5valuentsr14is_convertibleINS0_6detail9canonicalIS7_S3_E4typeES3_EE5valuentsr6detail24is_restricted_conversionISM_S3_EE5valueEvE4typeE+0x3d)[0x55ff4d]
batsim(_ZN23EnergyConsumptionTracer9add_entryEdc+0x1d1)[0x55dde1]
batsim(_ZN23EnergyConsumptionTracer11add_job_endEdi+0x33)[0x55e543]
batsim(_Z14killer_processiPPc+0x85c)[0x59acec]
/usr/lib/libsimgrid.so.3.13.91(_ZNSt17_Function_handlerIFvvEN7simgrid3xbt12MainFunctionIPFiiPPcEEEE9_M_invokeERKSt9_Any_data+0x49e)[0x7f98c7b8942e]
/usr/lib/libsimgrid.so.3.13.91(_ZN7simgrid6kernel7context10RawContext7wrapperEPv+0x12)[0x7f98c7adc8c2]
======= Memory map: ========
[...]
It looks like SimGrid (SG) cleans up the data associated with killed tasks on its own. Does 2fe7739 fix the problem?
Fixed! Thank you.
Thanks a lot for reporting the issue!
Hello, I found a bug and I might have found its source, but since I am not very familiar with the design of Batsim, I am not sure whether I can fix it.
Reproducing the bug is very simple. I have a scheduler which does the following steps:
This basic algorithm fails with
Error in `./batsim': malloc(): memory corruption (fast)
if the job uses a profile that is also used by another job. Here is the full trace:
It does not crash under valgrind, but valgrind still detects the problem: