s5z / zsim

A fast and scalable x86-64 multicore simulator
GNU General Public License v2.0

Dynamic core pinning #233

Closed sc2682cornell closed 4 years ago

sc2682cornell commented 4 years ago

Hi, I am trying to implement dynamic core management, where I can set at runtime which cores each process runs on. At this point, I have a fixed policy: whenever certain functions are entered, I migrate the entire process from one set of cores to another; i.e., I want to run function X on core set B and the rest of the program on core set A.

As there is already a field called mask for each process, I first added a new field called newmask for each process, which represents the new set of cores this process can run on (so mask will be A and newmask will be B in the example). I then added two zsim hooks at the beginning and end of function X (in the source code), which denote the start and end of this migration. Finally, I tried to add support for process migration in scheduler.h. I followed sync() in scheduler.h, which handles context switches, and my code looks like this (I have another migrateBack() function that is similar to this one):

    uint32_t migrate(uint32_t pid, uint32_t tid) {
        // migrate this thread to another set of cores
        futex_lock(&schedLock);
        uint32_t gid = getGid(pid, tid);
        assert((gidMap.find(gid) != gidMap.end()));
        ThreadInfo* th = gidMap[gid];
        ContextInfo* ctx = &contexts[th->cid];
        zinfo->cores[th->cid]->leave();
        deschedule(th, ctx, QUEUED);
        freeList.push_back(ctx);
        th->updateMask();  // update the core mask of this thread to newmask
        ctx = schedThread(th);
        if (ctx) {
            schedule(th, ctx);
            zinfo->cores[ctx->cid]->join();
            bar.join(ctx->cid, &schedLock);
            info("switched to core %u", ctx->cid);
        } else {
            runQueue.push_back(th);
            waitForContext(th);
        }
        assert(th->state == RUNNING);
        return th->cid;
    }

However, I am getting ACCESS_INVALID_ADDRESS after a thread is migrated to a new core, and the exception is thrown from insWindow.schedule(). I'm not sure what this exception means. Is there a memory leak? I am also getting a deadlock when another thread is scheduled onto the old core after the previous thread has migrated to a new core; the deadlock happens in the sync() function. I'm not sure whether the second problem is related to the first one.

I am having trouble debugging what's wrong. I would appreciate any advice that may be helpful. Thanks!

gaomy3832 commented 4 years ago

The standard way in Linux to pin threads to cores is through the syscalls sched_setaffinity/sched_getaffinity. My pull request #114 tried to virtualize these syscalls. You can just replace your hooks with these standard syscalls to migrate the threads/processes. Try it out. Let me know if you find any bugs.
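For readers unfamiliar with these syscalls, here is a minimal sketch of how an application might call them on Linux through the glibc wrappers. The helper names currentAffinity() and pinToCpus() are made up for illustration and do not come from zsim or from pull request #114; whether the call is actually virtualized depends on running under a zsim build that includes that patch.

```cpp
#include <sched.h>   // sched_setaffinity, sched_getaffinity, cpu_set_t (Linux/glibc)
#include <cassert>
#include <vector>

// Hypothetical helper: returns the CPUs the calling thread may run on
// (an empty vector if the syscall fails).
std::vector<int> currentAffinity() {
    cpu_set_t set;
    CPU_ZERO(&set);
    std::vector<int> cpus;
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return cpus;
    for (int c = 0; c < CPU_SETSIZE; ++c) {
        if (CPU_ISSET(c, &set)) cpus.push_back(c);
    }
    return cpus;
}

// Hypothetical helper: restricts the calling thread to the given CPUs.
// Returns true on success, false on an out-of-range CPU or syscall failure.
bool pinToCpus(const std::vector<int>& cpus) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c : cpus) {
        if (c < 0 || c >= CPU_SETSIZE) return false;
        CPU_SET(c, &set);
    }
    // A pid argument of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

In the scenario from the original question, function X could then be bracketed by something like pinToCpus(setB) before the call and pinToCpus(setA) after it, where setA and setB are the two core sets.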

sc2682cornell commented 4 years ago

Hi Mingyu, I checked out that pull request before. My understanding is that it basically updates the mask of each thread, but mask is only used in schedThread(), which is called only from sync() and join(), i.e., when a context switch is needed or when a thread is not RUNNING. If mask is updated while a thread T is RUNNING, and the updated mask should make T migrate to an idle core (i.e., there is no existing thread for which we can set T as its handoffthread), then none of the situations above applies, so simply updating mask will not trigger the migration. Did I miss anything?

gaomy3832 commented 4 years ago

This works as follows: when the thread calls sched_setaffinity, zsim enters the callback SyscallEnter() (https://github.com/gaomy3832/zsim/blob/5d0cb342309b1d3d7f0160b812d2c95c029f89b4/src/zsim.cpp#L888). It first executes the pre-syscall patch, which is PatchSchedSetaffinity() (https://github.com/gaomy3832/zsim/blob/5d0cb342309b1d3d7f0160b812d2c95c029f89b4/src/virt/cpu.cpp#L104). In this function, we update the masks.

Then, SyscallEnter() does a syscallLeave(). In it, we check whether the mask is still valid for the current core. If not, we force a true leave (https://github.com/gaomy3832/zsim/tree/5d0cb342309b1d3d7f0160b812d2c95c029f89b4/src/scheduler.cpp#L284). In leave(), if the mask is invalid, we deschedule the thread and transition it to the BLOCKED state (https://github.com/gaomy3832/zsim/blob/5d0cb342309b1d3d7f0160b812d2c95c029f89b4/src/scheduler.h#L375). The next time the thread calls join(), it will check the mask in schedThread() (https://github.com/gaomy3832/zsim/blob/5d0cb342309b1d3d7f0160b812d2c95c029f89b4/src/scheduler.h#L315) and go to the correct core.

Do you agree that the approach is reasonable? Or am I missing something?
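The sequence described above (update the mask, leave if the current core is no longer allowed, re-join on an allowed core) can be illustrated with a toy model. This is a deliberately simplified sketch and not zsim code: ToyScheduler, Thread, and setAffinity() are hypothetical names, there is no locking or run queue, and each mask is assumed to have exactly one entry per core.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy model only; none of these types exist in zsim.
enum class State { RUNNING, BLOCKED };

struct Thread {
    std::vector<bool> mask;        // mask[c] == true: core c is allowed
    State state = State::BLOCKED;
    int core = -1;                 // core currently occupied, -1 if none
};

class ToyScheduler {
public:
    explicit ToyScheduler(std::size_t numCores) : busy_(numCores, false) {}

    // Rough analogue of join(): take the first free core the mask allows.
    bool join(Thread& t) {
        for (std::size_t c = 0; c < busy_.size(); ++c) {
            if (t.mask[c] && !busy_[c]) {
                busy_[c] = true;
                t.core = static_cast<int>(c);
                t.state = State::RUNNING;
                return true;
            }
        }
        return false;  // a real scheduler would queue the thread instead
    }

    // Rough analogue of the syscall-time leave: give up the core only if
    // the (possibly just updated) mask no longer permits it.
    void syscallLeave(Thread& t) {
        if (t.state == State::RUNNING && !t.mask[t.core]) {
            busy_[t.core] = false;
            t.core = -1;
            t.state = State::BLOCKED;
        }
    }

private:
    std::vector<bool> busy_;       // busy_[c] == true: core c is occupied
};

// Models the whole affinity change: the pre-syscall patch updates the mask,
// then the thread leaves and, if it was blocked, joins again.
int setAffinity(ToyScheduler& s, Thread& t, std::vector<bool> newMask) {
    t.mask = std::move(newMask);
    s.syscallLeave(t);
    if (t.state == State::BLOCKED) s.join(t);
    return t.core;                 // -1 if no allowed core was free
}
```

The point of the model is that no explicit migrate step is needed: once the mask excludes the current core, the ordinary leave/join path moves the thread, which is why updating the mask inside a syscall is enough to trigger migration of a RUNNING thread.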

sc2682cornell commented 4 years ago

This makes a lot of sense! I focused too much on the changes in scheduler.h, and missed the callback. I'll try your patch out and see if there's any problem. Thanks!

sc2682cornell commented 4 years ago

It works well so far. Thanks @gaomy3832 !

sc2682cornell commented 4 years ago

@gaomy3832 Hi Mingyu, I wonder how I could change the core allocation from outside of this process, similar to using "taskset" to reset the core allocation. My understanding is that your patch currently requires calling schedThread() in the application, which means the reallocation is more or less static. Could it be extended to reset cores dynamically?

gaomy3832 commented 4 years ago

That was already supported in the original zsim release, even without my patch. See the mask option in each processN config (https://github.com/s5z/zsim/blob/fb4d6e0475a25cffd23f0687ede2d43d96b4a99f/src/process_tree.cpp#L186). However, this only works per-process, not per-thread.

sc2682cornell commented 4 years ago

@gaomy3832 But this mask has to be defined in .cfg and cannot be changed at runtime, right?

gaomy3832 commented 4 years ago

That's true.

OK. Now I understand. You want to have a separate control process that migrates the worker processes. I do not think that is supported now. You might be able to expose an additional interface as hooks that can be called by the control process. Internally, the hooks would do something similar to the sched_getaffinity/sched_setaffinity syscall patches. That should not be too difficult. Also, be careful with races.
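One possible way to limit races between a control process and the scheduler is to have the hook only record the requested mask under a lock, and let the worker's own thread pick it up at a safe point (for example, at its next syscall). The sketch below shows that idea in isolation. It is an assumption about how such an extension could be structured, not existing zsim code: PendingMasks is a hypothetical class, std::mutex stands in for whatever lock the simulator would use, and a 64-bit mask assumes at most 64 cores.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical helper, not part of zsim: a mailbox of mask requests.
class PendingMasks {
public:
    // Called from the control side: record the latest requested mask for pid.
    // A newer request simply overwrites an older one that was not yet applied.
    void request(uint32_t pid, uint64_t mask) {
        std::lock_guard<std::mutex> guard(mtx_);
        pending_[pid] = mask;
    }

    // Called from the worker side at a safe point: if a request is pending,
    // store it in *out, clear it, and return true; otherwise return false.
    bool take(uint32_t pid, uint64_t* out) {
        std::lock_guard<std::mutex> guard(mtx_);
        auto it = pending_.find(pid);
        if (it == pending_.end()) return false;
        *out = it->second;
        pending_.erase(it);
        return true;
    }

private:
    std::mutex mtx_;
    std::unordered_map<uint32_t, uint64_t> pending_;  // pid -> requested mask
};
```

With this split, the control process never touches scheduler state directly; the worker applies the new mask itself and then goes through the usual leave/join path.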

sc2682cornell commented 4 years ago

Right, that's what I thought. I just wanted to make sure I don't reinvent the wheel. Thanks!