uzh / vm-mad

Dynamically grow or shrink GridEngine clusters using cloud-based nodes
https://arxiv.org/abs/1302.2529
Apache License 2.0

persist VM state #15

Closed · GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
VM invocations (i.e., which VMs were started on Amazon, which AppPot
jobs were submitted to GC3Pie) are not saved across `Orchestrator`
runs.

Therefore, if an `Orchestrator` instance is stopped, it will lose
control of running VMs (i.e., they have to be stopped by hand), and a
newly started `Orchestrator` will not recognize that a few VM nodes
are already available.

The code is almost ready for implementing persistence, though:

0. Add a `statusfile` parameter to the `Orchestrator` constructor.
1. Use `pickle` to save the `.vms` attribute of an `Orchestrator`
   instance to the status file.
2. Upon class initialization, re-create the dependent structures
   (`_pending_auth`, `_vms_by_nodename`) from the restored `vms`.
3. At the end of each cycle, save the state.

Original issue reported on code.google.com by riccardo.murri@gmail.com on 26 Apr 2012 at 9:58

GoogleCodeExporter commented 9 years ago
I started working on this issue. 
When I try to dump `self.vms` to a file, I get the following error:

pickle.dump(self.vms, persist_file) 
RuntimeError: dictionary changed size during iteration

This happens when the 2nd VM should start.
Am I supposed to clear the file's contents on each iteration?

Original comment by tyanko.a...@gmail.com on 16 May 2012 at 2:47

GoogleCodeExporter commented 9 years ago
| When I try to dump the self.vms to a file I get the following error:
|
| pickle.dump(self.vms, persist_file)
| RuntimeError: dictionary changed size during iteration

This is an error you normally get in Python when you loop over a
sequence (e.g., a list or a dictionary), and modify the sequence
within the body of the loop.  For instance:

    for key in a_dict:
        if key.startswith('_'):
            del a_dict[key]

(In your case, the loop is quite surely contained in `pickle.dump` so
you did not code it explicitly.)
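(For completeness, when you do write such a loop yourself, the usual fix is to iterate over a snapshot of the keys, e.g. `list(a_dict)`:)

```python
a_dict = {'_private': 1, 'public': 2, '_hidden': 3}

# list() takes a snapshot of the keys up front, so deleting
# entries from a_dict inside the loop no longer raises RuntimeError
for key in list(a_dict):
    if key.startswith('_'):
        del a_dict[key]

print(a_dict)  # prints {'public': 2}
```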

What is likely happening in the Orchestrator code is rather that
`self.vms` is being concurrently modified by some other threads
(worker threads responsible for starting/stopping VMs).

So you have two options:

1) Use locks to prevent concurrent access to `self.vms` while you are
   saving it.

2) Save a *copy* of `self.vms`; of course you need to make the copy
   operation *atomic*.  However, this is rather easy in Python, as the
   `dict()` constructor can be passed a `dict` instance and returns a
   copy of it, and the whole process happens in C code, thus
   it's atomic from the Python POV.  So this should work:

        pickle.dump(dict(self.vms), persist_file)

I'd go with 2), as it's easier to implement and shallow copies are
rather inexpensive.
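A sketch of option 2) as a standalone function (the `save_vms` name and signature are assumptions, not actual Orchestrator code):

```python
import pickle

def save_vms(vms, path):
    # dict(vms) makes a shallow copy in one C-level call, which --
    # per the reasoning above -- is atomic from the Python POV even
    # while other threads mutate `vms`
    snapshot = dict(vms)
    with open(path, 'wb') as persist_file:
        pickle.dump(snapshot, persist_file)
```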

Original comment by riccardo.murri@gmail.com on 17 May 2012 at 10:58

GoogleCodeExporter commented 9 years ago
The `pickle.dump(dict(self.vms), persist_file)` call worked fine for me and
the VMs are now persisted on each cycle. What is the best way to proceed?

- Should the name of the file be fixed, or should it vary?
- What checks should be done during start-up of the Orchestrator:
 - Does the pickle file exist?
 - If yes, just load it? Because, if I got it right, the process is quite transparent: the "old VMs" are considered normal "already running" VMs and go through the orchestration process without further bother.

Original comment by tyanko.a...@gmail.com on 5 Jun 2012 at 10:44

GoogleCodeExporter commented 9 years ago
| The `pickle.dump(dict(self.vms), persist_file)` call worked fine for me and
| the VMs are now persisted on each cycle. What is the best way to proceed?
|
| - Should the name of the file be fixed, or should it vary?

It should be settable with an `Orchestrator` constructor parameter,
with a sensible default (e.g., `orchestrator.state`).

If `None` is passed to the constructor, then VM persistence is
disabled: no file is saved or loaded.

| - What checks should be done during start-up of the Orchestrator:
|  - Does the pickle file exist?

If the file does not exist, the Orchestrator should create it.
(Otherwise we have a bootstrap problem: in order to save state into a
file, we need to have a state file already.)

|  - If yes, just load it? Because, if I got it right, the process is quite
| transparent: the "old VMs" are considered normal "already running" VMs
| and go through the orchestration process without further bother.

Not sure I understand the question, but I'd say "yes": if a state file
exists, just load it.  However, if `pickle.load()` raises an error, then
stop immediately with an error message to the user.
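These startup rules could be sketched as a small helper (the `load_state` name is hypothetical):

```python
import os
import pickle
import sys

def load_state(statusfile):
    # A missing file is fine: it will be created on the first save,
    # avoiding the bootstrap problem mentioned above.
    if not os.path.exists(statusfile):
        return {}
    try:
        with open(statusfile, 'rb') as f:
            return pickle.load(f)
    except Exception as err:
        # Corrupt or unreadable state file: stop immediately with a
        # message to the user.
        sys.exit("Cannot load VM state from '%s': %s" % (statusfile, err))
```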

Original comment by riccardo.murri@gmail.com on 5 Jun 2012 at 10:54

GoogleCodeExporter commented 9 years ago
Fixed in SVN r136.

Original comment by riccardo.murri@gmail.com on 26 Nov 2012 at 2:14
