Open philipstarkey opened 7 years ago
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).
What resolution are your images? You might find that 4000 images take up more memory than your computers have.
An image cache is not a bad idea, but it would only be of use if the images fitted in memory; otherwise they'd be swapping out to disk anyway. Furthermore, if you're using the same images repeatedly, your operating system is likely already caching the files to the best of its ability, and if things are still very slow, that's evidence that perhaps they don't all fit in RAM.
Lyse does have a place you can put things to be kept from one run of your analysis to the next, but if you remove and re-add the analysis routine it is not kept. But, as a test, you could try caching your images in the analysis subprocess with something like this:
#!python
import lyse

def get_image(filepath):
    # Make a cache if this is the first run:
    if not hasattr(lyse.routine_storage, 'images'):
        lyse.routine_storage.images = {}
    if filepath in lyse.routine_storage.images:
        # Get the image from the cache if it's there:
        return lyse.routine_storage.images[filepath]
    else:
        # Get it the usual way if it's not:
        image = get_the_image_the_usual_way(filepath)
        # Put it in the cache for next time:
        lyse.routine_storage.images[filepath] = image
        return image
If that speeds things up, then looking further into providing a persistent, cross-process cache like the dataframe might be useful. Otherwise, perhaps not.
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).
They are 512x512 pixels or less so memory shouldn't be a problem.
This minimal solution already came to mind, but it would keep images in memory even after they were deleted from the lyse FileBox. The cache could get really full from one measurement to the next if it is never emptied. And each script would need its own cache, causing even more memory usage, if I'm not mistaken. Also, the cache would only be populated once the script runs, so there would be no speed improvement if a script is added after all files were loaded, or only runs once. The dataframe-like solution seems superior to me. I also know that this can quickly become a memory issue and is not something everybody needs, but maybe it could become an optional feature in the future that can be switched on and off?
I will definitely give your solution a try though, and maybe add some logic to remove shots that are no longer in the current dataframe, as sketched below.
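A minimal sketch of that pruning logic, assuming the cache dict from the example above and that the dataframe's 'filepath' column lists the shots currently loaded:

#!python
import lyse

def prune_image_cache():
    # Nothing to prune if the cache hasn't been created yet:
    if not hasattr(lyse.routine_storage, 'images'):
        return
    # Shots currently in the lyse dataframe:
    current_shots = set(lyse.data()['filepath'])
    # Drop cached images for shots that have been removed:
    for filepath in list(lyse.routine_storage.images):
        if filepath not in current_shots:
            del lyse.routine_storage.images[filepath]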
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).
I think I'd broadly support a feature like that - or a way to cache datasets generally, regardless of what they are. They can't easily go in the dataframe, as the dataframe is serialised and deserialised repeatedly, but we could, without too much difficulty, make a cache that sends the images in a binary format - I've sent numpy arrays over ZeroMQ sockets (which we use) before, and that works well.
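For illustration, a rough sketch of what sending a numpy array in binary over a ZeroMQ socket can look like, using plain pyzmq (the function names here are just placeholders, not an existing lyse API):

#!python
import numpy as np
import zmq

def send_array(socket, arr):
    # Send dtype/shape metadata first, then the raw buffer:
    socket.send_json({'dtype': str(arr.dtype), 'shape': arr.shape}, zmq.SNDMORE)
    socket.send(arr.tobytes())

def recv_array(socket):
    # Receive metadata, then reconstruct the array from the raw bytes:
    metadata = socket.recv_json()
    data = socket.recv()
    return np.frombuffer(data, dtype=metadata['dtype']).reshape(metadata['shape'])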
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).
That's why I said dataframe-like - I wanted something that behaves the way the dataframe currently does.
It's not really an urgent thing, just something I wanted to throw out there for discussion. It has the potential to speed things up quite a bit when repeatedly using images or other data that's not in the dataframe.
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).
Yep, fair enough!
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).
OK, so I've been playing around with this and here is what I came up with so far:
#!python
from zprocess import ZMQServer

class StorageServer(ZMQServer):
    def __init__(self, *args, **kwargs):
        super(StorageServer, self).__init__(*args, **kwargs)
        self.storage = {}

    def handler(self, request_data):
        command, keys, data = request_data
        # Walk down the nested dicts, creating levels as needed:
        storage = self.storage
        if len(keys) > 1:
            for key in keys[:-1]:
                if key not in storage:
                    storage[key] = {}
                storage = storage[key]
        if command == "get":
            try:
                return storage[keys[-1]]
            except KeyError:
                return None
            except IndexError:
                # An empty key list returns the whole (sub)storage:
                return storage
        elif command == "set":
            storage[keys[-1]] = data
            return True
        elif command == "remove":
            try:
                del storage[keys[-1]]
            except (KeyError, IndexError):
                return False
            return True
        elif command == "clear":
            self.storage = {}
This is a generic StorageServer that stores data in nested dicts. It allows saving, getting, and deleting data, as well as clearing the whole storage. The server expects messages of the form [command, keys, data], where command is one of get, set, remove, or clear; keys is a list of keys addressing a location in the nested dicts; and data is the data to be stored (None for the other commands).
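To illustrate the protocol, here is what client calls against this server could look like, assuming it listens on port 9999 and using zmq_get from zprocess as elsewhere in lyse (the keys here are placeholders):

#!python
import numpy as np
from zprocess import zmq_get

port = 9999
img = np.zeros((512, 512))

# Store an image under a nested key, then retrieve and delete it:
zmq_get(port, 'localhost', ['set', ['images', 'top', 'absorption'], img])
cached = zmq_get(port, 'localhost', ['get', ['images', 'top', 'absorption'], None])
zmq_get(port, 'localhost', ['remove', ['images', 'top', 'absorption'], None])

# Wipe the whole storage:
zmq_get(port, 'localhost', ['clear', [], None])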
For a proof of concept I modified Run.get_image:
#!python
def get_image(self, orientation, label, image):
    storage_port = 9999
    # Try the storage server first:
    img = zmq_get(storage_port, 'localhost',
                  ['get', ['images', orientation, label, image, self.h5_path], None])
    if img is None:
        # Cache miss: read the image from the shot file as usual...
        with h5py.File(self.h5_path) as h5_file:
            if 'images' not in h5_file:
                raise Exception('File does not contain any images')
            if orientation not in h5_file['images']:
                raise Exception('File does not contain any images with orientation \'%s\'' % orientation)
            if label not in h5_file['images'][orientation]:
                raise Exception('File does not contain any images with label \'%s\'' % label)
            if image not in h5_file['images'][orientation][label]:
                raise Exception('Image \'%s\' not found in file' % image)
            img = array(h5_file['images'][orientation][label][image])
        # ...and store it in the storage server for next time:
        zmq_get(storage_port, 'localhost',
                ['set', ['images', orientation, label, image, self.h5_path], img])
    return img
and also implemented a get_images function for multishot routines:
#!python
def get_images(host='localhost', timeout=5):
    port = 9999
    # Return the whole nested dict of cached images:
    return zmq_get(port, host, ['get', ['images'], None], timeout)
Singleshot routines are not really affected when it comes to speed, as they usually just run once. But using get_images instead of something like {run.get_image(...) for path, run in seq.runs.items()} gives a great speed increase (if the storage has already been filled); see the sketch below.
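For concreteness, a hedged sketch of the comparison in a multishot routine (the orientation/label/image names are placeholders):

#!python
from lyse import Sequence, data, path

df = data()
seq = Sequence(path, df)

# The slow way: one HDF5 read per shot:
images = {p: run.get_image('top', 'absorption', 'frame')
          for p, run in seq.runs.items()}

# With the storage server filled, a single request returns the whole
# nested cache ({orientation: {label: {image: {h5_path: img}}}}):
all_images = get_images()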
So we still have to solve the problem of filling the storage upon loading a shot, in a way that keeps this feature optional for everyone who doesn't need it (so their memory doesn't suffer). Any ideas how this could be done? The server could also be used for other cross-analysis-routine storage, so we might want to think about some sort of API to make this available.
I'm also open to the idea of running a singleshot script that does nothing but add the images to the cross-routine storage, as this is a simple fix for our problem and doesn't bloat everyone's memory (see the sketch below). But nonetheless we would need an API for the storage.
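Such a cache-filling singleshot script could be as simple as this sketch, relying on the modified get_image above storing the image in the StorageServer as a side effect (the names are placeholders):

#!python
from lyse import Run, path

run = Run(path)
# Loading the image is enough: the modified get_image caches it
# in the storage server for later multishot routines to use.
run.get_image('top', 'absorption', 'frame')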
Any thoughts or ideas for improvement?
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).
Did a bit of playing around with the storage. Storing Runs in a dict in the storage, instead of creating a Sequence object (which internally creates lots of Runs) every time, also gives a great speed improvement for large numbers of shots. This might also be worth looking into.
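A minimal local-dict sketch of that idea (the zmq-backed version would keep the dict in the StorageServer instead):

#!python
from lyse import Run, data

run_cache = {}

def get_run(h5_path):
    # Reuse an existing Run object if we have one, avoiding the cost
    # of re-creating it (and, via Sequence, all its siblings) each time:
    if h5_path not in run_cache:
        run_cache[h5_path] = Run(h5_path)
    return run_cache[h5_path]

runs = {p: get_run(p) for p in data()['filepath']}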
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).
I created a branch for my cross-routine storage / cached images and runs over on my repo (here).
The storage is an extension to lyse's server. It is made up of nested dicts, allowing hierarchical indexing, and stores anything that zmq_get will send. One can save, get, and delete entries from storage.
I also added the option to automatically cache images (on first load) and multishot sequence runs (on first creation). This gives me a drastic speed increase in my multishot scripts. If a shot gets removed from lyse the cached entries from that shot are removed as well to reduce memory usage.
We are already running lyse with these changes in our lab. Before caching, one of our scripts using images (with 2000 shots loaded, equaling 4000 images) ran for 17 seconds; after the change it runs in under 1 second.
The options for caching (on/off and timeout) are currently variables hardcoded in the init file. I'm not happy with having them hardcoded, but I'm also not sure where to put them. I would personally opt for putting them in labconfig.
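A sketch of what reading such options from labconfig could look like, assuming LabConfig exposes the standard ConfigParser interface (the section and option names here are hypothetical, not existing labconfig keys):

#!python
from labscript_utils.labconfig import LabConfig

config = LabConfig()
# Hypothetical options - these keys don't exist in labconfig yet:
caching_enabled = config.getboolean('lyse', 'cache_images', fallback=False)
cache_timeout = config.getint('lyse', 'cache_timeout', fallback=5)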
Any ideas for improvement, or on where to put the option to enable/disable caching? I'd also like to create a pull request for this sometime in the near future (but have too many open pull requests at the moment as is), so input is much appreciated.
Original report (archived issue) by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).
When loading a lot of images (about 4000 or more, which is not uncommon in our lab) in multishot routines, this takes a great amount of time. It would be nice if there were a second dataframe-like variable that stored the images so that loading them becomes quick.