openminted / Open-Call-Discussions

A central place for participants in the open calls to ask questions

Docker image size restrictions #38

Open saxtouri opened 6 years ago

saxtouri commented 6 years ago

This issue is critical from an infrastructure perspective. Please, let me know what you think:

Some of the docker images are unexpectedly large (e.g., 7-8 GBs or more). Because of that:

  1. the system fails to execute these jobs due to timeout
  2. even if we override the timeout issue by pre-downloading the huge image, the cluster nodes get full very quickly, rendering the system unusable

I think we should reconsider our policy on docker image sizes and set size restrictions. If we don't, we will experience a lot of failures and our storage resources will get consumed really fast. Also, if we run the system on a commercial cluster, the cost of the infrastructure will rise very quickly and it will be nearly impossible to scale.

I wonder how images get so big. What kind of software requires so many GBs, considering that desktop OSs overloaded with heavy apps can be bundled in just 5 to 10 GBs? Is it just software, or do we allow people to load images with data as well?

Putting any kind of data inside the images is a highly inefficient practice because it makes our system too expensive to run (I can explain this further, if you like).

What do you think we should do?

dumitrescustefan commented 6 years ago

Hi, I actually asked something like that on the forum. For example, we provide a multilingual tool that performs tokenization, sentence splitting, POS tagging and parsing on 50+ languages. Even if the neural network model itself is only ~50-100 MB per language, we still need vector embeddings to perform at state-of-the-art levels. For example, fastText embeddings are around 0.5-1.0 GB per language. Times 50, that makes a docker image of over 50 GB, to say the least.

For the current version of the component we removed the vector embeddings completely and we lose ~1-2% accuracy, at a 10x size reduction. So the size question is very valid.

So we either provide one docker image per language (which, including vector embeddings, should max out at around 2 GB each), or a single image, as now, at about 5 GB without vectors.

greenwoodma commented 6 years ago

The problem is that we have a policy of including all the data with the component, as it is the only way to ensure reproducibility. If we allowed images to download or otherwise access remote data then a) we'd need to allow possibly arbitrary internet access, which is a security risk we didn't want to take, and b) we would not be able to guarantee reproducibility, which completely negates one of the main selling points of the platform.

The current trend in TDM seems to be deep learning, and that often results in huge models which need to be packaged with the components. If we want to support all forms of TDM then we need to support this, which will require either large docker images (components and data together), or the data would need to be uploaded to the registry as a separate OMTD resource and then passed to the docker image somehow. Would this second approach be better from a resource perspective, @saxtouri? I would imagine it would be technically more difficult, as you'd need to download the resource as well as the docker image and then make the resource available inside the running docker instance.

@dumitrescustefan in your case I would suggest one docker image per language anyway, because, as far as I know, once the docker image is registered as an application you won't be able to change the language param (assuming it's a param that chooses which language to use), so you'd have to register the full image for each language anyway, which would be wasteful. Instead you could register the per-language images with the full embeddings and still have reasonably small images for download.

reckart commented 6 years ago

If the resources are downloaded from a trusted repository (e.g. a trusted Maven repository) and not just from an arbitrary URL, IMHO it is a valid approach.

greenwoodma commented 6 years ago

@reckart true, but there are limits to how much some of those repos allow you to upload; Maven Central doesn't have a fixed limit, but they do state they will look at any very large uploads and may remove them. After all, they are a code repository, not a data repository. We'd also need to ensure that any repo we did support had the same guarantees as Maven Central, i.e. once uploaded the item is fixed and cannot be changed or deleted, otherwise we have the same issue with reproducibility.

If we take this approach, though, we then get into another discussion: how is bandwidth into the cluster charged? It may not be, but commercial cloud offerings would probably charge for it, so downloading huge models, possibly each time we wanted to run a component (given that the docker images are static etc.), could also run up huge costs, and could be very slow and fragile depending on the connection.

reckart commented 6 years ago

True, there are strings attached and further considerations to be taken into account. Mind that e.g. the NER models provided by the FREME EU project, which can be used via DKPro Core, are also ~1 GB per language. So there is precedent. The models are hosted on the UKP Maven repo.

reckart commented 6 years ago

With respect to Maven repos, IMHO OMTD should set up a proxy repo next to the platform to ensure that artifacts are cached and do not always have to be downloaded remotely. I think I already mentioned that in another mail/issue. That would also mitigate the reproducibility problem, because the platform keeps a copy.
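Just to make the idea concrete, a minimal sketch of such a proxy, here using a Nexus container as one possible option (the container name, volume name and choice of tool are only placeholders, not a decision):

```bash
# Hypothetical sketch: a caching Maven proxy running next to the platform.
docker volume create omtd-nexus-data          # placeholder volume name
docker run -d --name omtd-maven-proxy \
  -p 8081:8081 \
  -v omtd-nexus-data:/nexus-data \
  sonatype/nexus3
# Components would resolve artifacts via http://<proxy-host>:8081, and the
# proxy keeps a cached copy of every artifact it serves.
```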

saxtouri commented 6 years ago

From a non-TDM perspective, I understand that our tools/components need three things:

  1. software
  2. auxiliary data (e.g. dictionaries)
  3. input data

We currently keep 1 (software) and 2 (auxiliary data) packed in the docker image. We let the user upload 3 (input data) to the system, which ends up being stored in a shared space, accessible by the docker containers.

What we could do

If we could separate 1 (software) from 2 (auxiliary data), we could store the latter in a shared space, accessible by all docker containers, and let the containers read it (see the sketch below). That would solve the problem, as long as the storage system can detect and/or avoid redundancy.
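A minimal sketch of what I mean, with the auxiliary data bind-mounted read-only into the container at run time (all paths and the image name are hypothetical placeholders):

```bash
# Hypothetical sketch: auxiliary data lives on a shared mount and is
# bind-mounted read-only into the container when the job runs.
AUX_DATA=/mnt/shared/aux-data/component-x      # placeholder path
INPUT=/mnt/shared/input/job-42                 # placeholder path
docker run --rm \
  -v "$AUX_DATA":/opt/aux:ro \
  -v "$INPUT":/opt/input:ro \
  omtd/component-x:latest                      # placeholder image name
```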

What we must NOT do

We must not download 2 (the auxiliary data) to the execution node itself, every time a docker image is dispatched. It can break the system.

greenwoodma commented 6 years ago

> every time a docker image is dispatched. It can break the system.

@saxtouri I'm not sure I follow this. Are you saying that the simple action of downloading a docker image ready to execute it is so fragile that it can pull down the entire system? If that's the case then we have bigger problems than the size of the docker images.

saxtouri commented 6 years ago

@greenwoodma No, sorry for the exaggeration. What I meant was that if someone runs a few of these gigantic docker images, it will render the cluster nodes unusable ("no space left on device") until we clean them up manually.

greenwoodma commented 6 years ago

@saxtouri phew, you had me worried for a moment.

Can we not automatically clean up and remove the docker images once an execution has completed?

pennyl67 commented 6 years ago

@saxtouri @greenwoodma The auxiliary data can be uploaded to the registry separately - we have the interface for adding the (in)famous annotation resources & models & grammars. But as far as I know we don't yet have any code for downloading them from where they are, storing them and feeding them to the components.

saxtouri commented 6 years ago

@greenwoodma I will work on this idea of automatically cleaning up the images when they have finished executing. Obviously it is not an optimal solution, since images will have to be re-downloaded every time, but it could keep the system from running out of space.
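As a first cut, the cleanup step could be as simple as something like the following (the image name is a placeholder, and the prune filter assumes a reasonably recent docker version):

```bash
# Hypothetical cleanup after a job finishes: remove the specific image
# the job used ...
docker rmi registry.example.org/omtd/component-x:1.0   # placeholder name
# ... or, more bluntly, prune all images not referenced by any container,
# keeping anything pulled within the last 24 hours.
docker image prune -a -f --filter "until=24h"
```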

@pennyl67 @galanisd I could create a shared storage space where the Registry will store auxiliary data. If the Workflow Service is aware of which auxiliary data correspond to which docker image, it could configure the component execution so that the container can read the data while it is being executed.

Is this doable?

pennyl67 commented 6 years ago

@saxtouri @galanisd @greenwoodma Not as easy as it sounds: we need to ensure that the auxiliary data is uniquely identified - this can be done via the metadata (resourceIdentifier), but I don't know how easy it is to pass all this info through the registry, workflow, etc.

saxtouri commented 6 years ago

@pennyl67 @galanisd @greenwoodma Since we are brainstorming, another idea is to use a shared space just for pulled images and configure docker to use it. Thanks to the way docker stores pulled images, this idea shouldn't be hard to implement.
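Roughly, "configure docker to use it" could look like the following sketch, pointing each engine's image store at the shared mount (the path is a placeholder, and the exact option name may depend on the docker version):

```bash
# Hypothetical sketch: make each docker engine keep its pulled images on
# the shared mount instead of the node-local /var/lib/docker.
cat >/etc/docker/daemon.json <<'EOF'
{
  "data-root": "/mnt/shared/docker"
}
EOF
systemctl restart docker
```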

Still, we might have some redundancy issues when the same image is pulled by two hosts. A solution to this problem could be to pre-pull the image to the shared space before executing the task.

This could work, right?

It could even solve the Maven issue described by @reckart in this discussion.

greenwoodma commented 6 years ago

sounds like a sensible solution to me @saxtouri -- I don't know enough about docker or the cloud setup to really help, but if you think it should work and it solves your concerns then it gets my vote!

saxtouri commented 6 years ago

OK, I will run an experiment today to check if it works. If everything seems right, I will apply the recipe on "testing" tomorrow.

Even if it does work, though, we need to make changes to Registry and/or Workflow Service to ensure that docker images will be pulled before they are executed.

gkirtzou commented 6 years ago

To get an idea of the docker image sizes that come from Open Call II

antleb commented 6 years ago

We decided that the auxiliary resources should be bundled with the executable code inside the docker images because we didn't have enough time to design and implement a way to pass resources to the docker image (a way that covers all cases: file formats, one or more files per resource, etc.). We have two ways to handle the size problem:

pennyl67 commented 6 years ago

@saxtouri @greenwoodma @antleb I think we all agree that the simplest solution must be followed. So, waiting for @saxtouri's experiment to (hopefully!) work.

saxtouri commented 6 years ago

Bad news everyone: the experiment did not work, at least not initially. The docker storage manager we have installed on our setup does not support image sharing between different docker engines, and this is by design. Although images consist of the same sha256 blocks, the storage manager uses random-looking IDs to point at them, and these differ from engine to engine.

Hope is not lost, though. Maybe we can configure the storage manager to store images in a different way, or switch to another storage manager. Docker supports a few storage managers out of the box. I don't know if any of them supports our use case, but I will find out soon.

saxtouri commented 6 years ago

OK, I think I found something: devicemapper [2] is one of the docker engine storage drivers [1], and it stores data in a single binary file rather than just in blocks. I did some tests with two hosts (so, two docker engines sharing the same shared directory) and it worked. It is a bit slower on `docker run`, especially if a container performs a lot of writes, but once an image is pulled, it is much quicker to launch it on another slave node, regardless of where it was initially pulled.

There is also another option called vfs, which is said to be more robust but slower. I didn't have any problems with devicemapper so far, so I suggest we go with that. I would like to test vfs as well, but I think we should leave it for another iteration.
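For reference, the switch is essentially a storage-driver entry in the daemon config; a sketch of what it could look like is below (the shared path is a placeholder and the actual test setup may differ in detail):

```bash
# Hypothetical sketch: shared data root plus the devicemapper storage driver.
cat >/etc/docker/daemon.json <<'EOF'
{
  "data-root": "/mnt/shared/docker",
  "storage-driver": "devicemapper"
}
EOF
systemctl restart docker
```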

I suggest we try a shared pull space with devicemapper on "testing". We can do this tomorrow, without disrupting the open call, but we will need to take some slaves offline from time to time (we will always have at least two operational slaves).

Do you people agree with my suggestion? Is there a reason NOT to try this on testing?

[1] https://docs.docker.com/storage/storagedriver
[2] https://docs.docker.com/storage/storagedriver/device-mapper-driver

antleb commented 6 years ago

It's fine by me if there are no interruptions but only slowdowns.

saxtouri commented 6 years ago

Oh no, I spoke too soon! I tested with only two slaves, but when I added a third, the system became so slow that all runs timed out...

I have another idea, though: we can reserve a part of the cluster just for overweight components. No fancy storage drivers, just one or two slaves with extra storage, dedicated to running huge images. If we know which components require huge images to run, we already know how to redirect them to a specific part of the cluster, while the rest of the components default to the rest of the cluster (a sketch of one way to express this is below). That would solve the problem without slowing down the rest of the system. Also, it is something we can implement in 1 or 2 days.
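How the redirection is expressed depends on the scheduler; purely for illustration, assuming a Swarm-style setup (node and image names are placeholders, and our actual workflow service/scheduler may expose this differently), it could be label-based placement:

```bash
# Illustration only, assuming Docker Swarm-style scheduling:
# mark the slaves that have the extra storage ...
docker node update --label-add storage=large slave-3    # placeholder node
# ... and constrain the "overweight" components to those nodes.
docker service create \
  --constraint 'node.labels.storage == large' \
  omtd/huge-component:latest                            # placeholder image
```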

Sorry for the bipolar posts...

pennyl67 commented 6 years ago

Thanks @saxtouri; the good thing is that @gkirtzou has also taken it upon herself to find more info for each component (storage, memory, etc.) and we'll get this info asap. And we need a solution very fast to do the testing. To me this sounds viable, at least for this phase - but I'm not a tech expert. @antleb and @galanisd?

saxtouri commented 6 years ago

@pennyl67 I will test my second idea on the testing cluster while waiting for @gkirtzou to complete the image size survey.

Note: IMO, this is not a strictly technical issue, because it affects policies and operations at the registration level. When we go live, we will need a way to distinguish between "fit" and "overweight" components and manage them accordingly.