jhorey / ferry

Ferry lets you define, run, and deploy big data applications on AWS, OpenStack, and your local machine using Docker
http://ferry.opencore.io
Apache License 2.0

mechanism to customize backend images #9

Open iosusan opened 10 years ago

iosusan commented 10 years ago

Another nice-to-have feature would be a way to extend the backend images, so that a user could add libraries on top of the already built base images. Essentially, the ability to support customized images, either by replacing the base ones or by extending them.

km4rcus commented 10 years ago

Is there any actual procedure that I can use to modify images? In particular, I am interested in openmpi; I have to install some scientific libraries that must be common to all the compute nodes.

Ravenwater commented 10 years ago

+1

Particularly with MPI programs, you tend to need custom libs for each non-trivial application. Conceptually, for the MPI world, I think we need two layers:

  1. A base layer with images that create the MPI fabric. OpenMPI is one, but MVAPICH is better when the networking is IB.
  2. Individual, application-specific images that can create a parallel instance of the application.

The base layer needs to progress with the MPI implementation bug fixes, and get augmented with the ability to specialize the images for MPI applications. There are bound to be some common steps that we can automate as we learn how to deploy MPI applications through containers.

The application images need to progress with the application features and bug fixes.

jhorey commented 10 years ago

There isn't a very good way to modify the backend images (storage or compute) right now. Conceptually it is pretty simple:

  1. Build a new Docker image using one of the existing storage/compute images as a base
  2. Modify the application stack YAML file to indicate the new image

The trick will be telling Ferry how to configure this new backend. Right now it uses hard-coded logic: if Ferry sees that you've configured "openmpi", it fetches the Open MPI configurator (which in turn generates the necessary configuration files). I am thinking of something like this:

backend:

So basically users can provide an optional "image" parameter. Ferry will still use the "personality" parameter to figure out how to configure the service, but will instantiate the customized image. Thoughts?
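To make the proposal concrete, here is a minimal Python sketch of the lookup it implies. It is not Ferry's actual code: the function name, the default image names, and the dict shape of a backend entry are all assumptions for illustration. Only the "personality" and "image" keys come from the discussion above.

```python
# Hypothetical sketch: none of these names come from Ferry's codebase.
from dataclasses import dataclass

# Assumed default image for each known personality (illustrative names only).
DEFAULT_IMAGES = {
    "openmpi": "example/openmpi",
    "hadoop": "example/hadoop",
}


@dataclass
class BackendPlan:
    personality: str  # decides which configurator generates the config files
    image: str        # Docker image that actually gets instantiated


def plan_backend(entry: dict) -> BackendPlan:
    """Resolve one backend entry from a parsed application stack file."""
    personality = entry["personality"]
    if personality not in DEFAULT_IMAGES:
        raise ValueError(f"unknown personality: {personality!r}")
    # The optional "image" key overrides the stock image; the personality
    # still selects the configurator.
    image = entry.get("image") or DEFAULT_IMAGES[personality]
    return BackendPlan(personality=personality, image=image)
```

With an entry such as `{"personality": "openmpi", "image": "myorg/openmpi-sci"}`, the Open MPI configurator would still be chosen, but the customized image would be launched. Leaving out "image" falls back to the stock one.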

In reply to Theodore Omtzigt (Wed, Oct 22, 2014 at 9:21 AM).

Ravenwater commented 10 years ago

With your suggestion/answer I realize that I don't understand the scope and capability of the ferry orchestration language. Is there a pointer to some blueprints/documents/scribbles of the scope of the application stack YAML?

When we think about other stacks, such as the sharded NoSQL stacks, Hadoop, and Spark, elasticity will become a desired feature. For example, Qubole abstracts the Hadoop blob away from the user and uses elasticity to deploy the 'right' amount of infrastructure. Such elasticity will likely have to use the APIs of the cloud provider, but it would be interesting if the ferry application stack YAML could capture this.

For MPI applications, size of the cluster will be a configuration parameter, possibly a command line argument. How all the compute, storage, and networks are orchestrated is where I draw a blank in the division of labor between ferry's YAML and orchestration platforms like AWS CloudFormation and OpenStack Heat/Ceilometer.

jhorey commented 10 years ago

Better documentation surrounding the application YAML file is at the top of my to-do list. The easiest way to think of Ferry with respect to CloudFormation and Heat is that Ferry dynamically generates CF templates and uses CF to instantiate the physical infrastructure. That's an implementation detail, however. In theory you should be able to use Ferry without ever having to think about CF.
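As a rough illustration of what "dynamically generates CF templates" could mean, here is a small Python sketch that emits a minimal CloudFormation template for a cluster of identical nodes. This is an assumption-laden sketch, not Ferry's generator: the function name, the logical ID scheme, the default instance type, and the placeholder AMI ID are all hypothetical. Only the template keys (`AWSTemplateFormatVersion`, `Resources`, `AWS::EC2::Instance`, `ImageId`, `InstanceType`) are standard CloudFormation.

```python
# Hypothetical sketch: not Ferry's actual template generator.
import json


def generate_template(personality: str, instances: int, ami: str,
                      instance_type: str = "t2.micro") -> str:
    """Return a minimal CloudFormation template (JSON) for N identical nodes."""
    if instances < 1:
        raise ValueError("need at least one instance")
    # CloudFormation logical IDs must be alphanumeric, so drop anything else.
    prefix = "".join(ch for ch in personality if ch.isalnum()).capitalize()
    resources = {}
    for i in range(instances):
        resources[f"{prefix}Node{i}"] = {
            "Type": "AWS::EC2::Instance",
            "Properties": {"ImageId": ami, "InstanceType": instance_type},
        }
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Generated stack for personality '{personality}'",
        "Resources": resources,
    }
    return json.dumps(template, indent=2)


if __name__ == "__main__":
    # "ami-12345678" is a placeholder, not a real image ID.
    print(generate_template("openmpi", 2, "ami-12345678"))
```

The point of the sketch is the division of labor: the user states a personality and a cluster size, and the tool derives the infrastructure description from that.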

--James

In reply to Theodore Omtzigt (Wed, Oct 22, 2014 at 3:18 PM).

Ravenwater commented 10 years ago

There is a danger in too much overlap, or YAW (Yet Another Way), to define infrastructure. What attracts me to ferry as a concept is that it can encapsulate the best known methods for orchestrating big data and computational science infrastructure. This involves identifying how compute, network, and storage play together to create a productive, high-performance, or cost-effective Hadoop, Cassandra, MPI, etc. infrastructure on which we can deploy applications. Ferry can differentiate itself from CF and Heat very well in that department: CF and Heat are 'generic' infrastructure description languages, whereas Ferry YAML would be 'optimized' for computational science and data science.

The intercepts of compute, network, and storage are, IMHO, the hard part of leveraging containers, so if all that 'knowledge' can be encapsulated by ferry, I would be very happy.

Let me know if I can help write or complete that documentation.

km4rcus commented 10 years ago

Regarding MPI, I think that at least a parameter to specify a customized image (with libraries/application) is necessary. As already pointed out, the possibility to use different MPI flavours would also be great. This could perhaps be achieved by providing several base images, one per MPI implementation, and then using the configuration file to specify both the MPI flavour and the customized image (with libraries/application).

km4rcus commented 10 years ago

What about providing dockerfiles to ferry in order to build customized images?

jhorey commented 10 years ago

You can already build customized images and specify them to Ferry via the ferry build command. However, it's limited to connectors at the moment. My inclination right now is to use the image option in the application YAML file.

--James

In reply to Marco Mancini (Thu, Oct 23, 2014 at 6:43 AM).