lago-project / lago

Ad-hoc virtual testing environment framework
http://lago.readthedocs.org
GNU General Public License v2.0

Add ImageProviders concept #359

Open nvgoldin opened 7 years ago

nvgoldin commented 7 years ago

Hi, here is an initial proposal on how to enhance Lago's handling of images. Regarding the terminology we use today: I use 'images' here for what we currently call templates, as I think it is clearer. This proposal does not depend on, but complements, the general concept of layered images in https://github.com/lago-project/lago/issues/51. Thanks @gbenhaim, @ifireball for the contributions.

ImageProvider API

Abstract

The idea of this proposal is to allow Lago to use different remote servers to obtain images from (mainly in qcow2 format), with the following goals in mind:

  1. It should be easy to use and configure different image providers, and fairly easy to implement support for new ones. Providers currently under consideration are: the virt-builder index format and Glance (in addition to the already-used lago-images format).
  2. This support should be, as much as possible, independent of the server-side implementation, in order to allow using public, existing image repositories (such as libguestfs.org). This in turn will let Lago users adopt it without the precondition of using lago-images as a provider.
  3. Any change should not invalidate (at least in the near term) images already installed by Lago users, or introduce any backward-incompatible issues.
  4. Verify images using whatever common mechanism each remote server's API supplies.
  5. In the long run, this should allow dropping maintenance of lago-images.
  6. Reuse most of the existing code.

Current templates (images) mechanism in Lago

  1. TemplateStore - in charge of storing local images in the following format::

       repo_name:image:version -> qcow2 (usually) file
       repo_name:image:version.metadata -> metadata file
       repo_name:image:version.users -> tracks usage
       repo_name:image:version.tmp -> used to store the file while it is being downloaded
  2. TemplateRepository - manages lago-images repository and has preparation for extending it to different providers.

  3. HttpTemplateProvider/FileSystemProvider - providers for downloading the images.

Suggested ImageProvider API

The ImageProvider API would be used to search for and query images, and to download them from a remote server into the local cache. New providers will be required to implement it.

API


  1. get(hash, path) - download the image identified by the hash to the specified local path.
  2. get_by_name(name) - returns the first hash found for the given name, if it exists.
  3. exists(hash) - returns true iff the image exists on the server.
  4. get_metadata(hash) - returns the image metadata stored at the server.
  5. search(name) - returns a list of (image-name, hash, metadata) tuples matching the name criteria, with grep-like behaviour.
  6. (optional) clear_cache() - invalidate the local cache, if one exists.

Local index caching (i.e., mapping between names and hashes) for each ImageProvider will be optional. Where implemented, it will allow some providers to optimize lookup time.

Decompressing and verifying the image file will be an implementation detail of each provider.
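
To make this concrete, here is a minimal sketch of the interface in Python; the abc-based structure, argument names, and the choice of which methods are optional are illustrative only, not a final implementation:

```python
# Minimal sketch of the proposed ImageProvider interface; method
# names follow the API list above, everything else is illustrative.
from abc import ABC, abstractmethod


class ImageProvider(ABC):
    @abstractmethod
    def get(self, image_hash, path):
        """Download the image identified by `image_hash` to `path`."""

    @abstractmethod
    def get_by_name(self, name):
        """Return the first hash found for `name`, or None."""

    @abstractmethod
    def exists(self, image_hash):
        """Return True iff the image exists on the server."""

    @abstractmethod
    def get_metadata(self, image_hash):
        """Return the image metadata stored at the server."""

    def search(self, name):
        """Return a list of (image_name, hash, metadata) tuples
        matching `name`, grep-like. Providers may override this."""
        raise NotImplementedError

    def clear_cache(self):
        """Invalidate the local index cache, if one exists (optional)."""
```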

Handling images with the new ImageProvider API

Local image caching

  1. By default, image caches will be stored globally; however (as is possible today), the user will be able to override the directory and opt for local caching.
  2. Inside the image cache directory (be it global or local), each image will be saved with its SHA512 hash as the filename.
  3. A local index file will be maintained, which might use the virt-builder index file format (or a different, more optimized hash -> (file, metadata) structure). A sketch of such an index follows.
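
As an illustration only (the on-disk format is still undecided), a hypothetical cache layout and index update could look like this; `CACHE_DIR` and the JSON index file are assumptions:

```python
# Hypothetical local cache layout: images stored under their SHA512,
# plus a JSON index mapping hash -> (file, metadata). Illustrative only.
import json
import os

CACHE_DIR = os.path.expanduser('~/.lago/images')  # assumed default
INDEX_FILE = os.path.join(CACHE_DIR, 'index.json')


def path_for(sha512):
    """Return the cache path of an image: its SHA512 as the filename."""
    return os.path.join(CACHE_DIR, sha512)


def add_to_index(sha512, metadata):
    """Record a downloaded image and its metadata in the local index."""
    index = {}
    if os.path.exists(INDEX_FILE):
        with open(INDEX_FILE) as index_file:
            index = json.load(index_file)
    index[sha512] = {'file': path_for(sha512), 'metadata': metadata}
    with open(INDEX_FILE, 'w') as index_file:
        json.dump(index, index_file)
```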

Seeking images

  1. If the search key is a hash (for instance, when resolving layered images, or when given explicitly by the user): check the cache first, then query the external providers.
  2. If the search key is a name: iterate over the ImageProviders to resolve it to a hash; if the hash exists in the cache, use it, otherwise download it. If there is no internet connection, the local cache will be queried directly, via a reverse lookup table (name -> hash) maintained for the cache.

This will allow one-to-many mappings between image names and hashes, so if a new image with the same name is pushed to the server, it will be downloaded when searched for by name (semi-automatic update behaviour). A sketch of this lookup flow follows.
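
A rough sketch of that flow, under the assumption of a hypothetical `cache` object (supporting `in`, `path_for` and a name -> hash reverse table) and an `is_hash` helper:

```python
# Hedged sketch of the seek flow described above; `providers` is the
# configured list of ImageProvider instances, `cache` and `is_hash`
# are hypothetical helpers.
def resolve(key, providers, cache):
    if is_hash(key):
        # e.g. when resolving layered images: cache first, then providers.
        if key in cache:
            return cache.path_for(key)
        for provider in providers:
            if provider.exists(key):
                provider.get(key, cache.path_for(key))
                return cache.path_for(key)
    else:
        # Search by name: resolve to a hash, download if missing.
        for provider in providers:
            image_hash = provider.get_by_name(key)
            if image_hash is not None:
                if image_hash not in cache:
                    provider.get(image_hash, cache.path_for(image_hash))
                return cache.path_for(image_hash)
        # Offline: fall back to the reverse name -> hash table.
        image_hash = cache.lookup_by_name(key)
        if image_hash is not None:
            return cache.path_for(image_hash)
    raise RuntimeError('image not found: %s' % key)
```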

Local cache invalidation

  1. As long as the user doesn't explicitly ask to modify or remove the images, the cache will not be cleared.
  2. A flag would be added to allow invalidating local images, or ignoring the existing ones.

New image verbs

  1. lago images pull <name/hash>, alias lago pull: download the given image to the local cache. If it has a backing_file chain, ensure all images in the chain exist in the cache, downloading any that are missing (see the sketch below).
  2. lago images pull --no-recursive <name/hash>: skip resolving the backing_file parameter, if it exists.
  3. lago images search, alias lago search: search for an image by hash or by name in the configured ImageProviders, also indicating whether it already exists locally.
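
For illustration, the recursive pull could look roughly like this; the 'backing_file' metadata key is an assumed convention, and `cache`/`is_hash` are the hypothetical helpers from the sketches above:

```python
# Hedged sketch of `lago images pull` resolving a backing_file chain.
def pull(provider, key, cache, recursive=True):
    # Accept either a hash or a name, resolving the name to a hash.
    image_hash = key if is_hash(key) else provider.get_by_name(key)
    if image_hash is None:
        raise RuntimeError('image not found: %s' % key)
    if image_hash not in cache:
        provider.get(image_hash, cache.path_for(image_hash))
    if recursive:
        # Follow the backing_file chain until we hit a base image.
        parent = provider.get_metadata(image_hash).get('backing_file')
        if parent:
            pull(provider, parent, cache, recursive=True)
```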

Configuring ImageProviders

  1. As today, Lago will come pre-configured with a default external provider.
  2. Other ImageProviders would be configurable via the configuration file, or CLI.

HASH vs Versioning

The image 'versioning' concept will be dropped; instead, the name -> HASH mapping will be used as described above. This has the advantage of uniquely identifying images, regardless of which provider they were downloaded from.

Main advantages

  1. Once this and https://github.com/lago-project/lago/issues/51 are implemented, it would be possible to compose image chains from different servers. For example, the base image could be downloaded from a well-known official server, while the next layer is downloaded from a local server.
  2. Lago will be decoupled from lago-images, thus allowing users to configure Lago to use widely-used providers, or to set up their own providers easily.
  3. There will be a clear and consistent way to search, download and remove images.

Possible implementation stages

  1. Create ImageProvider API and wrap lago-images as the first ImageProvider plugin. No major changes to TemplateStore.
  2. Create virt-builder ImageProvider plugin.
  3. Implement the image cache in the new format, leaving images the user already has in the old format untouched.
  4. Expose the new verbs in the CLI (in stage 1, lago init would already start using the internal implementation).
  5. Optional - Implement Glance ImageProvider.
ifireball commented 7 years ago

Some comments:

david-caro commented 7 years ago

I have a couple questions:

The main issue with the above is that it will not allow you to directly use gluster or virt-builder repos, though you could have some thin repo proxy to add the extra metadata (and map from a 'lago version' to a gluster image).

nvgoldin commented 7 years ago

@ifireball

In abstract section 4 - "verify images" - should probably be "verify image data integrity", to be clear about what we mean by "verify".

I think ImageProvider.get should be called ImageProvider.download, to indicate it is going to do the expensive I/O. Calling it "get" sends the wrong signal IMO, as if it were something like dict.get.

I think ImageProvider.search should not be mandatory at this point. And when we do add it, we need to think carefully about the query format so we can leverage server-side capabilities. It is highly unlikely that servers will implement the exact Python or grep (POSIX? PCRE? GNU?) regex dialect.

ImageProvider.exists should probably just be ImageProvider.contains, to allow usage of the "in" operator.

:+1: - I agree with most of the above. I probably should have written it explicitly, but I wrote the API section without the exact implementation details yet; it still needs some polishing. About search - I also think this can be added later, though having an initial version won't hurt. As a first thought, it looks problematic to try to define a single search query format for all providers, as it might differ greatly from provider to provider. IMO, a plain search that matches the beginning of the image's name is sufficient (and useful!); later we can think about more complex queries, if needed at all (filter by architecture, OS, etc.). See the sketch below.
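
For concreteness, the plain prefix search mentioned above could be something like this; `list_images` is an assumed helper yielding (name, hash, metadata) tuples, not an existing API:

```python
# Hypothetical default search: plain prefix matching on image names.
def search(self, name):
    return [
        (image_name, image_hash, metadata)
        for image_name, image_hash, metadata in self.list_images()
        if image_name.startswith(name)
    ]
```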

We need to specify what is in the image metadata. Leaving it unspecified makes it useless.

True - but I'd rather have this done in the PRs (and review); I will need to get my hands dirty and test removing all the metadata to check what is absolutely necessary (other than the HASH). I already tested plain cloud images of fc24 and centos7, and they work without any metadata at all.

We need to carefully consider and specify the hash format. Some considerations: virt-builder supports only SHA512; Glance supports only MD5 by default (but custom properties could be added); re-hashing is very expensive I/O-wise; and we should be efficient when someone uses only one kind of provider (so not try to rehash everything with a specific algorithm).

I don't think there is any way to avoid rehashing if it is in a different format. Though virt-builder works with SHA512, and that is the first and main provider we are going to implement, so I'd rather go with that.

Bottom line - we should probably include the algorithm in the hash string. This also has implications for #51.

Agree. @gbenhaim - what do you think? (affects #51)
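
For illustration, an algorithm-prefixed hash string could look like 'sha512:9f86d0...'; a hypothetical helper pair (names and separator are illustrative only):

```python
# Hypothetical algorithm-prefixed hash strings, e.g. 'sha512:9f86d0...'.
def make_hash_string(algorithm, hexdigest):
    return '%s:%s' % (algorithm, hexdigest)


def parse_hash_string(hash_string):
    algorithm, _, digest = hash_string.partition(':')
    return algorithm, digest
```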

In the "Local Image Caching" we need to specify the exact cache API like we did for ImageProvider.

:+1: - will do; needs more inspection.

I don't think lago images pull is useful or needed - the user does not need to handle images manually IMO - just list them in the LagoInitFile.

I think it is; I would love (as a user) to have the ability to pre-fetch images. Of course, the automatic action would still be to pull the images from the init file; it doesn't contradict that (and it is just a matter of exposing the internal download command in the CLI).

nvgoldin commented 7 years ago

@david-caro

How would you generate the images on the server side? I'd love to be able to build them locally too (that would unify the code, simplify the image building process, and allow custom local builds if needed, something similar to Docker and Dockerfiles). My original idea was that the image command would be able to generate images from image recipes, locally (at some point, maybe even be able to upload the recipes/images somewhere, though that's a completely different service, like Docker Hub or Vagrant repos).

One thought is to create something similar to createrepo (maybe createvirtbuilder-index?) that would auto-generate the index.asc file; it would extract all it can from the qcow2 file and calculate the hash if needed. The only thing it would need to be 'provided' with is the following virt-builder fields: osinfo, arch, expand, with the 'expand' field optional. A rough sketch of the idea follows.
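
Something along these lines, as a sketch only: the index field names follow the documented virt-builder index format, while the function name, argument handling, and overall shape are hypothetical:

```python
# Hedged sketch of a createvirtbuilder-index style entry generator:
# extract what we can via `qemu-img info`, hash the file, and let the
# repo maintainer supply osinfo/arch/expand.
import hashlib
import json
import os
import subprocess


def index_entry(path, name, osinfo, arch, expand=None):
    # qemu-img can report the format and virtual size as JSON.
    info = json.loads(subprocess.check_output(
        ['qemu-img', 'info', '--output=json', path]))
    # Hash the file in chunks; images are too big to read at once.
    digest = hashlib.sha512()
    with open(path, 'rb') as image:
        for chunk in iter(lambda: image.read(1 << 20), b''):
            digest.update(chunk)
    lines = [
        '[%s]' % name,
        'file=%s' % os.path.basename(path),
        'format=%s' % info['format'],
        'size=%d' % info['virtual-size'],
        'compressed_size=%d' % os.path.getsize(path),
        'checksum[sha512]=%s' % digest.hexdigest(),
        'osinfo=%s' % osinfo,
        'arch=%s' % arch,
    ]
    if expand:
        lines.append('expand=%s' % expand)
    return '\n'.join(lines)
```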

The content of an image changes each time you generate it, so the hashes will not be consistent between servers/rebuilds, right?

Not sure why that is a problem. Let's say I'm the maintainer of an images repo and I would like to add a new image; assume it is a new build of fedora24. I have two options. I can replace the current image, keeping the same name (and obviously it will have a new hash): a user whose init file is configured with 'fedora24' then gets a rolling update, and I can rename the old image to something else if I'd like to keep it (it would obviously keep its old hash). Or, if I want to explicitly differentiate my image, I can name it fedora24-something, and the user will have to ask for it explicitly. In this sense the image 'name' is just a tag in the repo for the hash, nothing more. The same tag could point to different image hashes in different providers; it is up to the maintainer of the images repo to decide. On Lago's side, verification would ultimately always be done by the hash.

But still, this does not solve the ordering issue, as hashes are not consecutive, thus IMO we still need some kind of versioning. My proposal is to use the versions of the parent recipes plus the version of the current recipe. For example, you'd have an image with the version string 1.0-2.6-4.1-abcdef.fedcba, meaning that you used base recipe 1.0, second recipe version 2.6 and third recipe version 4.1 (with hash abcdef), and generated an image with hash fedcba. That ensures:

  1. You can pin a version to an image file.
  2. You know which versions of the recipes were used.
  3. You know the number of recipes needed.
  4. You maintain versioning order: you know that 1.0-2.6-4.3 is newer than 0.2-2.6-4.3.
  5. You can use kind-of semantic versioning, where if none of the major numbers changes you'd expect the images to be backwards compatible; so when specifying an image requirement, you could say something like nfs-server~1.0-2.0-4.3 and get the latest that matches the three major versions. That will allow easy and nicer dependency declarations between them.
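
(For concreteness, a sketch of how such a version string might be parsed into an orderable form, following the example in the quoted proposal; the function and the split rules are hypothetical:)

```python
# Hedged sketch: parse '1.0-2.6-4.1-abcdef.fedcba' into comparable
# recipe-version tuples plus the recipe and image hashes.
def parse_version(version_string):
    *recipe_versions, hashes = version_string.split('-')
    recipe_hash, image_hash = hashes.split('.')
    versions = [tuple(int(n) for n in v.split('.'))
                for v in recipe_versions]
    # Tuples of ints compare naturally, giving the ordering above.
    return versions, recipe_hash, image_hash

# parse_version('1.0-2.6-4.1-abcdef.fedcba')
# -> ([(1, 0), (2, 6), (4, 1)], 'abcdef', 'fedcba')
```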

I'm not convinced it is absolutely necessary to have versions. I think for the common use case the 'base' images shouldn't change often: most likely you will use virt-builder's official repo and use 'fedora24'/'centos7'. Once you get into layers, it becomes more of a specific use case, so it is reasonable that the maintainer of the layer would ask users either to explicitly use the 'tag' he created, such as el7-with-jenkins-2.6-10102016, or to explicitly use the hash.

About the recipes: similarly to how you do versioning in Docker, it will be controlled by making versions of the LagoInitFile. This might mean we will need to add more 'virt-sysprep' options to the init file for how to "chew" the image (such as disabling cloud-init or adding support for it). As I wrote in the previous comment, I don't think we are far from there - the basic cloud images of Fedora and CentOS just work in Lago with the current sysprep commands (aside from booting taking longer, as it waits for cloud-init).

I also think that the metadata should not be in the image itself unless you can put arbitrary data into it; if that's the case then it's OK, but I don't think it's a good idea to reuse an existing field and give it a new meaning.

:+1: I checked with qemu-img and there is no "official" way to store extra metadata in the qcow2 format such that it would appear in the qemu-img info command. The only parameter we must use is the backing_file parameter, for the parent hash. Either way, it seems inevitable to store the metadata for each image elsewhere (we could use a virt-builder index.asc for that locally too; the problem is that it wouldn't allow efficient querying of the cache directly).

david-caro commented 7 years ago

About the createrepo command: that is already done in the lago-images code. The extra information is in the recipes (with any/all the commands needed, including the info of which image to base it on).

About the changing hash: it might not be a big issue, as long as you have one and only one image provider and don't build the images locally. Maybe it's something to recheck once it becomes an issue, though, it being kind of a central feature, it will be hard to change later.

About versions: as with the hashes, it's something we don't need now (as we are the only ones using it, and hardcoding the versions/names in our scripts). But, and it's a big but, using sensible versioning enables us to start really distributing the images, allows us to define whether an upgrade is 'safe' or not, and brings any other benefit that you get with the 'tags' you mention (which, as I see it, are a pseudo-versioning).

About the recipes, I think you misunderstood me; I'm talking about the recipes to generate the images, not the LagoInitFile that uses them. For example, to generate the image fedora24_nfs, you need a fedora24 base image and some commands to install and configure the NFS server there. All that should probably not be in the LagoInitFile, as it would considerably increase the time to get it the first time; it would preferably be done on the server once, at a previous time, so you just have to download the image. Those 'recipes' must already contain the info on what the parent image is, and I'd expect their versions to match the image's one (or to have some way to know the recipe used from the image).

That goes very elegantly with versioning the images after the recipes (i.e., including the recipe versions in the image's version).

About the metadata in the image file: you might strictly need only the backing file hash (though you could put it somewhere else too, like the URL or the name of the image), but there is a lot more useful information that you are missing, and by choosing to put it in the restricted metadata field you will not be able to add it easily later without a rewrite of the code that extracts/adds/uses that hash. For example: the default user, the root partition, the default password, the OS name, who built it, when, how, a small description...

For the current repos we use a JSON file that contains that metadata. I think it's very easy to cache locally or via a proxy, and to mirror (plain rsync is more than enough). Not needing a complex app to host a basic repo is really nice too: with metadata files you don't even need an app, you can use a read-only frontend and generate the images internally, which is quite safe (and fast to serve).