harness / gitness

Gitness is an Open Source developer platform with Source Control management, Continuous Integration and Continuous Delivery.
https://gitness.com
Apache License 2.0

Cache syntactic sugar #2060

Open bradrydzewski opened 7 years ago

bradrydzewski commented 7 years ago

Now that we have a standard for cache plugins (thanks @donny-dont and @jmccann) I think we can add some syntactic sugar to the yaml to simplify cache setup.

Proposed syntax:

+cache:
+  - node_modules

pipeline:
   build:
     image: node
     commands:
       - npm install
       - npm run tests

The above syntax would be used by the yaml parser and compiler to automatically inject steps that restore and rebuild the cache:

pipeline:
+ restore_cache:
+   image: plugins/s3-cache
+   restore: true

   build:
     image: node
     commands:
       - npm install
       - npm run tests

+ rebuild_cache:
+   image: plugins/s3-cache
+   rebuild: true    
+   when:
+     event: [ push ]

We might also want to borrow from the GitLab syntax, allowing the user to specify a caching key:

cache:
  key: ${DRONE_COMMIT_REF}
  paths:
  - node_modules/

We also probably need an option to support temporary volumes when using the Docker runner or the Kubernetes runner:

cache:
  key: ${DRONE_COMMIT_REF}
  paths:
  - node_modules/ # folder inside workspace
  - /cache/gems/  # folder inside temporary volume
  volumes:
  - name: cache
    path: /cache
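
As an illustration, the shorthand above might be expanded along the lines below, following the restore/rebuild pattern shown earlier. The plugin image, the key/mount settings, and the volume wiring are placeholders rather than a finalized contract:

pipeline:
  restore_cache:
    image: plugins/s3-cache
    restore: true
    key: ${DRONE_COMMIT_REF}
    mount:
      - node_modules/
      - /cache/gems/
    volumes:
      - cache:/cache

  build:
    image: node
    commands:
      - npm install
      - npm run tests

  rebuild_cache:
    image: plugins/s3-cache
    rebuild: true
    key: ${DRONE_COMMIT_REF}
    mount:
      - node_modules/
      - /cache/gems/
    volumes:
      - cache:/cache
    when:
      event: [ push ]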

When installing Drone you can configure which plugin should be used by default, as well as the default settings (such as S3 credentials).

If someone doesn't like the way the default cache is implemented, they will always have the ability to define the individual cache steps directly in the pipeline and retain full control over the configuration and behavior. I feel like this is a best-of-both-worlds approach: we get the simplicity of caching in 0.4 with the flexibility and stability of cache plugins in 0.5+.

With this feature I will also build an official volume cache plugin using the drone-cache-lib for teams that need the performance of local caching and simply cannot use remote caching, since this has been requested by a few teams.

donny-dont commented 7 years ago

I can really sympathize with wanting caching to be easier. It is also A LOT more verbose than the initial implementation. The thing I want to mention is that it's really, really easy to do caching poorly.

Problems with using a volume

  1. It doesn't scale out. Each host has its own volume, so as you add more hosts each volume gets stale.
  2. Potential collisions with multiple builds running on the same host. One build is tarring the files and another is doing the same.
  3. Clearing the volume is harder. That's one reason to have the flush, but not every org allows direct access to the build hosts, which can mean the caches end up eating all the disk.

Problems with caching in general

  1. It's not clear what you SHOULD be caching. This is especially true with node and its many different package managers.
  2. It's not clear if compression should be used. It's probably fine to skip it with volumes, but for S3 a dev actually found that compression was better.
  3. Caching can fail a build if the cache step is not successful.

Other questions.

  1. What happens with matrix builds?
  2. What about clearing the cache?
  3. What type of archive is being created?

Caching is complicated and I really think this is YAML plugin territory where you can essentially have full control over the important bits.

bradrydzewski commented 7 years ago

Thanks for the reply. I should probably clarify that plugins aren't going away and volume caching would not be the default. When installing Drone you would have the option to set up default cache configuration values.

This could look something like this:

DRONE_CACHE_S3=true
DRONE_CACHE_S3_ACCESS_KEY=..
DRONE_CACHE_S3_SECRET_KEY=...
DRONE_CACHE_S3_IMAGE=jmccann/s3-cache # this would be implied

The cache block (below) would use the global plugin and global configuration (above). When using the cache block you would only have the option to specify files and folders that need caching.

cache:
  - node_modules
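
As an illustration, with the global settings above the two-line cache block might compile to something like the following; the step names and plugin parameters are placeholders, with credentials injected from the DRONE_CACHE_S3_* values:

pipeline:
  restore_cache:
    image: jmccann/s3-cache   # taken from DRONE_CACHE_S3_IMAGE
    restore: true
    mount:
      - node_modules

  build:
    image: node
    commands:
      - npm install
      - npm run tests

  rebuild_cache:
    image: jmccann/s3-cache
    rebuild: true
    mount:
      - node_modules
    when:
      event: [ push ]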

Problems with using a volume

I don't want to focus too much on the volume plugin because it would be just one of many plugins you could choose from. I will say that I think there are use cases where it makes sense, for example running Drone on a single machine.

Potential collisions with multiple builds running on the same host. One build is tarring the files and another is doing the same.

I would probably look at implementing a lock using https://github.com/pkg/singlefile

It's not clear if compression should be used. It's probably fine to skip it with volumes, but for S3 a dev actually found that compression was better.

I would expect that when the system admin chooses the default plugin they have the ability to set default options including compression level, compression algorithm, and other parameters exposed by the underlying plugin.

What about clearing the cache?

I would propose that compatible plugins must be based on drone-cache-lib and have some default policy in place for flushing the cache. I'm not sure how well the flush code works in practice (too aggressive, not aggressive enough), so we will need to evaluate.

Caching can fail a build if the cache step is not successful.

This is definitely something we will need to address. It was perhaps the biggest issue we saw with the original volume cache approach in 0.4, next to running out of disk space.

Caching is complicated and I really think this is YAML plugin territory where you can essentially have full control over the important bits.

I agree. This is just syntactic sugar for configuring cache plugins in the pipeline. I would expect that if a project needs full control, it configures the cache plugins directly in the pipeline section instead of using the (proposed) cache section.

bradrydzewski commented 7 years ago

Perhaps another, more generic way of solving this problem would be supporting the extends keyword from Docker Compose. It could be used to apply default configurations.

pipeline:
  restore:
    extends: restore-cache
    mount: [ node_modules ]

  rebuild:
    extends: rebuild-cache
    mount: [ node_modules ]

This would not be specific to caching and could be used for any plugin or build step.
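
A sketch of how this might fit together, assuming the administrator provides base definitions for the extends keyword to resolve against (the names and location of these definitions are hypothetical):

# Base definitions provided by the administrator (hypothetical):
restore-cache:
  image: plugins/s3-cache
  restore: true

rebuild-cache:
  image: plugins/s3-cache
  rebuild: true

# With those in place, a step such as
#
#   restore:
#     extends: restore-cache
#     mount: [ node_modules ]
#
# would resolve to:
#
#   restore:
#     image: plugins/s3-cache
#     restore: true
#     mount: [ node_modules ]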

bradrydzewski commented 7 years ago

Also as a side note, we should look into using this for the S3 cache plugin :) https://aws.amazon.com/about-aws/whats-new/2011/12/27/amazon-s3-announces-object-expiration/

jhasse commented 7 years ago

Having to specify which directories to cache only once is a big step forward! :)

I'd also love it if caching were simply a bunch of commands after which the docker diff would be cached; I don't know if that's even possible. E.g.:

cache:
- npm install # creates node_modules

The image in which the command should run would need to be specified though. This would allow caching of directories outside of the working directory, e.g.

cache:
- dnf install -y foo-devel

Maybe this is a totally different issue and should rather be called setup: or something like this, and can be added to any pipeline step.

bradrydzewski commented 7 years ago

@jhasse caching files and folders outside the workspace will work once we support the compose volume system. It would look something like this:

pipeline:
  build:
    image: maven2
    commands:
      - mvn ...
      - mvn ...
+   volumes:
+     - my-data-volume:/usr/share/maven2
  deploy:
    image: maven2
    commands:
      - mvn ...
      - mvn ...
+   volumes:
+     - my-data-volume:/usr/share/maven2

+volumes:
+  my-data-volume:

The diff idea definitely sounds interesting but would be a major change and does not fit nicely into our current plugin design. So for 1.0 we will continue down the current path but try to improve the documentation, simplify the configuration, and increase the number of high-quality plugins.

thomasf commented 7 years ago

When I used the old (0.4?) cache, the performance benefit of using it at all was often negated by the time it took to extract and recreate the cache archives. This is at least to some extent related to the fact that our test runner server doesn't use SSDs, but IIRC the compression didn't use more than a single CPU core (on a 24-core machine), which probably added a lot of time to the process for multi-GB caches.

In a scenario with multi-host agents, the cache folder can in many cases live on a shared network-mounted file system.

An additional issue with host-local caches, on top of scale-out, is that different nodes will have slightly different caches, which can lead to annoying debugging problems when builds differ slightly from host to host.

tonglil commented 7 years ago

To chime in on the default cache plugin, would shipping a default Minio server help? This would direct users away from mounting anything from the host, while keeping performance relatively reasonable as Minio would sit (I would imagine) fairly close to Drone, and provide an easy migration path to S3/GCS.

tboerger commented 7 years ago

To chime in on the default cache plugin, would shipping a default Minio server help? This would direct users away from mounting anything from the host, while keeping performance relatively reasonable as Minio would sit (I would imagine) fairly close to Drone, and provide an easy migration path to S3/GCS.

The current s3-cache plugin doesn't work for Docker layer caching. But besides that, I'm already using Minio pretty actively for the Drone cache.

anasinnyk commented 6 years ago

@bradrydzewski It would be awesome if we could create the cache key from a file checksum (for example the glide.lock file) and cache a directory by git revision. It would also be cool if old cache versions could be removed automatically, or if only a configured number of cache keys were kept (for example, keep the last 5 cache buckets and remove older ones when a new one is created with some tag).
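
For illustration, a checksum-based key and a retention limit along those lines could look like the following; the checksum templating and the keep setting are hypothetical syntax, not something Drone implements:

cache:
  key: deps-{{ checksum "glide.lock" }}
  paths:
  - vendor/
  keep: 5   # hypothetical: retain only the 5 most recent cache entries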

bradrydzewski commented 5 years ago

@kakkoyun wondering if you could chime in. The goal would be to have cache syntax built into the yaml but use caching plugins under the covers. I want to make sure our proposed syntax would support the cache plugin you recently blogged about.

See my full comment here.

The basic idea is that we would add this:

+cache:
+  key: ${DRONE_COMMIT_REF}
+  paths:
+  - node_modules/

steps:
  - name: build
    image: node
    commands:
    - npm install
    - npm run tests

kakkoyun commented 5 years ago

Hey @bradrydzewski, this looks great. And thanks a lot for inviting me to chime in. I would love to see this in our configuration files. This would slim down our configuration files significantly. Great job you guys are doing here. That being said, I have a couple of questions and suggestions.

First of all, how do you plan to pass additional configuration parameters on to the plugins? Through environment variables (presumably with a prefix, like CACHE_*) or through settings, as in the step configuration?

Another question: do you have an interface/contract in mind for a minimum set of required configuration keys? (Probably restore, rebuild, and key would be there.)

And as a suggestion, I would love to have the ability to specify a cache per step. A single pipeline run could have different tools as steps, each generating various directories. Since this is just syntactic sugar, I hope this wouldn't be too hard to implement.

steps:
  - name: build
    image: node
    commands:
    - npm install
    - npm run tests
+   cache:
+     key: {checksum package.json}
+     paths:
+     - node_modules/

  - name: generate
    image: node
    commands:
    - npm install
    - npm generate
+   cache:
+     key: {checksum generation_logic.json}
+     paths:
+     - generated/

What do you think? Would your users benefit from such an implementation?
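
For illustration, each per-step cache block might expand into a restore step immediately before, and a rebuild step immediately after, the step that declares it. Sketched below for the build step only; the step names, plugin image, and settings are placeholders:

steps:
  - name: restore_build_cache      # injected before the owning step
    image: plugins/s3-cache
    settings:
      restore: true
      key: {checksum package.json}
      mount:
      - node_modules/

  - name: build
    image: node
    commands:
    - npm install
    - npm run tests

  - name: rebuild_build_cache      # injected after the owning step
    image: plugins/s3-cache
    settings:
      rebuild: true
      key: {checksum package.json}
      mount:
      - node_modules/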

bradrydzewski commented 4 years ago

@kakkoyun sorry for the delay in my response. I am interested in your previous example and was wondering how you decide whether or not to rebuild vs restore. Can you elaborate a bit?

bradrydzewski commented 4 years ago

I wanted to provide an update. I have been working to design a caching system that will work with container-based pipelines and non-container-based pipelines (e.g. exec pipelines). These are just some draft notes and are not set in stone.

Container-based Pipelines

For container-based pipelines we would treat the cache as a volume. The system would snapshot and restore the volume at runtime. Individual pipeline steps would be able to mount the cache volume to the desired location.

volumes:
- name: gopath
  cache:
    checksum:
    - go.mod
    - go.sum
    ttl: 24h

steps:
- name: build
  image: golang
  volumes:
  - name: gopath
    path: /go/pkg
  commands:
  - go build
  - go test

The above syntax would be syntactic sugar for plugins. It would effectively add a step at the beginning of the pipeline to restore the cache, and a step at the end of the pipeline to update the cache. When you install the runner, you would configure the default caching plugin to use (e.g. meltwater/drone-cache). We would come up with some standard set of parameters to pass to the plugin at runtime.
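
For example, with meltwater/drone-cache configured as the default plugin, the volume declaration above might compile to something along these lines. The settings shown (restore, rebuild, cache_key, mount) roughly follow that plugin's documented options, but the exact mapping, step names, and checksum templating here are a sketch rather than a settled contract:

steps:
- name: restore-cache              # injected at the start of the pipeline
  image: meltwater/drone-cache
  volumes:
  - name: gopath
    path: /go/pkg
  settings:
    restore: true
    cache_key: '{{ checksum "go.mod" }}-{{ checksum "go.sum" }}'
    mount:
    - /go/pkg

- name: build
  image: golang
  volumes:
  - name: gopath
    path: /go/pkg
  commands:
  - go build
  - go test

- name: rebuild-cache              # injected at the end of the pipeline
  image: meltwater/drone-cache
  volumes:
  - name: gopath
    path: /go/pkg
  settings:
    rebuild: true
    cache_key: '{{ checksum "go.mod" }}-{{ checksum "go.sum" }}'
    mount:
    - /go/pkg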

It is important to remember that runners do not have guaranteed disk access and cannot be expected to read pipeline files. Pipeline steps do have guaranteed disk access. So we need to design a solution that is compatible with the system constraints, which is why this proposal is mostly syntactic sugar to add cache steps to the pipeline.

Non-Container-based Pipelines

For non-container-based pipelines (such as exec pipelines) we would take a slightly different approach and use different syntax. Remember that exec pipelines run directly on the host without isolation and there is no concept of volume mounts.

For non-container-based pipelines we would define a cache section:

cache:
- name: artifacts
  path: /path/to/file
  checksum:
  - go.mod
  - go.sum
  ttl: 24h

steps:
- name: build
  commands:
  - go build
  - go test

The cache would be downloaded at the beginning of the pipeline and snapshotted and uploaded on pipeline completion. Since there is no concept of plugins for exec pipelines, we would make caching pluggable through extensions (microservices, similar to secret extensions, etc).

An alternate syntax may separate inputs and outputs. We would restore inputs from the cache, and snapshot and upload the output directories to the cache on completion. The yaml would be more verbose but may prevent unnecessary cache operations.

inputs:
- name: artifacts
  path: /path/to/file

outputs:
- name: artifacts
  path: /path/to/file

Also note that, unlike container-based pipelines, the runner may have access to the filesystem to read and write files directly. The exec runner will definitely have filesystem access. The ssh runner has filesystem access via sftp, which is perhaps less ideal since caching would require the runner to be a middleman between the cache and the pipeline instance.