jozu-ai / kitops

An open source DevOps tool for packaging and versioning AI/ML models, datasets, code, and configuration into an OCI artifact.
https://KitOps.ml
Apache License 2.0

Consider QoL improvements to tar artifacts generated by kit #489

Open amisevsk opened 2 months ago

amisevsk commented 2 months ago

Describe the problem you're trying to solve

When packing a modelkit, each layer consists of a tar of the files in that layer's directory. For example, packing a model layer

model:
  path: model-files/

will result in a plain tar containing the contents of model-files (but not the directory itself). During unpacking, this context is reconstructed from the Kitfile so that digests are unchanged.

In addition, creating reproducible modelkits (unpacking and repacking results in the same digest) requires some additional changes, such as setting owner/group, timestamps, etc. to a known value.

This means that the tar files generated by Kit lose some information from the packing environment -- tooling using modelkits depends on kit-specific logic for extracting the contents of a modelkit.
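
As a rough illustration of the two properties described above (contents packed relative to the layer directory, and metadata pinned to known values), here is a minimal Python sketch. It is not Kit's actual implementation; the function names `pack_layer` and `layer_digest`, the zeroed timestamp, and the default owner ID of 0 are assumptions made for the example.

```python
import hashlib
import io
import tarfile
from pathlib import Path


def pack_layer(layer_dir: Path, owner_id: int = 0) -> bytes:
    """Tar the contents of layer_dir (not the directory itself) deterministically."""
    buf = io.BytesIO()
    # Plain, uncompressed tar with a fixed header format.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        # Sort entries so archive order does not depend on filesystem order.
        for path in sorted(layer_dir.rglob("*")):
            name = path.relative_to(layer_dir).as_posix()
            info = tar.gettarinfo(str(path), arcname=name)
            if info is None:  # unsupported file type (e.g. a socket)
                continue
            # Overwrite environment-specific metadata with known values.
            # File modes are left untouched in this sketch.
            info.uid = info.gid = owner_id
            info.uname = info.gname = ""
            info.mtime = 0
            if info.isfile():
                with path.open("rb") as f:
                    tar.addfile(info, f)
            else:
                tar.addfile(info)
    return buf.getvalue()


def layer_digest(data: bytes) -> str:
    """Return an OCI-style sha256 digest string for the tar bytes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()
```

Packing the same directory twice with this sketch yields the same digest even if file timestamps change in between, but the archive no longer records that the files came from `model-files/`.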

Describe the solution you'd like

To make layer tars more useful outside of a kit context, it would help to find a way to include more of the context from the packing environment (e.g. directory names).

Additional context

Considering this change, it's likely something we want to decide sooner rather than later; any change to the pack process will break reproducibility of repacking unpacked modelkits. We also need to consider compatibility, so that previously-packed modelkits can still be unpacked with whatever new format we choose.

bmicklea commented 1 month ago

If implemented, these will be breaking changes, so it's better to make them sooner rather than later. We'll take this into the sprint to discuss how to handle it and (maybe) to implement it.

bmicklea commented 1 month ago

Sprint 17 goal: make the decision on what we are going to do for this.

Implementing the change can follow in a future sprint if needed.

amisevsk commented 3 weeks ago

I've given this some thought and I'm focussed on two general areas of improvement: 1) the tar artifacts generated for ModelKit layers, and 2) extensions to our config object that make it somewhat easier to consume ModelKits. I'll list the high-level changes here, with more detail below.

  1. Pack layer tarballs to include the full directory structure from the Kit pack command's context (instead of relative to the layer's path)
  2. Update the default owner ID inside tarballs to user 1000 instead of 0
  3. Extend the manifest config structure to include digests and potentially a diffID equivalent
  4. Add an explicit layer for the plain Kitfile, after making the changes to the config object (add a new mediaType application/vnd.kitops.modelkit.kitfile.v1+yaml)

All of these changes are "nice to haves" rather than strictly necessary, so we may not want to ultimately bother with them at this time.


Update how we structure tar files for ModelKit layers

These changes largely center around making the individual layers of a ModelKit easier to use in non-ModelKit contexts (such as injecting ModelKits into containers as we do on jozu.ml).

Currently, we pack data into tar files by effectively switching context into the layer's directory and copying its contents into a tarball. This means that if you pack, for example, ./data/my-modelkit/my-modelkit.safetensors, files inside the tar will have relative paths that exclude data/my-modelkit (packing a tarball with just the file ./my-modelkit.safetensors). Kit then needs to re-construct the directory structure when unpacking to reproduce the original project. As a result, if you just extract the tar files used for layers, all of the directory structure is lost. Instead, I would like Kit to do less magic here, and have the tarred layers extract to the same directory structure as in the original context, packing ./data/my-modelkit/my-modelkit.safetensors instead of just my-modelkit.safetensors.
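
To make the difference between the two layouts concrete, here is a small Python sketch that packs the same layer both ways. It is only an illustration, not Kit's code: the `pack` and `member_names` helpers and the `keep_context` flag are hypothetical names, and only regular files are added to keep the example short.

```python
import io
import tarfile
from pathlib import Path


def pack(context: Path, layer_path: str, keep_context: bool) -> bytes:
    """Tar one layer; keep_context selects the proposed layout over the current one."""
    source = context / layer_path
    if keep_context:
        # Proposed: member names are relative to the pack context.
        base = context
    else:
        # Current: member names are relative to the layer's own directory.
        base = source if source.is_dir() else source.parent
    if source.is_file():
        files = [source]
    else:
        files = sorted(p for p in source.rglob("*") if p.is_file())
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path in files:
            tar.add(str(path), arcname=path.relative_to(base).as_posix())
    return buf.getvalue()


def member_names(data: bytes) -> list[str]:
    """List the member paths stored in a tar archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(data)) as tar:
        return tar.getnames()
```

With a context containing `data/my-modelkit/my-modelkit.safetensors`, packing the layer `data/my-modelkit` with `keep_context=False` stores only `my-modelkit.safetensors`, while `keep_context=True` stores the full `data/my-modelkit/my-modelkit.safetensors` path, so a plain `tar -x` would recreate the original tree.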

Additionally, in order to ensure re-packing the same directory results in the same digest, we overwrite the owner's user ID on all files to be the root user. This can cause issues in handling the tars directly, since the current user may not have access to the files; instead I'd like to default to user 1000, which is at least a non-privileged user.

Extend the ModelKit config object

Currently, the application/vnd.kitops.modelkit.config.v1+json media type is just a JSON-encoded Kitfile. This makes retrieving the Kitfile easy for any given ModelKit, but also means tools attempting to process ModelKit manifests need to implement the same logic as the Kit CLI -- for example, given a Kitfile

manifestVersion: 1.0.0
package:
  name: my-modelkit
model:
  name: my-modelkit
  path: my-modelkit.gguf
code:
  - path: LICENSE
    description: License file.
  - path: README.md
    description: Readme file.

and manifest

{
  "schemaVersion": 2,
  "config": {
    "mediaType": "application/vnd.kitops.modelkit.config.v1+json",
    "digest": "sha256:b19e288cbb07d2bd79e666cbbcc31b53d521f012a175fdfc54da52814c673dbd",
    "size": 426
  },
  "layers": [
    {
      "mediaType": "application/vnd.kitops.modelkit.model.v1.tar",
      "digest": "sha256:649716826e7381c1dd4f7909121ddd2b3581f5ebe793a7d27ab5cb65151f32c4",
      "size": 531067392
    },
    {
      "mediaType": "application/vnd.kitops.modelkit.code.v1.tar",
      "digest": "sha256:42df82c72f37f856070f346b9d1ff8b25154a2a2dd9eb597a1ac8ab297486e47",
      "size": 13312
    },
    {
      "mediaType": "application/vnd.kitops.modelkit.code.v1.tar",
      "digest": "sha256:8fb7ccbb0d326272ae0b375bab6f5c31da5a8748f32fa769936bc674fd634140",
      "size": 6656
    }
  ]
}

The only way to grab the README.md layer is to match its index in the Kitfile (the second code entry) with its index in the manifest (the second layer with mediaType application/vnd.kitops.modelkit.code.v1.tar). This trips up both tools that want to extract metadata from ModelKits and also contributors to this repository.
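
The index matching described above could look roughly like the following Python sketch. It assumes the Kitfile and manifest have already been parsed into dictionaries (the data below is transcribed from the example, with sizes omitted), and `code_layer_digest` is a hypothetical helper name, not part of the Kit CLI.

```python
CODE_MEDIA_TYPE = "application/vnd.kitops.modelkit.code.v1.tar"

KITFILE = {
    "manifestVersion": "1.0.0",
    "package": {"name": "my-modelkit"},
    "model": {"name": "my-modelkit", "path": "my-modelkit.gguf"},
    "code": [
        {"path": "LICENSE", "description": "License file."},
        {"path": "README.md", "description": "Readme file."},
    ],
}

MANIFEST = {
    "schemaVersion": 2,
    "layers": [
        {"mediaType": "application/vnd.kitops.modelkit.model.v1.tar",
         "digest": "sha256:649716826e7381c1dd4f7909121ddd2b3581f5ebe793a7d27ab5cb65151f32c4"},
        {"mediaType": CODE_MEDIA_TYPE,
         "digest": "sha256:42df82c72f37f856070f346b9d1ff8b25154a2a2dd9eb597a1ac8ab297486e47"},
        {"mediaType": CODE_MEDIA_TYPE,
         "digest": "sha256:8fb7ccbb0d326272ae0b375bab6f5c31da5a8748f32fa769936bc674fd634140"},
    ],
}


def code_layer_digest(kitfile: dict, manifest: dict, path: str) -> str:
    """Find the digest of a code layer by matching positions in both documents."""
    # Position of the entry among the Kitfile's code entries.
    index = [entry["path"] for entry in kitfile.get("code", [])].index(path)
    # Layers with the code media type, in manifest order.
    code_layers = [
        layer for layer in manifest["layers"]
        if layer["mediaType"] == CODE_MEDIA_TYPE
    ]
    return code_layers[index]["digest"]


print(code_layer_digest(KITFILE, MANIFEST, "README.md"))
```

The lookup only works because both documents happen to list code entries in the same order; nothing in the manifest itself ties a digest to a path.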

Instead, we could have an "enhanced" Kitfile for our config -- for example

manifestVersion: 1.0.0
package:
  name: my-modelkit
model:
  name: my-modelkit
  path: my-modelkit.gguf
  digest: sha256:649716826e7381c1dd4f7909121ddd2b3581f5ebe793a7d27ab5cb65151f32c4
code:
  - path: LICENSE
    description: License file.
    digest: sha256:42df82c72f37f856070f346b9d1ff8b25154a2a2dd9eb597a1ac8ab297486e47
  - path: README.md
    description: Readme file.
    digest: sha256:8fb7ccbb0d326272ae0b375bab6f5c31da5a8748f32fa769936bc674fd634140

These extensions would be compatible with the Kitfile definition, so you could just ignore the new fields to get the original Kitfile. I'd also like to include the full YAML Kitfile in the manifest as a regular layer, to make it easy to retrieve explicitly.
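
For comparison, a consumer of the "enhanced" config could resolve a path to its layer digest directly, and recover the original Kitfile by dropping the new fields. The Python sketch below assumes the config has been parsed into a dictionary and that the new field is named `digest` as in the example; `digest_for_path` and `strip_digests` are hypothetical helper names.

```python
from typing import Optional

ENHANCED = {
    "manifestVersion": "1.0.0",
    "package": {"name": "my-modelkit"},
    "model": {
        "name": "my-modelkit",
        "path": "my-modelkit.gguf",
        "digest": "sha256:649716826e7381c1dd4f7909121ddd2b3581f5ebe793a7d27ab5cb65151f32c4",
    },
    "code": [
        {"path": "LICENSE", "description": "License file.",
         "digest": "sha256:42df82c72f37f856070f346b9d1ff8b25154a2a2dd9eb597a1ac8ab297486e47"},
        {"path": "README.md", "description": "Readme file.",
         "digest": "sha256:8fb7ccbb0d326272ae0b375bab6f5c31da5a8748f32fa769936bc674fd634140"},
    ],
}


def digest_for_path(config: dict, path: str) -> Optional[str]:
    """Return the layer digest recorded next to a path, or None if absent."""
    for section in config.values():
        # Sections are either a single entry (model) or a list of entries (code).
        entries = section if isinstance(section, list) else [section]
        for entry in entries:
            if isinstance(entry, dict) and entry.get("path") == path:
                return entry.get("digest")
    return None


def strip_digests(node):
    """Recursively drop 'digest' fields to recover the plain Kitfile content."""
    if isinstance(node, dict):
        return {k: strip_digests(v) for k, v in node.items() if k != "digest"}
    if isinstance(node, list):
        return [strip_digests(v) for v in node]
    return node


print(digest_for_path(ENHANCED, "README.md"))
```

No index bookkeeping is needed here: the digest sits next to the path it describes, and `strip_digests(ENHANCED)` returns the same structure as the original Kitfile.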

gorkem commented 2 weeks ago

I agree with these proposed changes for structuring tar files for ModelKit layers. Preserving the original directory structure in the tar file will simplify things. The current structure forces too much magic we need to implement.

Switching the file owner to user 1000 is also a practical choice.

When we make these changes, we need to ensure that the Kit CLI can still handle the current tar structure.

gorkem commented 2 weeks ago

Also +1 for the config object change and the kitfile as a regular layer