jozu-ai / kitops

Tools for easing the handoff between AI/ML and App/SRE teams.
https://KitOps.ml
Apache License 2.0
267 stars 26 forks

Improve how artifacts are stored locally to avoid duplicating data #75

Open amisevsk opened 3 months ago

amisevsk commented 3 months ago

Describe the problem your feature would solve

Currently, ModelKits are stored using one OCI image index per repository, in the following folder structure:

<storage-root>
└── <registry>
    └── <organization>
        ├── <repository1>
        │   ├── blobs
        │   ├── index.json
        │   └── oci-layout
        └── <repository2>
            ├── blobs
            ├── index.json
            └── oci-layout

Because the OCI image index spec does not leave easy room for multiple repositories within one index, tagging the same image into two separate repositories currently uses double the storage. In other words, executing:

kit tag my-image:mytag my-other-image:mytag

results in the blobs for my-image being copied to another directory.

Note that this issue does not affect ModelKits within the same repository: my-image:tag1 and my-image:tag2 share storage as expected.
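
To get a feel for how much space the current layout wastes, the duplication can be measured directly. The following is a rough sketch, not part of Kit: the function name `duplicated_blob_bytes` is hypothetical, and it assumes each repository directory follows the standard OCI image layout, with blobs stored under `blobs/<algorithm>/<hex digest>`.

```python
from collections import defaultdict
from pathlib import Path


def duplicated_blob_bytes(storage_root):
    """Return how many bytes would be saved if every blob were stored once.

    Hypothetical helper: walks every directory under storage_root that
    contains an `oci-layout` file and groups blobs by digest.
    """
    copies = defaultdict(list)  # "alg:hex" -> size of each stored copy
    for layout_file in Path(storage_root).rglob("oci-layout"):
        blobs_dir = layout_file.parent / "blobs"
        if not blobs_dir.is_dir():
            continue
        for alg_dir in blobs_dir.iterdir():
            if not alg_dir.is_dir():
                continue
            for blob in alg_dir.iterdir():
                if blob.is_file():
                    digest = f"{alg_dir.name}:{blob.name}"
                    copies[digest].append(blob.stat().st_size)
    # Every copy after the first one is redundant.
    return sum(sum(sizes[1:]) for sizes in copies.values())
```

Calling `duplicated_blob_bytes("<storage-root>")` with the actual storage root gives an upper bound on what deduplication could reclaim.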

Describe the solution you'd like

Since blobs are content-addressable and there are no auth concerns with locally stored ModelKits, it makes sense to store each blob only once and reference it from multiple indexes. This would cut down on storage requirements for ModelKits while keeping a relatively pure OCI image index structure.
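
As a minimal sketch of this idea (not Kit's actual implementation), the class below keeps one shared `blobs/sha256` directory and one `index.json` per repository. The class name, method names, and the `indexes/` directory are all hypothetical; the `org.opencontainers.image.ref.name` annotation is the one the OCI image layout spec uses for tags.

```python
import hashlib
import json
from pathlib import Path

REF_NAME = "org.opencontainers.image.ref.name"
MANIFEST_TYPE = "application/vnd.oci.image.manifest.v1+json"


class SharedBlobStore:
    """Hypothetical store: one blob directory, one index per repository."""

    def __init__(self, root):
        self.root = Path(root)
        self.blobs = self.root / "blobs" / "sha256"
        self.blobs.mkdir(parents=True, exist_ok=True)

    def put_blob(self, data):
        """Write data under its sha256 digest; a second write is a no-op."""
        hex_digest = hashlib.sha256(data).hexdigest()
        path = self.blobs / hex_digest
        if not path.exists():
            path.write_bytes(data)
        return f"sha256:{hex_digest}"

    def _index_path(self, repo):
        return self.root / "indexes" / repo / "index.json"

    def _load_index(self, repo):
        path = self._index_path(repo)
        if path.exists():
            return json.loads(path.read_text())
        return {"schemaVersion": 2, "manifests": []}

    def resolve(self, repo, tag):
        """Return the manifest descriptor tagged `tag` in `repo`, or None."""
        for desc in self._load_index(repo)["manifests"]:
            if desc.get("annotations", {}).get(REF_NAME) == tag:
                return desc
        return None

    def set_tag(self, repo, tag, digest, size):
        """Point repo:tag at a manifest digest, replacing any older entry."""
        index = self._load_index(repo)
        index["manifests"] = [
            d for d in index["manifests"]
            if d.get("annotations", {}).get(REF_NAME) != tag
        ]
        index["manifests"].append({
            "mediaType": MANIFEST_TYPE,
            "digest": digest,
            "size": size,
            "annotations": {REF_NAME: tag},
        })
        path = self._index_path(repo)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(index, indent=2))

    def tag(self, src_repo, src_tag, dst_repo, dst_tag):
        """Tag across repositories by copying a descriptor, not the blobs."""
        desc = self.resolve(src_repo, src_tag)
        if desc is None:
            raise KeyError(f"{src_repo}:{src_tag} not found")
        self.set_tag(dst_repo, dst_tag, desc["digest"], desc["size"])
```

With a layout like this, tagging into a second repository only appends a small descriptor to that repository's index. The trade-off is that deleting a ModelKit can no longer remove its blobs unconditionally, since another index may still reference them.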

Describe alternatives you've considered

Alternatively, we could abandon using the image index structure for local storage and instead implement an alternate way of tracking references to ModelKits in local storage. This would avoid the need for potentially awkward workarounds to manage accessing and removing blobs locally.
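
One way to picture this alternative is a single reference table plus a reachability check when something is removed. The sketch below is an illustration only: the function names and the in-memory dict/set data model are assumptions, not an existing Kit API.

```python
def reachable_blobs(refs, manifests):
    """Return every digest still needed by at least one reference.

    refs:      reference string (e.g. "my-image:mytag") -> manifest digest
    manifests: manifest digest -> list of blob digests that manifest uses
    """
    live = set()
    for manifest_digest in refs.values():
        live.add(manifest_digest)
        live.update(manifests.get(manifest_digest, []))
    return live


def remove_reference(refs, manifests, stored, reference):
    """Drop one reference and return the digests that are now safe to delete.

    stored is the set of all digests currently on disk. The caller is
    expected to delete the returned digests from local storage.
    """
    refs.pop(reference)
    return stored - reachable_blobs(refs, manifests)
```

In this model, removing `my-other-image:mytag` while `my-image:mytag` still points at the same manifest returns an empty set, so nothing is deleted until the last reference is gone.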

Additional context

bmicklea commented 2 months ago

I can see the potentially significant storage benefits to implementing this. Does it make Kit and ModelKits any easier to use?