`ilab` engine proposal

cdoern commented 7 months ago

This enhancement introduces a new design for ilab. It primarily adds sub-parent commands, new modes of interaction with ilab, and clarity on what the source and sink of data are in the system we are building.

jeremyeder commented 7 months ago

Overall, unless there are conflicting proposals to merge into this one, I think this is a meaningful and logical next step from the MVP capabilities, and we should pursue it.

bbrowning commented 7 months ago

Has any thought been given to integrating more deeply with Open Container Initiative (OCI) here? For example, I use Kitops today to store the source taxonomy, generated synthetic data, and generated models I get from ilab directly in an OCI filesystem layout and push/pull to/from OCI registries (quay, docker hub, etc).

There's a lot of overlap between this proposal and what kitops handles today, which would be good to consider, as there are good ideas in both. kit has a concept of a ModelKit, which is really just an OCI config that points to different OCI layers with differing semantic meanings and mediatypes. Under the covers it uses Oras to manage the metadata, source code, datasets, and/or generated models as OCI artifacts. Is there an opportunity to combine forces a bit here for a bigger community push around generated data and models as OCI artifacts?
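To make the "config that points to typed layers" idea concrete, here is a minimal Python sketch of an OCI-style manifest whose layers carry different media types. The top-level fields follow the OCI image manifest layout; the media type strings and the `descriptor` / `build_manifest` helpers are hypothetical and chosen only for illustration. They are not the media types or code that kitops actually uses.

```python
import hashlib
import json


def descriptor(media_type: str, blob: bytes) -> dict:
    """Build an OCI content descriptor (media type, digest, size) for a blob."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


# Hypothetical media types, for illustration only (not kitops' real ones).
CONFIG_TYPE = "application/vnd.example.kit.config.v1+json"
DATASET_LAYER = "application/vnd.example.kit.dataset.v1.tar"
MODEL_LAYER = "application/vnd.example.kit.model.v1.tar"


def build_manifest(config: bytes, layers: list[tuple[str, bytes]]) -> dict:
    """Assemble a manifest that points at one config blob and N typed layers."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": descriptor(CONFIG_TYPE, config),
        "layers": [descriptor(media_type, blob) for media_type, blob in layers],
    }


# Placeholder bytes stand in for real tarballs of generated data and weights.
manifest = build_manifest(
    config=json.dumps({"name": "example"}).encode(),
    layers=[
        (DATASET_LAYER, b"placeholder generated data"),
        (MODEL_LAYER, b"placeholder model weights"),
    ],
)
print(json.dumps(manifest, indent=2))
```

Because each layer is addressed by digest and labeled by media type, a client can tell a dataset from a model without downloading either blob.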

Some examples:

- `kit pack` packs my model (and other artifacts I may care about, like generated data and taxonomy) into a ModelKit, which is just an OCI artifact. What if the output of `ilab model generate` was a ModelKit with generated data? And the output of `ilab model train` was a ModelKit with the trained model?
- `kit tag` tags my ModelKit, much like the proposed `ilab tag`.
- `kit push` pushes the ModelKit from my local registry to any OCI registry.
- `kit pull` pulls the ModelKit from a remote OCI registry to my local registry.
- `kit unpack` extracts the source, dataset(s), or generated model from either a local or remote ModelKit. This works well when you just want to fetch part of the ModelKit (just the model itself, or perhaps just a specific dataset), as it only pulls the OCI layers needed for the type(s) you request.
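The selective fetch described for `kit unpack` can be sketched as a filter over a manifest's layers. This is only an illustration in Python: the `KIND_TO_MEDIA_TYPE` mapping, the media type strings, the truncated-looking digests, and the `select_layers` helper are all hypothetical, and this is not how kitops itself is implemented.

```python
# Hypothetical mapping from a requested artifact kind to a layer media type.
KIND_TO_MEDIA_TYPE = {
    "code": "application/vnd.example.kit.code.v1.tar",
    "dataset": "application/vnd.example.kit.dataset.v1.tar",
    "model": "application/vnd.example.kit.model.v1.tar",
}


def select_layers(manifest: dict, kinds: set[str]) -> list[dict]:
    """Return only the layer descriptors whose media type matches a requested kind."""
    wanted = {KIND_TO_MEDIA_TYPE[kind] for kind in kinds}
    return [layer for layer in manifest["layers"] if layer["mediaType"] in wanted]


# A toy manifest with made-up digests and sizes; real ones would come from a registry.
example_manifest = {
    "schemaVersion": 2,
    "layers": [
        {"mediaType": KIND_TO_MEDIA_TYPE["code"], "digest": "sha256:aaa", "size": 10},
        {"mediaType": KIND_TO_MEDIA_TYPE["dataset"], "digest": "sha256:bbb", "size": 2000},
        {"mediaType": KIND_TO_MEDIA_TYPE["model"], "digest": "sha256:ccc", "size": 900000},
    ],
}

# Only the blobs behind these descriptors would need to be downloaded.
to_fetch = select_layers(example_manifest, {"model"})
print([layer["digest"] for layer in to_fetch])
```

The point of the design is that the decision about what to download is made from the small manifest alone, so asking for just the model never touches the dataset or source layers.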

If we go down the OCI route, we open up a much wider world of registries we can push models and datasets to, since many enterprises already have an OCI registry for container images and every cloud provider offers its own SaaS version. Additionally, we open up some potential deeper integration with tools like podman and docker as we consider how someone might want to construct a container image consisting of an inference server base container image with their generated model layered on top.

I'm not sure kitops gets everything right, but given that OCI and Oras give us a lot of the proposed functionality as a default part of OCI (filesystem layout, push, pull, list, inspect, tag, etc) it seems reasonable to consider.

alimaredia commented 7 months ago

After really thinking through this proposal with @cdoern today, I think the part of this proposal that establishes a hierarchy of commands but does not add any new functionality should be accepted and work should be started as soon as possible.

The new command structure is an upgrade over the existing ilab command, especially for first-time users, and it gives us the flexibility to have ilab be an "engine" if we decide to do so in future discussions.

Backward compatibility of existing commands should be kept for a pre-determined number of milestones or amount of time that is agreed upon in this enhancement.
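One way to keep the old commands working during such a window is to rewrite them to their new hierarchical form and emit a deprecation warning. The Python sketch below assumes flat legacy commands named `generate` and `train` that map to `ilab model generate` and `ilab model train`; that mapping, the `resolve_command` helper, and the `--some-flag` placeholder are assumptions for illustration, not the actual ilab implementation.

```python
import warnings

# Assumed mapping from legacy flat commands to the new hierarchical commands.
LEGACY_ALIASES = {
    "generate": ["model", "generate"],
    "train": ["model", "train"],
}


def resolve_command(argv: list[str]) -> list[str]:
    """Rewrite a legacy command to its new form, warning that it is deprecated."""
    if argv and argv[0] in LEGACY_ALIASES:
        new_parts = LEGACY_ALIASES[argv[0]]
        new_cmd = " ".join(new_parts)
        warnings.warn(
            f"'ilab {argv[0]}' is deprecated; use 'ilab {new_cmd}' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Keep any remaining arguments untouched.
        return new_parts + argv[1:]
    return argv


# "--some-flag" is a placeholder argument, not a real ilab option.
print(resolve_command(["train", "--some-flag"]))
print(resolve_command(["model", "train"]))
```

Keeping the aliases in a single table makes it straightforward to delete them once the agreed number of milestones has passed.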

instructlab / dev-docs

`ilab` engine proposal #9