substratusai / runbooks

Finetune LLMs on K8s by using Runbooks
https://www.substratus.ai

Consolidate Substratus Image Repos #106

Closed nstogner closed 1 year ago

nstogner commented 1 year ago

Proposal: Consolidate all Substratus image repos for easier maintenance and easier search for users.

NOTE: In this implementation, the base images could be used for Models, Datasets, and Notebooks, and would contain a notebook.sh script along with Jupyter pre-installed. This removes the need for a separate notebook image repo.

base-images.git:

  ubuntu/*    # Image: substratusai/os-ubuntu
  ...

dataset-images.git:

  squad/*     # Image: substratusai/dataset-squad
  ...

model-loader-images.git:

  huggingface/* # Image: substratusai/model-loader-huggingface
  ...

model-trainer-images.git:

  huggingface/* # Image: substratusai/model-trainer-huggingface
  ...

server-images.git:

  basaran/*   # Image: substratusai/server-basaran
  ...

Path-based triggers could be used to filter GitHub Actions build jobs. For example, when building the falcon-7b model:

on:
  push:
    paths:
      # '**' also matches files in nested subdirectories; a single '*' does not cross '/'.
      - 'models/falcon-7b/**'
      - 'base/ubuntu/**'
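
To illustrate how such path filters select build jobs, here is a minimal Python sketch that maps changed files to the images that would need rebuilding. The `TRIGGERS` table, the `datasets/squad` directory, and the `images_to_build` function are hypothetical names used for illustration only. The matching is a simple directory-prefix check that treats any file beneath a trigger directory, at any depth, as a match; it is not GitHub's glob implementation.

```python
from pathlib import PurePosixPath

# Hypothetical mapping: image name -> directories whose changes trigger a rebuild.
TRIGGERS = {
    "falcon-7b": ["models/falcon-7b", "base/ubuntu"],
    "squad": ["datasets/squad", "base/ubuntu"],
}


def images_to_build(changed_files, triggers=TRIGGERS):
    """Return the sorted image names whose trigger directories contain a changed file."""
    selected = set()
    for changed in changed_files:
        path = PurePosixPath(changed)
        for image, dirs in triggers.items():
            # A file matches when a trigger directory is one of its ancestors.
            if any(PurePosixPath(d) in path.parents for d in dirs):
                selected.add(image)
    return sorted(selected)


if __name__ == "__main__":
    # A change to the shared base rebuilds every image that depends on it.
    print(images_to_build(["base/ubuntu/Dockerfile"]))
    # A change to one model directory rebuilds only that model.
    print(images_to_build(["models/falcon-7b/Dockerfile"]))
```

The point of the sketch is that a change under the shared base directory fans out to every dependent image, while a change under one model directory stays local to that model.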

nstogner commented 1 year ago

See the following implementations of consolidated repos:

https://github.com/substratusai/base-images https://github.com/substratusai/model-loader-images

If we like this direction, we can add to it, otherwise, we can remove.

brandonjbjelland commented 1 year ago

@samos123 and I discussed this for a few minutes and reached consensus on a monorepo including base images. Rationale:

  1. This is a reversible decision - we just need to decide and commit.
  2. Monorepos are a good optimization for speed (they reduce context switching and duplication), especially early on. Support here.
  3. It's contributor friendly - if a community member wants to add a model loader, dataset loader, or even has their own category of image that we hadn't thought of (model evaluator, quantizer, dataset generator, widget producer), the image monorepo is a natural place for even odd use cases (for now at least - there's a reasonably good case further down the road for community vs Substratus-supported images).
  4. Building and testing become simpler when you have a single set of actions to work against. It's easier to ship as a single unit.
  5. It limits or eliminates cross-repo references/indirection - the answer to "how did this base image get built?" is something a passerby can readily find in a monorepo.

brandonjbjelland commented 1 year ago

The implementation I have in mind:

  1. a repo named image-builders - unsure if we want to split early and call this official-image-builders or supported-image-builders, contrasting it with community-image-builders
  2. a path structure like you suggested only with a base dir reflecting purpose:
    bases/ubuntu/*    # Image: substratusai/os-ubuntu
    servers/basaran/*   # Image: substratusai/server-basaran
    model-loaders/huggingface/* # Image: substratusai/model-loader-huggingface
    model-trainers/huggingface/* # Image: substratusai/model-trainer-huggingface
    dataset-loaders/squad # from substratusai/dataset-squad

Another server emerges when we have santacoder and starcoder working and fine-tuned. Those servers need to implement the Language Server Protocol spec to be used by LSP clients. Likewise, text-generation-inference would make for a worthwhile server effort.

Questions:

  1. I dropped the redundant bits that the parent directory captures. E.g., model-trainers/huggingface/* would be identical to the current contents of the repo substratusai/model-trainer-huggingface and would publish an image to Docker Hub at substratusai/model-trainer-huggingface. Does this seem reasonable?
  2. Can the current dataset loader be generalized to all non-token-protected Hugging Face datasets? If so, I think we have a pattern to follow here.
  3. We should circle back to the question of token-protected HF bits. I'll create an issue to discuss.
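
The directory-to-image naming raised in question 1 could be sketched roughly as follows. The singularizing rule and the `OVERRIDES` table are assumptions inferred from the examples listed in this comment (bases/ubuntu publishing os-ubuntu, dataset-loaders/squad coming from dataset-squad); `image_name` is a hypothetical helper, not part of any Substratus tooling.

```python
ORG = "substratusai"

# Assumed exceptions: categories whose image prefix is not simply the
# singular form of the directory name, taken from the examples above.
OVERRIDES = {
    "bases": "os",
    "dataset-loaders": "dataset",
}


def image_name(path: str) -> str:
    """Map '<category>/<name>' to '<org>/<prefix>-<name>'."""
    category, name = path.strip("/").split("/", 1)
    # Default rule: drop a trailing 's' from the category directory.
    singular = category[:-1] if category.endswith("s") else category
    prefix = OVERRIDES.get(category, singular)
    return f"{ORG}/{prefix}-{name}"


if __name__ == "__main__":
    for directory in ("model-trainers/huggingface", "servers/basaran", "bases/ubuntu"):
        print(directory, "->", image_name(directory))
```

The need for an override table is itself a hint that the nested layout carries a small amount of naming indirection.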

samos123 commented 1 year ago

I'd rather keep it simple, without nested directories for container image directories:

substratus-images:
  base (helpful for HF loader which doesn't need many dependencies and keeps image small)
  base-gpu (maybe this is our only base)
  model-loader-huggingface # image substratusai/model-loader-huggingface
  model-trainer-huggingface
  dataset-loader-k8s-instruct
  dataset-loader-squad

Every directory under substratus-images should be a directory with a Dockerfile. By convention, the resulting image will be substratusai/{DIRECTORY_NAME}.
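
A minimal sketch of that convention, assuming a local checkout of the repo: every top-level directory that contains a Dockerfile maps to an image named after the directory. The function name `discover_images` and the `org` default are illustrative assumptions, not existing tooling.

```python
from pathlib import Path


def discover_images(repo_root, org="substratusai"):
    """Map each top-level directory containing a Dockerfile to its image name."""
    root = Path(repo_root)
    return {
        entry.name: f"{org}/{entry.name}"
        for entry in sorted(root.iterdir())
        # Directories without a Dockerfile (docs, scripts, ...) are skipped.
        if entry.is_dir() and (entry / "Dockerfile").is_file()
    }


if __name__ == "__main__":
    # Print "<directory> -> <image>" for the current checkout.
    for directory, image in discover_images(".").items():
        print(f"{directory} -> {image}")
```

With a flat layout, no override table is needed: the directory name is the image name.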

What do you all think?

brandonjbjelland commented 1 year ago

☝️ That's perfectly good by me.

nstogner commented 1 year ago

I am good with the monorepo; I would name it github.com/substratusai/images.git. I prefer to use a single base image for GPU and non-GPU workloads. This might actually help speed things up b/c of base layer caching.

nstogner commented 1 year ago

Started on this instead of updating other repos in place

nstogner commented 1 year ago

NOTE: images.git is currently dependent upon https://github.com/substratusai/substratus/pull/109

samos123 commented 1 year ago

This is mostly done, except for using the same base image for all. Let's track image-specific issues in the image repo going forward. Closing the issue here.