instructlab / GPTDolomite

Apache License 2.0
1 stars 3 forks source link

Provide context for this repository #4

Open russellb opened 3 weeks ago

russellb commented 3 weeks ago

I can't find any explanation for why this repository was necessary. Maybe it's in a PR somewhere? The README points to the original repo, but doesn't explain why a fork is necessary.

The typical process for adopting a new repository under InstructLab is a proposal to the instructlab/dev-docs repository and getting it approved by the Oversight Committee. This shouldn't block any work, as all technical work could be done in a personal fork of the repo in the meantime.

Some comments in this issue with some background would help me provide guidance for what to do from here.

russellb commented 3 weeks ago

The output to resolve this issue I'd like to see at a minimum:

  1. an update to the README that explains the differences in this fork vs the upstream repo.
  2. what it would take for this fork to go away and move back to using the upstream repo
  3. some explanation of the process used to ensure we stay up to sync with the upstream repo where appropriate
mayank31398 commented 3 weeks ago
  1. Upstream repo has a training framework for both pretraining and finetuning
  2. Upstream repo supports a lot more model architectures and systems optimizations
  3. All of this is not needed for ILAB
  4. The current fork only has the relevant functionality (GPTDolomiteModel class) to use the Padding Free transformer optimization.
russellb commented 3 weeks ago
  1. Upstream repo has a training framework for both pretraining and finetuning
  2. Upstream repo supports a lot more model architectures and systems optimizations
  3. All of this is not needed for ILAB
  4. The current fork only has the relevant functionality (GPTDolomiteModel class) to use the Padding Free transformer optimization.

Only needing a subset of a library is incredibly common. That doesn't justify a fork.

I see from https://github.com/instructlab/training/pull/55 that a blocking issue was that the original repo does not have published releases on pypi, which is a blocker for instructlab. That would be a reason for a temporary fork, at least, if we were unable to get the original library released quick enough for our needs. If it's just that, the fork could go away as soon as the upstream library makes its own releases.

I assume there's still more to it that I'm missing?

RobotSail commented 3 weeks ago

We need to publish GPTDolomite as a package and consume it in training. Since the dolomite-engine repo owned by IBM has more than just the model, we pulled out the model and created it in this repo.

russellb commented 3 weeks ago

We need to publish GPTDolomite as a package and consume it in training. Since the dolomite-engine repo owned by IBM has more than just the model, we pulled out the model and created it in this repo.

but if that repo had its own published releases, would that have solved the main issue?

or did the extra features in the upstream repo cause a problem of some kind? If so, are those problems tracked somewhere?

aldopareja commented 3 weeks ago

Let me layout a couple more reasons why we need this:

  1. the upstream repository is an internal tool used by the pretraining team at IBM and they won't keep maintenance for our needs. We carved this out because otherwise maintainability will be a nightmare.
  2. We won't be modifying this in the time being, it's already a pretty robust pipeline.
  3. it makes it easier to expose this to the community, and they can contribute other import functions to build all kinds of models (all the building blocks are here), which gives a 20% boost in training.
  4. it goes well with the philosophy of the training library, which should be a small form factor and extremely fast library (throughput-wise) where everything is exposed enough that it is easy to modify by anyone upstream. Not having this repository would not allow for that.

We will be including more documentation on what this repo is and how it fits into the training repository itself.

russellb commented 3 weeks ago

OK - my takeaway is that the upstream project is not usable by InstructLab:

So, we had no choice but to fork it and maintain our own version of the subset needed by InstructLab?

Is that about right?

RobotSail commented 3 weeks ago

Long-term we should come up with a better solution but for RHEL AI GA this will be fine.

russellb commented 3 weeks ago

no suggestion for an immediate change -- I just wanted to make sure the context was recorded so the reason for the fork is clear to those that come behind us. I'll propose a README update as a resolution to this issue.