unum-cloud / uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
https://unum-cloud.github.io/uform/
Apache License 2.0

[Refactor] Modular package organisation, pre-commit linting suite #58

Closed · lmmx closed this 6 months ago

lmmx commented 6 months ago

I wanted to review the code here to see how it worked, and I did some refactoring so I could read it more easily.

$ tree src/uform
src/uform
├── chat.py
├── gen_model.py
├── __init__.py
├── models
│   ├── encoders
│   │   ├── __init__.py
│   │   ├── network_layers.py
│   │   ├── text
│   │   │   ├── block.py
│   │   │   ├── encoder.py
│   │   │   └── __init__.py
│   │   └── visual
│   │       ├── block.py
│   │       ├── encoder.py
│   │       └── __init__.py
│   ├── image_utils.py
│   ├── __init__.py
│   ├── triton.py
│   └── vlm.py
└── setup_model.py

4 directories, 16 files
ashvardanian commented 6 months ago

Looks very nice, thank you @lmmx! I am pretty sure there will be collisions with #57, so give us a bit of time to find the optimal way to merge it 🤗

In the meantime, if you have any other recommendations about this or other repos I maintain (like SimSIMD), please let us know!

lmmx commented 6 months ago

Ah, I just got the benchmarks to run and it looks like I made a mistake: I'm getting a warning from the bench script (whereas there is no warning from the PyPI-packaged version).

Click to show warning

```
(uform) louis 🌟 ~/lab/uform/uform $ python scripts/bench.py
UForm-Gen
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3.86it/s]
Some weights of the model checkpoint at unum-cloud/uform-gen were not used when initializing VLMForCausalLM: ['image_encoder.blocks.1.ls2.weight', 'image_encoder.blocks.9.ls1.weight', 'image_encoder.blocks.4.ls2.weight', 'image_encoder.blocks.4.ls1.weight', 'image_encoder.blocks.1.ls1.weight', 'image_encoder.blocks.7.ls1.weight', 'image_encoder.blocks.6.ls2.weight', 'image_encoder.blocks.10.ls2.weight', 'image_encoder.blocks.3.ls1.weight', 'image_encoder.blocks.0.ls1.weight', 'image_encoder.blocks.5.ls1.weight', 'image_encoder.blocks.6.ls1.weight', 'image_encoder.blocks.8.ls1.weight', 'image_encoder.blocks.2.ls2.weight', 'image_encoder.blocks.11.ls2.weight', 'image_encoder.blocks.3.ls2.weight', 'image_encoder.blocks.2.ls1.weight', 'image_encoder.blocks.7.ls2.weight', 'image_encoder.blocks.11.ls1.weight', 'image_encoder.blocks.10.ls1.weight', 'image_encoder.blocks.0.ls2.weight', 'image_encoder.blocks.9.ls2.weight', 'image_encoder.blocks.5.ls2.weight', 'image_encoder.blocks.8.ls2.weight']
- This IS expected if you are initializing VLMForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing VLMForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of VLMForCausalLM were not initialized from the model checkpoint at unum-cloud/uform-gen and are newly initialized: ['image_encoder.blocks.0.ls1.gamma', 'image_encoder.blocks.1.ls2.gamma', 'image_encoder.blocks.5.ls2.gamma', 'image_encoder.blocks.2.ls1.gamma', 'image_encoder.blocks.7.ls2.gamma', 'image_encoder.blocks.1.ls1.gamma', 'image_encoder.blocks.3.ls2.gamma', 'image_encoder.blocks.4.ls2.gamma', 'image_encoder.blocks.3.ls1.gamma', 'image_encoder.blocks.10.ls2.gamma', 'image_encoder.blocks.2.ls2.gamma', 'image_encoder.blocks.0.ls2.gamma', 'image_encoder.blocks.6.ls1.gamma', 'image_encoder.blocks.7.ls1.gamma', 'image_encoder.blocks.10.ls1.gamma', 'image_encoder.blocks.6.ls2.gamma', 'image_encoder.blocks.5.ls1.gamma', 'image_encoder.blocks.8.ls2.gamma', 'image_encoder.blocks.9.ls1.gamma', 'image_encoder.blocks.4.ls1.gamma', 'image_encoder.blocks.8.ls1.gamma', 'image_encoder.blocks.9.ls2.gamma', 'image_encoder.blocks.11.ls2.gamma', 'image_encoder.blocks.11.ls1.gamma']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```

Will put this back to draft until I find the cause. I'll take a look, thanks Ash!

Update: it was the LayerScale class in gen_model.py. I presumed it could be substituted for the other definition, but seemingly not! ab47953 (the checkpoint at unum-cloud/uform-gen expects a layer parameter named weight rather than gamma). I put it in the same dataclass form in 79872fd so at least the two classes are similar (maybe worth renaming them to distinguish them explicitly). Returned this PR to ready for review.
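For readers hitting the same warning: a minimal sketch of why this happens, with assumed key names based on the log above. PyTorch's `load_state_dict` matches checkpoint entries to module parameters by their fully qualified names, so a checkpoint saved with `...ls1.weight` cannot populate a LayerScale that declares its parameter as `gamma`; the set difference in each direction produces exactly the "not used" and "newly initialized" lists in the warning.

```python
# Hypothetical illustration (single block only) of state_dict name matching.
# Keys are taken from the warning log; the matching logic mirrors what
# PyTorch does when strict loading is relaxed.

checkpoint_keys = {"image_encoder.blocks.0.ls1.weight"}  # as saved by unum-cloud/uform-gen
model_keys = {"image_encoder.blocks.0.ls1.gamma"}        # refactored LayerScale parameter name

# Checkpoint entries with no matching parameter -> "were not used" warning.
unexpected = checkpoint_keys - model_keys

# Parameters with no matching checkpoint entry -> "newly initialized" warning.
missing = model_keys - checkpoint_keys

print(unexpected)  # {'image_encoder.blocks.0.ls1.weight'}
print(missing)     # {'image_encoder.blocks.0.ls1.gamma'}
```

Keeping the parameter named `weight` (as in the fix above) makes both sets empty, so the checkpoint loads silently.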

ashvardanian commented 6 months ago

Hi @lmmx! Thank you for the contribution!

We've looked into it with a team and can't merge it in its entirety right now, as it will introduce structural differences between public UForm and our private training repositories.

That said, the CI upgrades look interesting! Would you be open to reverting the structural changes and keeping just the CI?

Thanks again!

lmmx commented 6 months ago

Fair enough, sure, I'll close this PR and make a fresh one (#62) :inbox_tray: