tlh24 / cortex

Apache License 2.0

getting oriented #2

Open ANaka opened 1 year ago

ANaka commented 1 year ago

Was able to get things nominally up and running based on the following:

To install: clone the repo, then cd cortex/ec3 (ec = explore-compress, the intellectual lineage of DreamCoder, which was ec2 in their repo). ./install-deps.sh should build the OCaml executable. Run it via:

```
./run.sh -b 512 -g -p
```

- -b : batch size (change based on your GPU memory)
- -g : (optional) debug logging
- -p : parallel (defaults to assuming there are ~16 cores; I should make that a parameter. Turn it off when debugging.)

Training: in a separate terminal:

```
cd cortex/ec3
python ec33.py -b 512
```

This will start training. The batch size needs to match the one passed to run.sh.

Dreaming: once it writes out a model, you can start dreaming in yet another terminal:

```
python ec33.py -b 512 -d
```

where -d : dreaming.

You can monitor training progress in yet another terminal:

```
python plot_losslog.py -b 512
```

(window output: assumes running locally)

At present, the dreams don't directly feed back into the training; that's what I'm working on now. But this is enough for you to poke around!

Probably going to have some high-level questions about how data is getting passed around between the processes here, but I want to poke at it a little first...

Two quick ones to help orient me:

tlh24 commented 1 year ago

Right, I'm glad that you got it working on your setup / AWS! Definitely need some architectural diagrams to show how data is passed around; I'll work on that shortly. In the meantime:

Exposition: the Python model ec33.py takes a specification in terms of three images, A, B, and B-A, and uses this to output a series of edits to a small program A. As you have likely seen, it's a ViT + token transformer, encoder-only, based on CLIP. The production of edits is fully supervised, a la UDRL (upside-down reinforcement learning).
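As a toy illustration of what "fully supervised" means here (hypothetical edit vocabulary and logits, not the actual ec33.py model): the network scores each possible edit token per step, and training minimizes cross-entropy against the ground-truth edit sequence taken from the data, rather than from rollouts.

```python
import math

# Hypothetical edit vocabulary; the real model's token set differs.
EDIT_VOCAB = ["insert", "delete", "substitute", "<eos>"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def edit_sequence_loss(logit_rows, target_edits):
    """Mean cross-entropy over a supervised edit sequence (UDRL-style:
    targets come from the dataset, not from the model's own samples)."""
    total = 0.0
    for logits, target in zip(logit_rows, target_edits):
        probs = softmax(logits)
        total -= math.log(probs[EDIT_VOCAB.index(target)])
    return total / len(target_edits)

# Dummy per-step logits, as if produced by the transformer's head.
logits = [[2.0, 0.1, 0.1, 0.1],   # confident "insert"
          [0.1, 0.1, 0.1, 2.0]]   # confident "<eos>"
loss = edit_sequence_loss(logits, ["insert", "<eos>"])
```

Matching targets yield a small loss; mismatched targets (e.g. "delete" in the first step) yield a larger one, which is what gradient descent pushes against.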

When 'dreaming', the model takes the three image specification, of which B can be an MNIST digit, and generates a new program. This program is added to the database based on criteria (TBD -- currently cosine similarity). Dreaming is also used to replace programs with their simpler equivalents.
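A minimal sketch of a cosine-similarity acceptance criterion (embeddings, threshold value, and function names are all made up here; per the above, the real criterion is still TBD): a dreamed program is kept only if its embedding is not too close to anything already stored.

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def should_add(candidate, database, threshold=0.95):
    """Keep a dreamed program only if no stored embedding is
    too similar (the threshold is an illustrative value)."""
    return all(cosine(candidate, e) < threshold for e in database)

db = [[1.0, 0.0], [0.0, 1.0]]
print(should_add([1.0, 0.05], db))   # near-duplicate of db[0] -> False
print(should_add([0.7, 0.7], db))    # dissimilar from both -> True
```

The same similarity machinery could also flag candidate replacements, where a dreamed program that is very close to an existing one but shorter swaps in as the simpler equivalent.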

The OCaml program manages the programs and outputs batches for Python. They communicate through sockets (python -> ocaml : "update batch") and mem-mapped files (ocaml -> python : new data). There is one mem-mapped file that communicates from python -> ocaml, for decoding the edits during the dreaming phase.
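To make the mem-mapped-file side concrete, here is a self-contained Python sketch of both ends of such a channel (the file name, size, and length-prefixed layout are invented for illustration; the actual ocaml<->python files are laid out differently):

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: a 4-byte little-endian length header followed
# by the raw payload bytes.
path = os.path.join(tempfile.mkdtemp(), "batch.mmap")
size = 4096

# Producer side: create the backing file, then write one record
# through the mapping.
with open(path, "wb") as f:
    f.truncate(size)
payload = b"edit tokens for one batch"
with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), size)
    mm[0:4] = struct.pack("<I", len(payload))
    mm[4:4 + len(payload)] = payload
    mm.flush()
    mm.close()

# Consumer side: map the same file read-only and decode the record.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
    (n,) = struct.unpack("<I", mm[0:4])
    record = bytes(mm[4:4 + n])
    mm.close()

print(record)
```

In the real system the small control messages ("update batch") go over the socket, while the bulk batch data moves through mappings like this, which avoids copying large tensors through the socket.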

Setup: for development, I have two gpus, correct.

However, I think it will run on one GPU just fine. The training/dreaming split makes two cards natural, but not necessary.

Python environment: I've been developing on Debian Bookworm with Python 3.10 (torch does not support 3.11) and CUDA 12. It's sufficiently aligned with Lambda Stack that I haven't had to touch the Python install when deploying there (which is infrequent, as my home computer tends to be faster than a virtualized 8x A100...)

However, I don't have strong opinions here, and would defer to better ideas (provided I can stay on OG Debian :)

Docker: yes, if you think it would be good? From my perspective, other features are higher priority.