Takes a TinyStories file and converts each word to a WordNet node.
Set up a virtualenv and install dependencies.
virtualenv .venv
. .venv/bin/activate
pip install -r requirements.txt
Then install the NLTK data (from a Python shell):
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
Download TinyStoriesV2-GPT4-train.txt and TinyStoriesV2-GPT4-valid.txt (from the roneneldan/TinyStories dataset on Hugging Face).
Run:
./wordnetify.py --database tinystories.sqlite --progress --file TinyStoriesV2-GPT4-valid.txt
Make a note of how long that took, because the next command will take 100x longer:
./wordnetify.py --database tinystories.sqlite --progress --file TinyStoriesV2-GPT4-train.txt
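To see why the resolution steps below are needed: most content words map to more than one WordNet synset, so wordnetify.py cannot always pick a single node on its own. A quick NLTK illustration of the ambiguity (just the idea, not the script's actual logic):

from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn

# Most content words have several candidate synsets ("bark" of a dog vs. of a tree).
for word in word_tokenize("The dog began to bark at the tree"):
    print(word, wn.synsets(word))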
Install https://ollama.com/, start it (ollama serve), and download a model (e.g. ollama pull llama3).
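resolve_multisynsets.py presumably asks the local model, via Ollama's HTTP API on port 11434, which sense each ambiguous word carries. A rough standalone sketch of such a call (the prompt is invented; the script's real prompts may differ):

import json, urllib.request

# Ollama's generate endpoint is available on localhost:11434 once `ollama serve` is running.
request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama3",
        "prompt": "In 'the dog began to bark', which WordNet sense of 'bark' is meant?",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.load(response)["response"])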
If you have a lot of time, run this:
./resolve_multisynsets.py --progress --database tinystories.sqlite
In reality, you will almost certainly need a cluster of machines for this to finish in any sensible length of time; the --congruent/--modulo options split the work across them (see the sketch below). If you have a cluster of 16 single-CPU/GPU machines, then on machine number 3 you would run
./resolve_multisynsets.py --database tinystories.sqlite --congruent 3 --modulo 16
You can use a smaller model, e.g. --model phi3
That might complete if you have a few months to run it.
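A toy illustration of how the --congruent/--modulo split presumably divides the work, so that machine number 3 of 16 only touches rows in its own residue class (not the script's actual code):

MODULO, CONGRUENT = 16, 3            # 16 machines; this one is number 3
work_items = range(1, 101)           # stand-in for the word rows in the database
mine = [i for i in work_items if i % MODULO == CONGRUENT]
print(mine)                          # [3, 19, 35, 51, 67, 83, 99] -- only this machine's share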
./resolve_multisynsets.py can also use Groq; you will need to give it a Groq API key. It is faster, but not fast enough (and not cheap enough).
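With Groq the per-word call presumably goes through their chat completions API, roughly as below (the model name and prompt are placeholders, and the key is taken from the GROQ_API_KEY environment variable, which may not be how the script expects to receive it):

from groq import Groq

client = Groq()                      # reads GROQ_API_KEY from the environment
reply = client.chat.completions.create(
    model="llama3-8b-8192",          # example model name; use whatever Groq currently serves
    messages=[{"role": "user",
               "content": "In 'the dog began to bark', which WordNet sense of 'bark' is meant?"}],
)
print(reply.choices[0].message.content)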
Set up an OpenAI API key. You can supply it on the command line, or else it will be read from ~/.openai.key by default.
./generate_multisynset_batch.py --database tinystories.sqlite \
--congruent 3 --modulo 1000 \
--output-file .batchfiles/batch-$(date +%F-%T).jsonl \
--limit 40000 --progress-bar \
--batch-id-save-file .batchid.txt
That will take every thousandth story, which is roughly the sampling rate we want. The output file path doesn't matter much, but it's nice to keep the batch files around. OpenAI doesn't like more than 40,000 records in one job. Saving the batch ID makes the follow-up commands easier.
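Each line of that .jsonl file is one request in OpenAI's Batch API envelope, so a line will look something like the following (the custom_id scheme, model and prompt are guesses; only the custom_id/method/url/body fields are fixed by the API):

{"custom_id": "word-123456", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "In 'the dog began to bark', which WordNet sense of 'bark' is meant?"}]}}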
./batchcheck.py --database tinystories.sqlite \
--only-batch $(< .batchid.txt) --monitor
That will keep track of that batch, show how it is progressing, and stop when it is complete.
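--monitor presumably just polls the Batch API; a one-off check with the openai Python package looks roughly like this (the package reads OPENAI_API_KEY from the environment, rather than the scripts' ~/.openai.key default):

from openai import OpenAI

client = OpenAI()                                  # needs OPENAI_API_KEY in the environment
batch = client.batches.retrieve(open(".batchid.txt").read().strip())
print(batch.status, batch.request_counts)          # e.g. in_progress, plus completed/failed/total counts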
./batchfetch.py --database tinystories.sqlite \
--progress-bar --report-costs
This stores the results back in the database.
A batch of 40,000 records takes about 100 minutes and costs about $1.75.
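The results come back as another JSONL file attached to the batch; a sketch of fetching it directly, which batchfetch.py presumably does before writing each answer into the database:

import json
from openai import OpenAI

client = OpenAI()
batch = client.batches.retrieve(open(".batchid.txt").read().strip())
for line in client.files.content(batch.output_file_id).text.splitlines():
    record = json.loads(line)                      # one JSON response per request
    print(record["custom_id"], record["response"]["status_code"])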
./make_wordnet_database.py --database tinystories.sqlite
Compile the ultrametric-trees programs to convert the TinyStories sentences into a giant dataframe of WordNet paths.