Open tanzir5 opened 4 days ago
`count_freq_vocab.py` does not work: `indices` is not defined. I dropped the vocab file from the export.
> `count_freq_vocab.py` does not work: `indices` is not defined. I dropped the vocab file from the export.
This is on me. I will fix it and will get it ready before our 2nd export.
@tanzir5 , I tried to run `pipeline.py` without MLM encoding for inference. Here is what I changed in the config:
"DATAPATH": "[homedir]/data/HandPicked/",
"SEQUENCE_PATH": "[homedir]/data/llm/gend_data/people_4mil_random_v1.parquet",
"VOCAB_PATH": "[homedir]/data/llm/gen_data/good_vocab_4mil_v1.csv",
"ENCODING_WRITE_PATH": "[homedir]/data/llm/gen_data/no_mlm_4mil/",
"DO_MLM": false,
[rest unchanged]
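Since most of the changed values are filesystem paths, and two of the directories differ by a single character (`gend_data` vs `gen_data`), a quick path sanity check before launching the pipeline can rule out a simple typo. A minimal sketch, assuming the config is plain JSON and that path-like keys end in `PATH`; the helper name and the demo config are hypothetical, not part of the actual pipeline:

```python
import json
import os
import tempfile
from pathlib import Path

def check_config_paths(config_file):
    """Return config keys ending in PATH whose value does not exist on disk."""
    cfg = json.loads(Path(config_file).read_text())
    return [k for k, v in cfg.items()
            if k.endswith("PATH") and not Path(str(v)).exists()]

# Demo with a throwaway config: one missing path, one existing directory.
with tempfile.TemporaryDirectory() as d:
    cfg = {"SEQUENCE_PATH": os.path.join(d, "missing.parquet"),  # does not exist
           "VOCAB_PATH": d,                                      # exists
           "DO_MLM": False}                                      # ignored (not a path)
    cfg_file = os.path.join(d, "config.json")
    Path(cfg_file).write_text(json.dumps(cfg))
    missing = check_config_paths(cfg_file)
    print(missing)  # ['SEQUENCE_PATH']
```

Running something like this before `pipeline.py` makes a path typo fail fast instead of surfacing mid-run.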
The code fails at line 169 of `pipeline.py`, in `encode_documents`:

    table = parquet_file.read_row_groups(...)
    pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs

It also happens when I do MLM encoding, and when I disable parallelism (`parallel = false` in the config). It happens when trying to load `people_4mil_random_v1.parquet`.
- [x] Flavio tries running `count_freq_vocab.py` on Nov 26.
- [x] Flavio makes a reservation for 2 GPU nodes for 2 days starting from Dec 2 morning (Monday).
- [ ] With the GPUs, the first priority is to get the inferences done.
- [ ] Flavio checks if everything is good for inference on Snellius by Nov 29.
- [ ] The second priority is to resume training for the small and medium2x models, as well as start training for the medium and large models.
- [ ] The third priority is to start training small and medium2x models using training data from `data-driven-approach`.
- [ ] The fourth priority is to do the same as the previous task but using the medium and large models.
- [ ] The fifth priority is to run inference using all four models from `data-driven-approach` (the previous two tasks).