Open tanzir5 opened 4 days ago
`count_freq_vocab.py` does not work: `indices` is not defined. I dropped the vocab file from the export.
> `count_freq_vocab.py` does not work: `indices` is not defined. I dropped the vocab file from the export.
This is on me. I will fix it and will get it ready before our 2nd export.
@tanzir5 , I tried to run `pipeline.py` without MLM encoding for inference. Here is what I changed in the config:
"DATAPATH": "[homedir]/data/HandPicked/",
"SEQUENCE_PATH": "[homedir]/data/llm/gend_data/people_4mil_random_v1.parquet",
"VOCAB_PATH": "[homedir]/data/llm/gen_data/good_vocab_4mil_v1.csv",
"ENCODING_WRITE_PATH": "[homedir]/data/llm/gen_data/no_mlm_4mil/",
"DO_MLM": false,
[rest unchanged]
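Since most of the changed values are filesystem paths, and two of the directories differ by a single character (`gend_data` vs `gen_data`), a quick path sanity check before launching the pipeline can rule out a simple typo. A minimal sketch, assuming the config is plain JSON and that path-like keys end in `PATH`; the helper name and the demo config are hypothetical, not part of the actual pipeline:

```python
import json
import os
import tempfile
from pathlib import Path

def check_config_paths(config_file):
    """Return config keys ending in PATH whose value does not exist on disk."""
    cfg = json.loads(Path(config_file).read_text())
    return [k for k, v in cfg.items()
            if k.endswith("PATH") and not Path(str(v)).exists()]

# Demo with a throwaway config: one missing path, one existing directory.
with tempfile.TemporaryDirectory() as d:
    cfg = {"SEQUENCE_PATH": os.path.join(d, "missing.parquet"),  # does not exist
           "VOCAB_PATH": d,                                      # exists
           "DO_MLM": False}                                      # ignored (not a path)
    cfg_file = os.path.join(d, "config.json")
    Path(cfg_file).write_text(json.dumps(cfg))
    missing = check_config_paths(cfg_file)
    print(missing)  # ['SEQUENCE_PATH']
```

Running something like this before `pipeline.py` makes a path typo fail fast instead of surfacing mid-run.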
The code fails at line 169 of `pipeline.py`, in `encode_documents`:

    table = parquet_file.read_row_groups(...)
    pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs

It also happens when I do MLM encoding, and when I disable parallelism (`parallel = false` in the config). It happens when trying to load `people_4mil_random_v1.parquet`.
- [x] Flavio tries running `count_freq_vocab.py` on Nov 26.
- [x] Flavio makes a reservation for 2 GPU nodes for 2 days starting from Dec 2 morning (Monday).
- [ ] With the GPUs, the first priority is to get the inferences done.
- [ ] Flavio checks if everything is good for inference on Snellius by Nov 29.
- [ ] The second priority is to resume training for the small and medium2x models, as well as start training for the medium and large models.
- [ ] The third priority is to start training small and medium2x models using training data from `data-driven-approach`.
- [ ] The fourth priority is to do the same as the previous task but using the medium and large models.
- [ ] The fifth priority is to run inference using all four models from `data-driven-approach` (the previous two tasks).