odissei-lifecourse / life-sequencing-dutch

MIT License

TODOs [Nov 25 to Dec 3] #118

Open tanzir5 opened 4 days ago

tanzir5 commented 4 days ago
f-hafner commented 3 days ago

count_freq_vocab.py does not work: indices is not defined. I dropped the vocab file from the export.

tanzir5 commented 3 days ago

> count_freq_vocab.py does not work: indices is not defined. I dropped the vocab file from the export.

This is on me. I will fix it and have it ready before our second export.

f-hafner commented 13 hours ago

@tanzir5, I tried to run pipeline.py without MLM encoding for inference. What I did

The code fails with

in line 169 of pipeline.py, in encode_documents
table = parquet_file.read_row_groups(...)
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs

It also happens when I do MLM encoding, and when I do not run in parallel (parallel = false in the config). It happens when trying to load people_4mil_random_v1.parquet.