Open tanzir5 opened 2 months ago
@f-hafner This is more of a Lucas task, but you can take a stab at it too if you find it interesting.
Yeah; I think it would be great if we can keep all input data after Lucas' processing in one place. It seems to me they're scattered around at the moment.
Discussion with Lucas yielded the following options:
We need to get better data for the language model. This spreadsheet explains what we are utilizing currently: https://docs.google.com/spreadsheets/d/1JzdWpDeB5gWeXaw99akgy8virZq7ERiS6WcZKXKNf0k/edit#gid=0
We need to figure out which files from here are not too expensive (a question for Tom) and might be good for the LM to train on: https://www.cbs.nl/en-gb/our-services/customised-services-microdata/microdata-conducting-your-own-research/microdata-catalogue