Currently we first build the list of already-processed texts, and only then start parsing. As a consequence, if a text finishes processing while a long list of texts is being parsed, the snapshot does not know about it and we reparse it anyway. The check should happen at parse time instead of against a precomputed list.
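A minimal sketch of the lazy check, under assumptions: the helper names (`is_done`, `parse_one`), the `data/<text_id>/parsed.json` layout, and even the signature of `parse_many` are hypothetical here, not taken from the codebase.

```python
from pathlib import Path

DATA_DIR = Path("data")  # assumed layout: one directory per text


def is_done(text_id: str) -> bool:
    # Checked at parse time rather than from a precomputed list,
    # so texts finished during a long run are not reparsed.
    return (DATA_DIR / text_id / "parsed.json").exists()


def parse_one(text_id: str):
    # Placeholder for the real parser.
    return {"id": text_id}


def parse_many(text_ids):
    results = {}
    for text_id in text_ids:
        if is_done(text_id):
            continue  # skip texts already processed, even mid-run
        results[text_id] = parse_one(text_id)
    return results
```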
Once this is fixed, we could run parse_many in parallel (for instance one batch per year, to parallelize the data collection).
The complication is that at that point we only have the URL, so we need to identify the corresponding directory.
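The per-year parallelization could look like the sketch below. It assumes `parse_many` accepts a list of URLs (a stub stands in for the real one) and that each input arrives as a `(year, url)` pair; both assumptions, like the `parse_by_year` name, are illustrative only. The URL-to-directory mapping mentioned above is left out, since it depends on the actual layout.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def parse_many(urls):
    # Stub standing in for the real parse_many (assumed signature).
    return {url: {"url": url} for url in urls}


def parse_by_year(entries):
    """Group (year, url) pairs by year, then run parse_many per group in parallel."""
    by_year = defaultdict(list)
    for year, url in entries:
        by_year[year].append(url)
    results = {}
    with ThreadPoolExecutor() as pool:
        # Submit one parse_many job per year.
        futures = {year: pool.submit(parse_many, urls)
                   for year, urls in by_year.items()}
        for year, future in futures.items():
            results[year] = future.result()
    return results
```

A process pool would be the natural choice if parsing is CPU-bound; a thread pool is shown here only to keep the sketch simple.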
Bonus: the redundant "skip_already_done" option of "format_data_for_frontend" is never used and should be removed.