regardscitoyens / the-law-factory-parser

Data generator for the-law-factory project
https://www.lafabriquedelaloi.fr
GNU General Public License v3.0

Identify texts already_done on the fly #104

Closed boogheta closed 5 years ago

boogheta commented 6 years ago

Currently we first build the list of texts that are already done, and only then start parsing. As a consequence, if a text gets processed elsewhere while a long list of texts is still being parsed, we will reparse it anyway.

Once this is done, we could run parse_many in parallel (for instance one run per year, to parallelize the data collection).
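To make the per-year idea concrete, here is a minimal sketch. It assumes the urls are already grouped by year; `parse_by_year`, `urls_by_year` and `parse_batch` are hypothetical names, and `parse_batch` only stands in for however parse_many ends up being invoked (its real interface is not shown here).

```python
from concurrent.futures import ThreadPoolExecutor


def parse_by_year(urls_by_year, parse_batch, max_workers=4):
    """Run `parse_batch(urls)` once per year, concurrently.

    `parse_batch` is a hypothetical stand-in for a parse_many-style call.
    Returns a dict mapping each year to what `parse_batch` returned for it.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per year; all tasks are submitted before any result is read.
        futures = {year: pool.submit(parse_batch, urls)
                   for year, urls in urls_by_year.items()}
        return {year: future.result() for year, future in futures.items()}
```

Threads are used here only to keep the sketch short; separate processes or separate command-line runs per year would follow the same shape.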

The complication is that, at that point, we only have the URL, so we need to identify the corresponding directory.
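A rough sketch of what the on-the-fly check could look like. Everything here is an assumption made for illustration: the layout (one subdirectory per finished text, each holding a `done.json` marker with a `"url"` key) and the names `find_done_directory`, `parse_pending` and `parse_one` are hypothetical, not the project's actual structure or API.

```python
import json
import os


def find_done_directory(url, output_dir, marker="done.json"):
    """Return the directory whose marker file records `url`, or None.

    Hypothetical layout: each finished text has its own subdirectory of
    `output_dir` containing a JSON marker file with a "url" key.
    """
    if not os.path.isdir(output_dir):
        return None
    for name in sorted(os.listdir(output_dir)):
        marker_path = os.path.join(output_dir, name, marker)
        try:
            with open(marker_path) as f:
                data = json.load(f)
        except (OSError, ValueError):
            # Missing or unreadable marker: treat this directory as not done.
            continue
        if isinstance(data, dict) and data.get("url") == url:
            return os.path.join(output_dir, name)
    return None


def parse_pending(urls, output_dir, parse_one):
    """Parse each url unless it is already done at the moment we reach it."""
    parsed = []
    for url in urls:
        # Checked per url, right before parsing, instead of once up front.
        if find_done_directory(url, output_dir) is not None:
            continue
        parse_one(url, output_dir)
        parsed.append(url)
    return parsed
```

Checking right before each parse narrows the window in which a text can be parsed twice, but two parallel runs can still pick up the same url at the same instant; closing that gap would need some form of lock or claim file.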

Bonus: there is a redundant "skip_already_done" option in "format_data_for_frontend" that is never used and should be removed.