Add aggregators for all missing data stages

We now have generated data for several data stages for which we cannot compute manifests yet. This issue lists all the data stages that need a new aggregating function; each one should be added to compute_manifest.py as an option.

The data stages are the following:

[x] text-reuse (based on the text-reuse passages)
[x] news-agencies (same schema as entities, they will have exactly the same keys and same aggregations, just a different manifest name)
[x] topics (based on s3://42-processed-data-final/topics/)
[ ] article embeddings (based on s3://42-processed-data-final/embeddings/articles)
[ ] page embeddings (based on s3://42-processed-data-final/embeddings/pages)
[ ] linguistic processing (based on s3://42-processed-data-final/lingproc)
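
Since news-agencies share the schema and aggregations of entities, one way to wire this up is a small registry that maps each data stage to its aggregating function and lets both stages point at the same one. This is only a sketch: `AGGREGATORS`, `sum_counts`, `aggregate` and the count keys in the usage note are hypothetical names, not what compute_manifest.py currently contains.

```python
from collections import Counter
from typing import Callable, Dict, Iterable

# Assumption: each processed item yields a flat dict of numeric counts.
Record = Dict[str, int]
Aggregator = Callable[[Iterable[Record]], Dict[str, int]]


def sum_counts(records: Iterable[Record]) -> Dict[str, int]:
    """Sum every count key across all records."""
    total: Counter = Counter()
    for record in records:
        total.update(record)  # Counter.update adds counts key by key
    return dict(total)


# Hypothetical registry: news-agencies reuses the entities aggregator,
# only the manifest name (the registry key) differs.
AGGREGATORS: Dict[str, Aggregator] = {
    "entities": sum_counts,
    "news-agencies": sum_counts,
}


def aggregate(stage: str, records: Iterable[Record]) -> Dict[str, int]:
    """Dispatch to the aggregator registered for the given data stage."""
    if stage not in AGGREGATORS:
        raise ValueError(f"No aggregator implemented for data stage {stage!r}")
    return AGGREGATORS[stage](records)
```

For instance, `aggregate("news-agencies", [{"mentions": 1}, {"mentions": 2}])` sums the hypothetical `mentions` key, while a stage that is still unchecked in the list above raises a `ValueError`.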

Note that data stages which are part of data processing can have multiple versions which all stem from the same input data and were generated at the same time, simply with different models or parameters. To prevent confusion inside the impresso-data-release repository, the full s3 partition inside the bucket will therefore also be used as the path within the repo. For example, topics have three different types of outputs, for French, English and German, each in its own s3 partition (e.g. s3://42-processed-data-final/topics/tm-de-all-v2024.08.29/). The relative path within the git repo for this generated manifest would then be: data-processing/topics/tm-de-all-v2024.08.29/topics_v*-*-*.json.
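
The mapping from s3 partition to repo path could be sketched as follows. The function name `default_repo_dir` and the `group` parameter are assumptions made for illustration; only the example input and the resulting directory come from the description above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def default_repo_dir(s3_partition: str, group: str = "data-processing") -> str:
    """Derive the repo-relative directory from an s3 partition URI.

    The bucket name is dropped and the rest of the partition is kept as is,
    prefixed with the (assumed) top-level folder for the data stage group.
    """
    parsed = urlparse(s3_partition)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an s3 URI: {s3_partition!r}")
    partition = parsed.path.strip("/")
    if not partition:
        raise ValueError(f"No partition inside the bucket: {s3_partition!r}")
    return str(PurePosixPath(group) / partition)


print(default_repo_dir("s3://42-processed-data-final/topics/tm-de-all-v2024.08.29/"))
```

The manifest file itself (topics_v*-*-*.json) would then be written inside the returned directory.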

Optionally, it will be possible to define this relative path explicitly (to make it simpler, for example). It will then be necessary to pay attention to the value used for this git relative path and to make sure it stays consistent from one run to the next; if no value is given, it defaults to the s3 partition.
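
A minimal sketch of that optional override with a consistency check is given below. All names (`resolve_repo_dir`, `default_dir`, `override`, `previously_used`) are hypothetical, and raising an error on a mismatch is just one possible choice; a warning could fit equally well.

```python
from typing import Optional


def resolve_repo_dir(
    default_dir: str,
    override: Optional[str] = None,
    previously_used: Optional[str] = None,
) -> str:
    """Pick the git-relative directory for a manifest.

    default_dir:     path derived from the s3 partition (used when no override is given)
    override:        optional user-defined relative path
    previously_used: path used for the last manifest of this data stage, if known
    """
    chosen = (override if override is not None else default_dir).strip("/")
    if previously_used is not None and chosen != previously_used.strip("/"):
        # Assumption: an inconsistent path is treated as an error here.
        raise ValueError(
            f"Repo path {chosen!r} differs from the previously used "
            f"{previously_used.strip('/')!r}"
        )
    return chosen


# Default: falls back to the s3 partition based path.
print(resolve_repo_dir("data-processing/topics/tm-de-all-v2024.08.29"))
# Override: a shorter, user-defined path.
print(resolve_repo_dir("data-processing/topics/tm-de-all-v2024.08.29", override="topics/de"))
```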

impresso / impresso-essentials

Add aggregators for all missing data stages #8