Open piconti opened 1 month ago
Embeddings are not ready to be versioned yet, and neither is linguistic processing (only some titles are present on S3). As a result, I'll merge the current branch, which contains some bugfixes and new aggregators, and will repeat the process when adding the embeddings aggregators.
We now have generated data for several data stages for which we can't compute manifests yet. Hence, this issue aims at listing all the data stages for which new aggregating functions should be implemented and added to `compute_manifest.py` as an option.

The types are the following:
- `s3://42-processed-data-final/topics/`
- `s3://42-processed-data-final/embeddings/articles`
- `s3://42-processed-data-final/embeddings/pages`
- `s3://42-processed-data-final/lingproc`

Note that for all data stages which are part of data processing, we can have multiple versions which all stem from the same input data and have been generated at the same time, simply with different models or parameters. As a result, to prevent confusion inside the impresso-data-release repository, the full S3 partition inside the bucket will also be used as the path within the repo.

E.g., topics have three different types of outputs (French, English and German), each in its own S3 partition (e.g. `s3://42-processed-data-final/topics/tm-de-all-v2024.08.29/`). The relative path within the git repo for this generated manifest would then be `data-processing/topics/tm-de-all-v2024.08.29/topics_v*-*-*.json`.

Optionally, it will be possible to define this relative path explicitly (to make it simpler, for example). It will then be necessary to pay attention to the value used for this git relative path and make sure it stays consistent from one run to the next, keeping in mind that it defaults to the S3 partition.
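
To make the default explicit, here is a minimal sketch of how the repo path could be derived from the S3 partition. It is not the actual code in `compute_manifest.py`: the function names, the `override` parameter and the `REPO_STAGE_ROOT` constant are hypothetical, and the `data-processing` prefix and `topics_v*-*-*.json` naming are only inferred from the example above. The override is assumed here to be the full relative directory within the repo.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Assumption: prefix taken from the example path in this issue.
REPO_STAGE_ROOT = "data-processing"


def manifest_repo_dir(s3_partition: str, override: Optional[str] = None) -> str:
    """Return the relative directory of a manifest within the git repo.

    Defaults to the S3 partition (the key prefix inside the bucket) unless
    an explicit relative path is given (hypothetical `override` option).
    """
    if override:
        return override.strip("/")
    parsed = urlparse(s3_partition)
    if parsed.scheme != "s3":
        raise ValueError(f"Not an S3 URI: {s3_partition!r}")
    # Drop the bucket name (netloc) and keep only the partition inside it.
    partition = parsed.path.strip("/")
    return str(PurePosixPath(REPO_STAGE_ROOT) / partition)


def manifest_repo_path(
    s3_partition: str, stage: str, version: str, override: Optional[str] = None
) -> str:
    """Build the full relative path, e.g. `<dir>/topics_v1-0-0.json`.

    `version` is expected in the dash-separated form used in the file name
    pattern (`*-*-*`); the exact versioning scheme is an assumption.
    """
    directory = manifest_repo_dir(s3_partition, override)
    return f"{directory}/{stage}_v{version}.json"


if __name__ == "__main__":
    print(
        manifest_repo_path(
            "s3://42-processed-data-final/topics/tm-de-all-v2024.08.29/",
            stage="topics",
            version="1-0-0",  # made-up version, for illustration only
        )
    )
```

Since the override replaces the default entirely, the same value has to be passed on every run for a given partition; otherwise successive manifests for the same data would land in different directories of the repo.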