Background: To make the pipeline's performance claims easy to validate, and to show validation results in the documentation, we need to store the output of the pipeline (and potentially of the individual algorithms) on the TVS dataset.
Requirements:
The data should be available without credentials
The data should be available to the RTD build pipeline
It should be easy to download the data
It should be easy to check if results have been updated for some reason
Proposal:
We store the results as PLAIN TEXT JSON FILES in a different Git repo.
This way they can be easily downloaded (we can use pooch for automatic downloads).
Having them as plain text allows us to easily diff results (e.g. when updating an algorithm).
Having them in Git/GitHub means we don't have to maintain a separate system, and we get the ability to version the data if required.
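A minimal sketch of how the pooch-based download could look, assuming the results repo is public and served through GitHub's raw file URLs. The cache name `tvs-validation-results`, the `<org>/<results-repo>` placeholder, and the helper names are hypothetical and only for illustration. Because the registry stores a SHA256 hash per file, a changed result file shows up as a hash mismatch, which also covers the "check if results have been updated" requirement.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA256 digest of a file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_registry(results_dir: Path) -> dict[str, str]:
    """Map every JSON file below results_dir to its 'sha256:<digest>' entry.

    Keys are POSIX-style paths relative to results_dir, so they can be
    appended to the base URL of the results repo.
    """
    return {
        p.relative_to(results_dir).as_posix(): f"sha256:{sha256_of(p)}"
        for p in sorted(results_dir.rglob("*.json"))
    }


def make_fetcher(registry: dict[str, str], version: str = "main"):
    """Create a pooch fetcher for the (hypothetical) results repo."""
    import pooch  # third-party; imported lazily so the helpers above work without it

    return pooch.create(
        # Local cache folder; the project name is an assumption.
        path=pooch.os_cache("tvs-validation-results"),
        # Placeholder URL: replace <org>/<results-repo> with the real repo.
        base_url=f"https://raw.githubusercontent.com/<org>/<results-repo>/{version}/",
        registry=registry,
    )
```

With a real base URL, `make_fetcher(registry).fetch("some/file.json")` would download the file on first use, verify its hash, and return the local path. Pinning `version` to a tag or commit instead of a branch would keep the documentation build reproducible.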
Considerations:
Why not in the same Git repo? - While the data is not huge, I don't want to slow down clone times for people who just want to work on the code.
Will the results not be too large for Git? - I don't think so. We are talking about a couple of MB per algorithm block, which should be within the limits of what GitHub allows (100 MB per file, 5 GB overall).
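To keep diffs small when an algorithm changes, and to sanity-check the size estimate, the results could be written in a deterministic form. The following is only a sketch: the function names are hypothetical, and the limit constant simply encodes the 100 MB per-file figure quoted above.

```python
import json

# Assumption: per-file limit taken from the figure quoted in the text.
PER_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def dumps_stable(results: dict) -> str:
    """Serialize results so that equal content always yields equal text.

    Sorted keys and fixed indentation mean a diff only shows values
    that actually changed, not reordered keys.
    """
    return json.dumps(results, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


def fits_per_file_limit(text: str, limit: int = PER_FILE_LIMIT_BYTES) -> bool:
    """Check the UTF-8 encoded size of a serialized result against the limit."""
    return len(text.encode("utf-8")) < limit
```

Writing every result file through one such function means that re-running an unchanged algorithm produces no diff at all in the results repo.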