Open jacobdunefsky opened 2 years ago
After meeting with Mario, we have a better understanding of the application-level data format used by Rucio with CERN. The next step is to design a JSON schema that both:
The attached files provide a view of my current thoughts re: a new data model. The file "rucio-non-alto.json" is an example of Rucio's current data format; the file "alto-rucio.json" represents the same data under the proposed new format. The idea is that the latter file would be what is returned by an ALTO server.
The new format is based on RFC 8189. The main new feature beyond that is the params-tuple ordered dict. params-tuple specifies the dimensions of a multidimensional array in which each datapoint will be returned. For instance, in the example, consider
"cost-metric": "queued",
"params-tuple": [
    {"stage": ["Production Input", "Production Output", "total"]},
    {"unit": ["bytes", "files"]}
]
If we pass this value of params-tuple, then data will be returned as a multidimensional array, where the first dimension represents the stage that we are looking at (e.g. amount queued at stage Production Input), and the second dimension represents the unit of the data. Thus, an example response might be
[ [4597740396, null], [null, 1], [4597740396, 1] ]
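To make the indexing concrete, here is a minimal sketch of how a client might read one cell of such a response by label. The helper names `axes_of` and `lookup` are hypothetical and not part of any ALTO or Rucio library; JSON null is represented as Python `None`.

```python
def axes_of(params_tuple):
    """Return (name, labels) for every list-valued dimension, in order."""
    return [
        (name, values)
        for dim in params_tuple
        for name, values in dim.items()
        if isinstance(values, list)
    ]


def lookup(data, params_tuple, **labels):
    """Walk the nested array one axis at a time, selecting by label."""
    node = data
    for name, values in axes_of(params_tuple):
        node = node[values.index(labels[name])]
    return node


params_tuple = [
    {"stage": ["Production Input", "Production Output", "total"]},
    {"unit": ["bytes", "files"]},
]
response = [[4597740396, None], [None, 1], [4597740396, 1]]

# Number of files queued across all stages.
total_files = lookup(response, params_tuple, stage="total", unit="files")
print(total_files)
```

The order of the entries in params-tuple determines the nesting order of the array, which is why the dimensions have to be an ordered structure.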
Additionally, note that some of the dimensions of params-tuple can have cardinality 1. This is useful for specifying constraints on the data. As an example, consider
"cost-metric": "throughput",
"params-tuple": [
    {"percentile": 95},
    {"unit": "mbps"},
    {"measurement-interval": ["1h", "1d", "1w"]}
]
Note that the first two dimensions are associated with scalars rather than arrays. As a result, the output is constrained to have unit "mbps" and to be measured at the 95th percentile of the data, but neither constraint adds a dimension to the output array, which therefore has a single axis of length 3 (one entry per measurement interval):
[ 1.9, 0.5, 10.12 ]
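A small sketch of the rule above, assuming that a scalar value marks a constraint and a list value marks an array axis. The function name `split_params` is hypothetical, used only for illustration.

```python
def split_params(params_tuple):
    """Separate scalar constraints from list-valued array axes."""
    constraints = {}
    axes = []
    for dim in params_tuple:
        for name, value in dim.items():
            if isinstance(value, list):
                axes.append((name, value))  # contributes one array dimension
            else:
                constraints[name] = value   # constrains the data, adds no dimension
    return constraints, axes


params_tuple = [
    {"percentile": 95},
    {"unit": "mbps"},
    {"measurement-interval": ["1h", "1d", "1w"]},
]
constraints, axes = split_params(params_tuple)

# Expected shape of the returned array: one entry per list-valued dimension.
shape = tuple(len(labels) for _, labels in axes)

response = [1.9, 0.5, 10.12]
by_interval = dict(zip(axes[0][1], response))
print(constraints, shape, by_interval)
```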
I hope that this format makes sense and is logical. The next step would be to figure out how to elegantly include timestamp information.
Apologies; I uploaded an outdated alto-rucio.json that doesn't correspond to rucio-non-alto.json. The correct file is as follows (GitHub won't seem to let me upload it):
{
  "meta" : {
    "multi-cost-types" : [
      {
        "cost-mode": "numerical",
        "cost-metric": "queued",
        "params-tuple": [
          {"stage": ["Production Input", "Production Output", "total"]},
          {"unit": ["bytes", "files"]}
        ]
      },
      {
        "cost-mode": "numerical",
        "cost-metric": "throughput",
        "params-tuple": [
          {"percentile": 95},
          {"unit": "mbps"},
          {"measurement-interval": ["1h", "1d", "1w"]}
        ]
      },
      {
        "cost-mode": "numerical",
        "cost-metric": "closeness"
      }
    ]
  },
  "ipv4:192.0.2.2": {
    "ipv4:192.0.2.89": [
      [ [4597740396, null], [null, 1], [4597740396, 1] ],
      [ 1.9, 0.5, 10.12 ],
      2
    ],
    "ipv4:192.0.2.43": [
      [ [null, 2], [1192314951, 1], [1192314951, 3] ],
      [ 0.4, null, 8.2 ],
      3
    ]
  }
}
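As a rough illustration of how a client could consume this file, the sketch below flattens it into one labeled record per datapoint. It assumes the layout exactly as posted (endpoint entries at the top level next to "meta", and one value per entry of multi-cost-types, in the same order). The names `iter_cells` and `flatten` are hypothetical.

```python
import json
from itertools import product

DOC = json.loads("""
{
  "meta": {"multi-cost-types": [
    {"cost-mode": "numerical", "cost-metric": "queued",
     "params-tuple": [{"stage": ["Production Input", "Production Output", "total"]},
                      {"unit": ["bytes", "files"]}]},
    {"cost-mode": "numerical", "cost-metric": "throughput",
     "params-tuple": [{"percentile": 95}, {"unit": "mbps"},
                      {"measurement-interval": ["1h", "1d", "1w"]}]},
    {"cost-mode": "numerical", "cost-metric": "closeness"}
  ]},
  "ipv4:192.0.2.2": {
    "ipv4:192.0.2.89": [[[4597740396, null], [null, 1], [4597740396, 1]], [1.9, 0.5, 10.12], 2],
    "ipv4:192.0.2.43": [[[null, 2], [1192314951, 1], [1192314951, 3]], [0.4, null, 8.2], 3]
  }
}
""")


def iter_cells(value, axes):
    """Yield (labels, cell) for every cell of the nested list described by axes."""
    for combo in product(*(range(len(labels)) for _, labels in axes)):
        cell = value
        for index in combo:
            cell = cell[index]
        yield {name: values[i] for (name, values), i in zip(axes, combo)}, cell


def flatten(doc):
    """Turn the nested cost map into a flat list of labeled records."""
    cost_types = doc["meta"]["multi-cost-types"]
    records = []
    for src, dsts in doc.items():
        if src == "meta":
            continue
        for dst, values in dsts.items():
            for cost_type, value in zip(cost_types, values):
                fixed, axes = {}, []
                # A cost type without params-tuple is a plain scalar cost.
                for dim in cost_type.get("params-tuple", []):
                    for name, v in dim.items():
                        if isinstance(v, list):
                            axes.append((name, v))
                        else:
                            fixed[name] = v
                for labels, cell in iter_cells(value, axes):
                    records.append({"src": src, "dst": dst,
                                    "metric": cost_type["cost-metric"],
                                    **fixed, **labels, "value": cell})
    return records


records = flatten(DOC)
print(len(records))
```

Each source/destination pair contributes 6 queued cells, 3 throughput cells, and 1 closeness value, so the two pairs in the example produce 20 records.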
Get access to CERN data and write a script to transform it into a format that can be processed by our hypothetical ML model.