openalto / ietf-hackathon

ALTO: Integrate CERN data model with IETF ALTO format #37

Open jacobdunefsky opened 2 years ago

jacobdunefsky commented 2 years ago

Get access to CERN data and write a script to transform it into a format that can be processed by our hypothetical ML model.

jacobdunefsky commented 2 years ago

After meeting with Mario, we have a better understanding of the application-level data format that Rucio uses at CERN. The next step is to design a JSON schema that both:

  1. captures the same information as the current Rucio schema
  2. adheres to the same format/structure as the rest of the IETF ALTO standard

jacobdunefsky commented 2 years ago

The attached files provide a view of my current thoughts re: a new data model. The file "rucio-non-alto.json" is an example of Rucio's current data format; the file "alto-rucio.json" represents the same data under the proposed new format. The idea is that the latter file would be what is returned by an ALTO server.

The new format is based on RFC 8189. The main new feature beyond that is the params-tuple ordered dict. params-tuple specifies the dimensions of the multidimensional array in which each datapoint will be returned. For instance, in the example, consider

"cost-metric": "queued",
"params-tuple": [
    {"stage": ["Production Input", "Production Output", "total"]},
    {"unit": ["bytes", "files"]}
]

If we pass this value of params-tuple, then data will be returned as a multidimensional array, where the first dimension represents the stage that we are looking at (e.g. the amount queued at stage Production Input) and the second dimension represents the unit of the data. Thus, an example response, a 3-by-2 array indexed first by stage and then by unit, might be

[ [4597740396, null], [null, 1], [4597740396, 1] ]
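
To make the indexing concrete, here is a minimal Python sketch of how a client could read one datapoint out of such a response by label. The lookup function is a hypothetical helper of my own, not something defined by RFC 8189 or by any ALTO library, and it assumes every params-tuple entry is a single-key dict whose value is a list of labels. JSON null shows up as None in Python.

```python
# Hypothetical helper: not part of RFC 8189 or of any ALTO library.
def lookup(data, params_tuple, **labels):
    """Index a nested list using one label per params-tuple dimension."""
    node = data
    for dim in params_tuple:
        # Each entry is assumed to be a single-key dict: {name: [labels...]}
        name, values = next(iter(dim.items()))
        node = node[values.index(labels[name])]
    return node


params_tuple = [
    {"stage": ["Production Input", "Production Output", "total"]},
    {"unit": ["bytes", "files"]},
]
response = [[4597740396, None], [None, 1], [4597740396, 1]]

print(lookup(response, params_tuple, stage="total", unit="files"))             # 1
print(lookup(response, params_tuple, stage="Production Input", unit="bytes"))  # 4597740396
```

The order of the entries in params-tuple matters here: the first entry selects the outer list and the second selects the inner one.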

Additionally, note that some of the dimensions of params-tuple can have cardinality 1. This is useful for specifying constraints on the data. As an example, consider

"cost-metric": "throughput",
"params-tuple": [
    {"percentile": 95},
    {"unit": "mbps"},
    {"measurement-interval": ["1h", "1d", "1w"]}
]

Note that the first two entries are associated with scalars rather than arrays. As a result, the output is constrained to the unit "mbps" and to the 95th percentile of the data, but no new dimension is added to the output array:

[ 1.9, 0.5, 10.12 ]
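
One way to state the rule is that only list-valued entries contribute an axis. The sketch below computes the expected output shape under that reading; output_shape is a hypothetical name of my own, and the assumption that every entry is a single-key dict comes from the examples above rather than from any spec.

```python
# Hypothetical helper illustrating the proposed rule; not from any ALTO spec.
def output_shape(params_tuple):
    """Return the array shape implied by a params-tuple.

    List-valued entries contribute one axis each. Scalar entries act as
    constraints on the data and contribute no axis.
    """
    shape = []
    for dim in params_tuple:
        _name, value = next(iter(dim.items()))
        if isinstance(value, list):
            shape.append(len(value))
    return tuple(shape)


throughput = [
    {"percentile": 95},
    {"unit": "mbps"},
    {"measurement-interval": ["1h", "1d", "1w"]},
]
queued = [
    {"stage": ["Production Input", "Production Output", "total"]},
    {"unit": ["bytes", "files"]},
]

print(output_shape(throughput))  # (3,)
print(output_shape(queued))      # (3, 2)
```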

I hope that this format makes sense and is logical. The next step would be to figure out how to elegantly include timestamp information.

jacobdunefsky commented 2 years ago

Apologies; I uploaded an outdated alto-rucio.json that doesn't correspond to rucio-non-alto.json. The correct file is as follows (GitHub won't seem to let me upload it):

{
    "meta" : {
        "multi-cost-types" : [
            {
                "cost-mode": "numerical",
                "cost-metric": "queued",

                "params-tuple": [
                    {"stage": ["Production Input", "Production Output", "total"]},
                    {"unit": ["bytes", "files"]}
                ]
            },
            {
                "cost-mode": "numerical",
                "cost-metric": "throughput",

                "params-tuple": [
                    {"percentile": 95},
                    {"unit": "mbps"},
                    {"measurement-interval": ["1h", "1d", "1w"]}
                ]
            },
            {
                "cost-mode": "numerical",
                "cost-metric": "closeness"
            }
        ]
    },

    "ipv4:192.0.2.2": {
        "ipv4:192.0.2.89": [
            [ [4597740396, null], [null, 1], [4597740396, 1] ],
            [ 1.9, 0.5, 10.12 ],
            2
        ],
        "ipv4:192.0.2.43": [
            [ [null, 2], [1192314951, 1], [1192314951, 3] ],
            [ 0.4, null, 8.2 ],
            3
        ]
    }
}
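
For the downstream goal of feeding this to a model, a document in this shape can be flattened into one record per source, destination, metric, and label combination. The following is only a sketch under the assumptions visible in the example: the i-th element of each destination's list corresponds to the i-th entry of multi-cost-types, and every top-level key other than "meta" is a source endpoint. The function names and record field names are my own, and scalar constraints such as percentile are left out of the records for brevity.

```python
import itertools


def axes(cost_type):
    """List-valued params-tuple entries as (name, labels) pairs."""
    out = []
    for dim in cost_type.get("params-tuple", []):
        name, value = next(iter(dim.items()))
        if isinstance(value, list):
            out.append((name, value))
    return out


def flatten(doc):
    """Yield one dict per (source, destination, metric, label combination)."""
    cost_types = doc["meta"]["multi-cost-types"]
    for src, dsts in doc.items():
        if src == "meta":
            continue
        for dst, costs in dsts.items():
            # Assumption: costs[i] belongs to cost_types[i].
            for cost_type, value in zip(cost_types, costs):
                ax = axes(cost_type)
                ranges = [range(len(labels)) for _, labels in ax]
                # With no axes, product() yields one empty index tuple,
                # so a plain scalar cost produces exactly one record.
                for idx in itertools.product(*ranges):
                    cell = value
                    for i in idx:
                        cell = cell[i]
                    rec = {"src": src, "dst": dst,
                           "metric": cost_type["cost-metric"], "value": cell}
                    for (name, labels), i in zip(ax, idx):
                        rec[name] = labels[i]
                    yield rec


doc = {
    "meta": {"multi-cost-types": [
        {"cost-mode": "numerical", "cost-metric": "queued", "params-tuple": [
            {"stage": ["Production Input", "Production Output", "total"]},
            {"unit": ["bytes", "files"]}]},
        {"cost-mode": "numerical", "cost-metric": "throughput", "params-tuple": [
            {"percentile": 95}, {"unit": "mbps"},
            {"measurement-interval": ["1h", "1d", "1w"]}]},
        {"cost-mode": "numerical", "cost-metric": "closeness"},
    ]},
    "ipv4:192.0.2.2": {
        "ipv4:192.0.2.89": [
            [[4597740396, None], [None, 1], [4597740396, 1]],
            [1.9, 0.5, 10.12],
            2,
        ],
    },
}

records = list(flatten(doc))
print(len(records))  # 10
print(records[0])
```

The demo uses only the first destination from the file above, which gives 6 queued records, 3 throughput records, and 1 closeness record.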