openbudgets / DAM

OBEU Data Analysis and Mining repository

constraints of input datasets for each data-mining algorithm #13

Open HimmelStein opened 6 years ago

HimmelStein commented 6 years ago

When a user selects a dataset and moves on to the data-mining service, Indigo shall only display the data-mining algorithms that can be applied to the selected dataset.

So, please describe the constraints on input datasets for the data-mining algorithm you developed (send them to me by email before this Thursday).

wk0206 commented 6 years ago

@larjohn as described in the mail, I will add the constraints of each data-mining algorithm to dam.json, so that you get the list when you first visit DAM. For now I am focusing on Timeseries and using it as a sample:

Facts: at least one time dimension with years as values, and three or more years available

Aggregates: at least one time dimension as drilldown

I changed dam.json to the following format:

> "time_series": {
>     "configurations": {
>       "aggregate": {
>         "inputs": {
>         },
>         "outputs": {
>         },
>         "prompt": XX,
>         "method": XX,
>         "endpoint": XX,
>         "name": "aggregate",
>         "title": "Timeseries of aggregated fiscal data"
>       },
>       "conditions": {
>         "Facts": {
>           "dimension": "year",
>           "numberRestriction": "3+",
>           "formatRestriction": "",
>           "description":"at least one time dimension with years as values, and three or more years available"
>         },
>         "Aggregates": {
>           "dimension": "time",
>           "numberRestriction": "",
>           "formatRestriction": "drilldown",
>           "description":"at least one time dimension as drilldown"
>         }
>       }
>     },
>     "name": "time_series",
>     "title": "Time Series",
>     "description": XX
>   }

The difficult part is how to interpret the semantic meaning of "time dimension with years as values": should we only check the dimension title to make sure it contains "year", or do we have to look at the values and guarantee that they pass a regex filter like "^(19|20)\d{2}$"?
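The two options in the question above can be compared with a minimal sketch. The function names and the `minimum` parameter are hypothetical and not part of dam.json; only the regex is taken from the comment.

```python
import re

# Regex quoted in the comment above: four-digit years from 1900 to 2099.
YEAR_RE = re.compile(r"^(19|20)\d{2}$")


def title_mentions_year(title: str) -> bool:
    """Option 1 (hypothetical helper): only inspect the dimension title."""
    return "year" in title.lower()


def values_look_like_years(values, minimum: int = 3) -> bool:
    """Option 2 (hypothetical helper): inspect the values themselves.

    True when every distinct value is a year and at least `minimum`
    distinct years are present ("three or more years available").
    """
    distinct = {str(v).strip() for v in values}
    years = {v for v in distinct if YEAR_RE.fullmatch(v)}
    return len(years) == len(distinct) and len(years) >= minimum
```

The title check is cheap but misses dimensions such as "fiscalPeriod", while the value check needs access to the dimension members.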

As for how to write and read this condition, we would like to hear your opinion. If you prefer some other format, such as putting the conditions at the same level as "inputs" or "endpoint", that is OK for me too.

larjohn commented 6 years ago

@wk0206 time dimensions should be found in the package query, with datetime dimension type:


{
  "model": {
    "dimensions": {
      "global__functionalClassification__78f10": {
        "dimensionType": "classification"
      },
      "global__economicClassification__569a2": {
        "dimensionType": "classification"
      },
      "global__budgetPhase__afd93": {
        "dimensionType": "classification"
      },
      "global__administrativeClassification__854d0": {
        "dimensionType": "classification"
      },
      "global__classification__9ddd4": {
        "dimensionType": "classification"
      },
      "global__currency__1a842": {
        "dimensionType": "classification"
      },
      "global__fiscalPeriod__28951": {
        "dimensionType": "datetime"
      },
      "global__operationCharacter__0c040": {
        "dimensionType": "classification"
      },
      "global__organization__0eba1": {
        "dimensionType": "location"
      },
      "global__administrativeClassification__70a05": {
        "dimensionType": "classification"
      },
      "global__administrativeClassification__f9d35": {
        "dimensionType": "classification"
      },
      "global__administrativeClassification__38bee": {
        "dimensionType": "classification"
      },
      "global__administrativeClassification__13968": {
        "dimensionType": "classification"
      },
      "global__date__99de8": {
        "dimensionType": "classification"
      }
    },
    "measures": {
      "global__amount__0397f": {
        "currency": "EUR",
        "title": "Global amount"
      }
    }
  },
  "countryCode": null,
  "cityCode": null,
  "name": "global",
  "title": "Global dataset"
}
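As a rough sketch of how the time dimensions could be picked out of such a response: the helper name `datetime_dimensions` is hypothetical, and the sample dictionary is a shortened copy of the model above.

```python
def datetime_dimensions(package: dict) -> list:
    """Return the names of all dimensions whose dimensionType is "datetime"."""
    dimensions = package.get("model", {}).get("dimensions", {})
    return [
        name
        for name, spec in dimensions.items()
        if spec.get("dimensionType") == "datetime"
    ]


# Shortened copy of the package shown above.
package = {
    "model": {
        "dimensions": {
            "global__fiscalPeriod__28951": {"dimensionType": "datetime"},
            "global__date__99de8": {"dimensionType": "classification"},
            "global__organization__0eba1": {"dimensionType": "location"},
        }
    },
    "name": "global",
}

print(datetime_dimensions(package))  # ['global__fiscalPeriod__28951']
```

Note that `global__date__99de8` is typed as a classification here, so a check based on the dimension name alone would give a different answer than a check on `dimensionType`.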

Regarding the constraints, please put them inside the configuration, as each configuration (usually facts vs aggregates) could have different starting points, requiring different things, so they can't have the same constraints every time.

Also note that most of the constraints might be better applied at the DAM level, so that instead of requesting the algorithms with a generic request, Indigo should request per dataset and only get the datasets that apply. A good strategy would be to cache the constraints analysis to avoid overhead.
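One way the caching idea could look, as a minimal sketch only. `DATASETS` and `REQUIRED_TYPE` are hypothetical in-memory stand-ins for the package query and for dam.json, and `applicable_configurations` is not an existing DAM function.

```python
from functools import lru_cache

# Hypothetical stand-in for the package query: dataset name -> dimension types.
DATASETS = {
    "global": ("classification", "datetime", "location"),
    "no_time": ("classification",),
}

# Hypothetical stand-in for dam.json: configuration -> required dimension type.
REQUIRED_TYPE = {"time_series/aggregate": "datetime"}


@lru_cache(maxsize=None)
def applicable_configurations(dataset_name: str) -> tuple:
    """Analyse a dataset once; repeated requests are served from the cache."""
    types = DATASETS[dataset_name]
    return tuple(
        sorted(conf for conf, needed in REQUIRED_TYPE.items() if needed in types)
    )
```

A real service would also need to invalidate the cache when a dataset changes, which this sketch leaves out.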

HimmelStein commented 6 years ago

@larjohn why is there a random tail at the end of each global key? E.g. what is the function of "0eba1" in "global__organization__0eba1"?

HimmelStein commented 6 years ago

@larjohn let us take the data-mining function 'time series' as the example. The applicable datasets must have a dimension 'fiscalPeriod', and there shall be at least 3 different values in the dimension 'fiscalPeriod'.

"time_series": {
    "configurations": {
      "aggregate": {
        "inputs": {
        },
        "outputs": {
        },
        "endpoint": <..>,
        "name": "aggregate",
        "title": "Timeseries of aggregated fiscal data"
      },
      "conditions": {
        "Facts": {
          "dimension": "datetime",
          "numberRestriction": "3+",
          "formatRestriction": "",
          "description": "at least one dimension of type \"datetime\" with three or more different values available"
        },
        "Aggregates": {
          "dimension": "datetime",
          "numberRestriction": "",
          "formatRestriction": "drilldown",
          "description": "at least one time dimension as drilldown"
        }
      }
    },
    "name": "time_series",
    "title": "Time Series"
  }

larjohn commented 6 years ago

@HimmelStein sorry for the delay - I have been sick since last week...

The 'random' tail ensures that datasets from the same region that have similar last URI parts get different names. I can't recall exactly what led me to this, but here is an example:

http://datasets.obeu.com/athens/2016/expenditure

http://datasets.obeu.com/thessaloniki/2013/expenditure

In order to select a simple name for those two (not containing dashes etc.) one would use the last part, but it is the same here. So creating a hash of the URI and taking a part of it minimizes name clashes.
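The idea can be illustrated with a small sketch. The hash function (SHA-1), the five-character suffix length and the `name__suffix` layout are assumptions made for illustration; the comment does not say which hash or layout the real implementation uses.

```python
import hashlib


def short_name(uri: str, length: int = 5) -> str:
    """Build a simple name from the last URI part plus a slice of the URI hash.

    Assumptions: SHA-1 and a 5-character hex suffix; the actual choice is unknown.
    """
    last_part = uri.rstrip("/").rsplit("/", 1)[-1]
    suffix = hashlib.sha1(uri.encode("utf-8")).hexdigest()[:length]
    return f"{last_part}__{suffix}"


athens = short_name("http://datasets.obeu.com/athens/2016/expenditure")
thessaloniki = short_name("http://datasets.obeu.com/thessaloniki/2013/expenditure")
print(athens, thessaloniki)
```

Both names start with `expenditure__`, and the suffix derived from the full URI is what keeps them apart. A short slice only minimizes clashes, it does not rule them out.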

The restriction seems good, give me some time to implement it in Indigo.

larjohn commented 6 years ago

@HimmelStein I can't find the updated dam.json. Can you check so that I can update the running instance on the Fraunhofer server?

HimmelStein commented 6 years ago

@larjohn we have not checked it in yet, as we are waiting for your feedback on the format (the conditions used for time series); see my last comment above (the JSON structure).

wk0206 commented 6 years ago

@larjohn I updated dam.json, please check.

larjohn commented 6 years ago
 "conditions": {
          "Facts": {
            "dimension": "datetime",
            "numberRestriction": "3+",
            "formatRestriction": "",
            "description": "at least one dimension of type \"datetime\" with three or more different values available"
          },
          "Aggregates": {
            "dimension": "datetime",
            "numberRestriction": "",
            "formatRestriction": "drilldown",
            "description":"at least one time dimension as drilldown"
          }
        }

So I revisited the constraints, here are my comments:

  1. The constraints should be embedded into each configuration they are applicable for, not the whole algorithm, as different configurations may require different things

  2. The filtering is most naturally done at the DAM level, but...

  3. ...The first constraint is evaluated before building the algorithm input, while the second is during building the algorithm input. The former should be expected at DAM level (do not show datasets that cannot produce any correct input for this algorithm configuration). The latter should be expected to be evaluated at the front-end (indigo) side. So let's define a way to separate them (a custom attribute?)

  4. dimension should be dimension_type, numberRestriction should be cardinalNumberRestriction, and formatRestriction should be roleRestriction (with these possible values: {measure, field, aggregate, drilldown, sort, cut}).
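A minimal sketch of how the DAM-level part of such a condition could be evaluated with the renamed keys from point 4. The function names, the shape of the `dimensions` argument, and the reading of a bare number such as "3" as "exactly 3" are assumptions; only "3+" and the empty string appear in the thread.

```python
import re

# Possible roleRestriction values listed in point 4.
ROLES = {"measure", "field", "aggregate", "drilldown", "sort", "cut"}


def satisfies_cardinality(restriction: str, distinct_values: int) -> bool:
    """"" means no restriction, "3+" at least 3, "3" exactly 3 (assumed syntax)."""
    if not restriction:
        return True
    match = re.fullmatch(r"(\d+)(\+?)", restriction)
    if match is None:
        raise ValueError(f"unsupported restriction: {restriction!r}")
    number = int(match.group(1))
    if match.group(2):
        return distinct_values >= number
    return distinct_values == number


def dataset_passes(condition: dict, dimensions: dict) -> bool:
    """DAM-level check; dimensions maps name -> (dimension type, distinct values)."""
    wanted = condition["dimension_type"]
    restriction = condition.get("cardinalNumberRestriction", "")
    return any(
        dim_type == wanted and satisfies_cardinality(restriction, count)
        for dim_type, count in dimensions.values()
    )
```

The roleRestriction part is deliberately not evaluated here, since point 3 suggests it belongs to the front-end side while the algorithm input is being built.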