openbudgets / pipeline-fragments

Reusable fragments of LinkedPipes ETL pipelines

FDPtoRDF: could we have operationCharacter and budgetPhase as attributes? #10

Closed marek-dudas closed 8 years ago

marek-dudas commented 8 years ago

During testing, I found out that the pipeline does not correctly handle the fact that the FDP equivalents of operationCharacter and budgetPhase are properties of FDP measures.

I think @jindrichmynarz is most qualified to answer this.

jindrichmynarz commented 8 years ago

So the actual problem is that operationCharacter and budgetPhase in FDP datasets don't adhere to the constraints on qb:DimensionProperty, i.e. available for every observation?

operationCharacter and budgetPhase are not attributes in the DCV sense, because they don't qualify measures, so I'd not cast them as instances of qb:AttributeProperty.

For these cases, we minted obeu:OptionalProperty derived from the generic qb:ComponentProperty. This property is meant to be used for component properties that are missing for some observations, but don't adhere to the semantics of qb:AttributeProperty.

One solution to the problem you raised could thus be to mint instances of obeu:OptionalProperty in the FDP-specific namespace. These can be related to obeu-dimension:operationCharacter and obeu-dimension:budgetPhase by sharing the same rdfs:range (i.e. obeu:OperationCharacter and obeu:BudgetPhase). Alternatively, we can also link them via a shared qb:concept, which is already done for some component properties (e.g., both obeu-attribute:currency and obeu-dimension:currency link to sdmx-concept:currency).

However, if you use obeu:OptionalProperty you need to ensure that the values of the remaining dimensions still uniquely identify the observations (i.e. as per IC-12 you cannot have multiple observations sharing the same dimension values). If this requirement cannot be satisfied, perhaps this is in fact an indication of errors in the FDP datasets. In such case, I'd stick to using dimensions to make the errors clear.
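
The uniqueness requirement above (IC-12) can be checked mechanically. The following is a minimal sketch in Python, not part of the pipeline: observations are modelled as plain dictionaries, and the function name `find_duplicate_observations`, the keys, and the sample values are all hypothetical. It groups observations by the values of the chosen dimensions and reports any group with more than one member.

```python
from collections import defaultdict


def find_duplicate_observations(observations, dimensions):
    """Return groups of observation ids that share the same dimension values."""
    groups = defaultdict(list)
    for obs in observations:
        # The key is the tuple of dimension values; a missing dimension becomes None.
        key = tuple(obs.get(dim) for dim in dimensions)
        groups[key].append(obs["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical observations that differ only in their budget phase.
observations = [
    {"id": "obs1", "year": 2006, "admin": "A1", "phase": "approved", "amount": 100},
    {"id": "obs2", "year": 2006, "admin": "A1", "phase": "executed", "amount": 90},
]

# With phase treated as a dimension, every observation is uniquely identified.
with_phase = find_duplicate_observations(observations, ["year", "admin", "phase"])

# With phase demoted to an optional property, the remaining dimensions clash.
without_phase = find_duplicate_observations(observations, ["year", "admin"])
```

In this sketch `with_phase` is empty while `without_phase` reports the two observations as duplicates, which is the situation the IC-12 caveat warns about.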

marek-dudas commented 8 years ago

I would rather say the actual problem is FDP considering the properties as attributes of measures and OBEU as dimensions characterizing each observation. The dimension vs. attribute difference seems quite philosophical to me.

Using obeu:OptionalProperty however seems like the compromise I was hoping for. Both discussed properties are defined only in the descriptor, not the CSV table, and are not part of observation's identification, since they have always the same value for all observations in the dataset, so I don't see any problem there. I will implement it this way for now. The resulting datasets can be, with some effort, transformed with an additional pipeline into the "one dataset per measure" format in the future if necessary. Thanks for the swift response!

jindrichmynarz commented 8 years ago

> The dimension vs. attribute difference seems quite philosophical to me.

The definition of attributes in DCV is quite clear: attributes qualify a measure. What is the definition of attribute in FDP?

> Both discussed properties are defined only in the descriptor, not the CSV table, and are not part of observation's identification, since they have always the same value for all observations in the dataset, so I don't see any problem there.

If they are defined in the descriptor, they apply to all observations, right (qb:componentAttachment qb:DataSet)? If that is the case I see no problem with using the original dimensions obeu-dimension:operationCharacter and obeu-dimension:budgetPhase.

marek-dudas commented 8 years ago

They apply to all observations, but there can be several different operation characters or budget phases in one dataset, each related to different measure, and as they are optional, the same dataset could include a measure without budget phase and/or operation character specified. I could create subproperties of the obeu operation character and budget phase dimensions for each measure and include the related measure name in their uri, but I don't think that hiding info in uri is a good practice.

Maybe it would be better to discuss a specific example, as I am not good with explanations.

Here is a part of a json descriptor describing the measures. "Direction" is operation character and "phase" is budget phase. It's from boost-armenia datapackage

"mapping": {
    "measures": { 
      "approved_amount": {
        "source": "approved",
        "direction": "expenditure",
        "phase": "approved",
        "currency": "AMD"
      },
      "adjusted_amount": {
        "source": "adjusted",
        "direction": "expenditure",
        "phase": "adjusted",
        "currency": "AMD"
      },
      "executed_amount": {
        "source": "executed",
        "direction": "expenditure",
        "phase": "executed",
        "currency": "AMD"
      }
    },
(...)

and here are a few lines from the CSV:

year,admin,econ1,econ2,econ3,econ4,func1,func2,func3,program,exp_type,Econ/func,transfer,approved,adjusted,executed
2006,101001 Staff of President of RA,4000 Running expenses,4100 Payment for labor,4110 Salaries and additional payments paid in drams,4111 Salaries and additional payments of employees,01 General public services,"0101 Legislative and executive bodies, public administration, financial and fiscal relations, foreign affairs","010101 Legislative and executive bodies, public administration",,1 Personnel,Function,Excluding transfers,335071800,330723300,330723200
2006,101001 Staff of President of RA,4000 Running expenses,4100 Payment for labor,4110 Salaries and additional payments paid in drams,"4113 Civil, judicial and other public servants remuneration",01 General public services,"0101 Legislative and executive bodies, public administration, financial and fiscal relations, foreign affairs","010101 Legislative and executive bodies, public administration",,1 Personnel,Function,Excluding transfers,10718200,10718200,10712800
2006,101001 Staff of President of RA,4000 Running expenses,4100 Payment for labor,4130 Actual social security payments,4131 Social security payments,01 General public services,"0101 Legislative and executive bodies, public administration, financial and fiscal relations, foreign affairs","010101 Legislative and executive bodies, public administration",,1 Personnel,Function,Excluding transfers,63093400,57201200,57201200
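
To make the attachment explicit, here is a minimal Python sketch (an illustration only, not pipeline code) that reads the `measures` mapping shown above and records, for each source CSV column, which measure it feeds and which optional `direction` and `phase` values that measure declares. The function name `measure_attributes` and the shape of the result are assumptions.

```python
import json

# The "measures" part of the descriptor fragment quoted above.
descriptor = json.loads("""
{
  "mapping": {
    "measures": {
      "approved_amount": {"source": "approved", "direction": "expenditure",
                          "phase": "approved", "currency": "AMD"},
      "adjusted_amount": {"source": "adjusted", "direction": "expenditure",
                          "phase": "adjusted", "currency": "AMD"},
      "executed_amount": {"source": "executed", "direction": "expenditure",
                          "phase": "executed", "currency": "AMD"}
    }
  }
}
""")


def measure_attributes(descriptor):
    """Map each source CSV column to its measure name and optional attributes."""
    result = {}
    for name, spec in descriptor["mapping"]["measures"].items():
        result[spec["source"]] = {
            "measure": name,
            # Both keys are optional in the descriptor, so a missing value is None.
            "direction": spec.get("direction"),
            "phase": spec.get("phase"),
        }
    return result


attrs = measure_attributes(descriptor)
```

Here the `executed` column maps to the `executed_amount` measure with phase `executed`, so the phase varies per measure within a single CSV row rather than per row.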

jindrichmynarz commented 8 years ago

> They apply to all observations, but there can be several different operation characters or budget phases in one dataset, each related to different measure.

So they are specific to a measure, like qb:componentAttachment qb:MeasureProperty? In that case, if you want to translate this to DCV as literally as possible, you would mint multiple measures. However, I don't think a close translation matters, and it seems sub-optimal in this case.

> I could create subproperties of the obeu operation character and budget phase dimensions for each measure and include the related measure name in their uri, but I don't think that hiding info in uri is a good practice.

No, this is not a good practice.

In case of the example, all the measure attributes are available for all measures, so there is no problem in translating them to dimensions (or attributes, if that's more appropriate). Should you have measures with different FDP attributes, you'd model those not used for all measures either as obeu:OptionalProperty or qb:AttributeProperty depending on their semantics.

marek-dudas commented 8 years ago

I am already minting multiple measures on one observation. I see two solutions without possible future problems:

a) creating attributes in the FDP namespace attached to measures and using OBEU predefined values with them
b) creating separate datasets for each measure

Creating separate observations for each measure and modeling budget phase and operation character as dimensions would be problematic, since some observations could then have that dimension value missing, e.g. in this case:

"mapping": {
    "measures": { 
      "approved_amount": {
        "source": "approved",
        "direction": "expenditure",
        "phase": "approved",
        "currency": "AMD"
      },
      "adjusted_amount": {
        "source": "adjusted",
        "direction": "expenditure",
        "phase": "adjusted",
        "currency": "AMD"
      },
      "foo_amount": {
        "source": "foo",
        "currency": "AMD"
      },

I would go with a) at this moment.

jindrichmynarz commented 8 years ago

Why not cast the intersection of FDP measure attributes as instances of either qb:DimensionProperty or qb:AttributeProperty (depending on their semantics) and cast the rest as instances of either obeu:OptionalProperty or qb:AttributeProperty (depending on their semantics)?
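
The split suggested here can be sketched in a few lines of Python. This is a hypothetical illustration, not pipeline code: the function name `split_measure_attributes` and the `candidates` parameter are assumptions, and the input mirrors the descriptor fragment with `foo_amount` quoted earlier in the thread.

```python
def split_measure_attributes(measures, candidates=("direction", "phase")):
    """Split candidate attributes into those declared by every measure and the rest."""
    # For each measure, keep only the candidate keys it actually declares.
    present = [set(spec) & set(candidates) for spec in measures.values()]
    if not present:
        return [], []
    shared = set.intersection(*present)    # declared by every measure
    optional = set.union(*present) - shared  # declared by only some measures
    return sorted(shared), sorted(optional)


# Measures from the descriptor fragment in which foo_amount lacks both attributes.
measures = {
    "approved_amount": {"source": "approved", "direction": "expenditure",
                        "phase": "approved", "currency": "AMD"},
    "adjusted_amount": {"source": "adjusted", "direction": "expenditure",
                        "phase": "adjusted", "currency": "AMD"},
    "foo_amount": {"source": "foo", "currency": "AMD"},
}

shared, optional = split_measure_attributes(measures)
```

With `foo_amount` present, both `direction` and `phase` fall into `optional`; without it, both would land in `shared` and could be cast as ordinary dimensions or attributes.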

marek-dudas commented 8 years ago

There are only two FDP measure attributes we are dealing with: "direction" and "phase". I am casting them as qb:AttributeProperty. Is that ok?

jindrichmynarz commented 8 years ago

So the above examples and discussion were only hypothetical?

marek-dudas commented 8 years ago

No, it is purely practical, the first example is a real FDP dataset. It's just that there are only two such problematic FDP "measure attributes": "direction" and "phase". So we are not looking for a general way of transforming any "FDP measure attribute", just a specific solution for these two.

The problem I have is that they are both optional and attached to measures, which I think an optional qb:AttributeProperty attached to qb:MeasureProperty solves acceptably.

jindrichmynarz commented 8 years ago

OK. I assume you currently map direction to obeu-dimension:operationCharacter and phase to obeu-dimension:budgetPhase. Correct?

marek-dudas commented 8 years ago

Yes, because originally I incorrectly thought both are (in FDP) attached to the whole dataset. Now I found out they are attached to FDP measures.

jindrichmynarz commented 8 years ago

I'd use the approach I proposed above in this comment. I think this is preferable to multiple measures, because it is closer to the OpenBudgets.eu data model (no need for obeu-measure:amount subproperties, avoids reinventing core component properties unless necessary due to DCV's cardinality constraints).

A side note: Is it always a budget phase? We also have obeu-dimension:paymentPhase to cater for phases in spending data.

marek-dudas commented 8 years ago

On the other hand, obeu-measure:amount subproperties are closer to FDP data model: there is for example usually some semantics hidden in the name of the measure property (like "EU_amount", "Total_amount"...) So what you mean is using the "measure dimension" with qb:measureType approach?

I am currently aiming at "working pipeline producing OBEU compliant dataset not violating any constraints". It could be of course enhanced to create nicer dataset, but it would take time, a lot of time in my case. And since I have the issue discussed here already almost solved the way I proposed, I would stick with it for now. Any other solution would I think mean large-size rebuilding of many parts of the pipeline.

It is always a budget phase, the allowed FDP values directly correspond to obeu:BudgetPhase instances.

jindrichmynarz commented 8 years ago

> So what you mean is using the "measure dimension" with qb:measureType approach?

No, that would be the multi-measure approach. I suggest using a single measure with separate dimensions and attributes instead of several measures that have particular values of dimensions or attributes baked in (i.e. qb:componentAttachment qb:MeasureProperty).
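
As a rough illustration of the single-measure idea, the Python sketch below turns one CSV row carrying three FDP measures into three observations that share one `amount` key, with the values formerly baked into each measure attached to the observation instead. The function name `unpivot`, the output keys, and the trimmed row are assumptions; this is not how the pipeline is implemented.

```python
def unpivot(row, measures):
    """Create one observation per FDP measure, each with a single amount."""
    source_columns = {spec["source"] for spec in measures.values()}
    # Everything that is not a measure column is shared by all observations.
    shared = {key: value for key, value in row.items() if key not in source_columns}
    observations = []
    for spec in measures.values():
        obs = dict(shared)
        obs["amount"] = row[spec["source"]]
        # direction and phase are optional in FDP, so only copy them when present.
        if "direction" in spec:
            obs["operationCharacter"] = spec["direction"]
        if "phase" in spec:
            obs["budgetPhase"] = spec["phase"]
        observations.append(obs)
    return observations


measures = {
    "approved_amount": {"source": "approved", "direction": "expenditure", "phase": "approved"},
    "adjusted_amount": {"source": "adjusted", "direction": "expenditure", "phase": "adjusted"},
    "executed_amount": {"source": "executed", "direction": "expenditure", "phase": "executed"},
}

# A trimmed version of the first boost-armenia CSV row (most columns omitted).
row = {"year": "2006", "admin": "101001 Staff of President of RA",
       "approved": 335071800, "adjusted": 330723300, "executed": 330723200}

observations = unpivot(row, measures)
```

Each resulting observation then differs from its siblings only in `budgetPhase` and `amount`, so the phase acts as a distinguishing dimension whenever every measure declares one.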

> I am currently aiming at "working pipeline producing OBEU compliant dataset not violating any constraints".

Both the multi-measure approach and my suggestion are compatible with the OBEU data model.

> It could be of course enhanced to create nicer dataset, but it would take time, a lot of time in my case. And since I have the issue discussed here already almost solved the way I proposed, I would stick with it for now. Any other solution would I think mean large-size rebuilding of many parts of the pipeline.

As an outsider, it seems to me that multi-measure approach is more complicated. What do you think is difficult about my suggestion?

marek-dudas commented 8 years ago

So, e.g., three FDP measures would result in three observations with some artificially created dimension specifying which original FDP measure the obeu:amount corresponds to in that observation?

marek-dudas commented 8 years ago

As the multi-measure approach (this one to be clear) is currently implemented, anything else seems more complicated to me as it would mean changing the pipeline. The multi-measure seems to correspond to FDP model better, and so it seems less complicated to me.

marek-dudas commented 8 years ago

An example fragment of the output of my "least effort" solution based on (I hope) your suggestions:

<http://data.openbudgets.eu/ontology/dsd/esif2014> a qb:DataStructureDefinition ;
    qb:component _:node1aok8fghtx1 , _:node1aok8fghtx2 , _:node1aok8fghtx3 , _:node1aok8fghtx4 , _:node1aok8fghtx5 , <http://data.openbudgets.eu/ontology/dsd/esif2014/component/budgetPhase> , <http://data.openbudgets.eu/ontology/dsd/esif2014/component/operationCharacter> , _:node1aok8fghtx10 , _:node1aok8fghtx6 , _:node1aok8fghtx7 , _:node1aok8fghtx8 , _:node1aok8fghtx9 .

_:node1aok8fghtx3 qb:measure <http://data.openbudgets.eu/ontology/dsd/esif2014/measure/EU_Amount> .

<http://data.openbudgets.eu/ontology/dsd/esif2014/measure/EU_Amount> obeu-attribute:currency <http://data.openbudgets.eu/codelist/currency/EUR> ;
    a rdf:Property , qb:MeasureProperty ;
    rdfs:subPropertyOf obeu-measure:amount ;
    <http://schemas.frictionlessdata.io/fiscal-data-package#budgetPhase> obeu-budgetphase:approved ;
    <http://schemas.frictionlessdata.io/fiscal-data-package#operationCharacter> obeu-operation:expenditure .

_:node1aok8fghtx4 qb:measure <http://data.openbudgets.eu/ontology/dsd/esif2014/measure/National_Amount> .

<http://data.openbudgets.eu/ontology/dsd/esif2014/measure/National_Amount> obeu-attribute:currency <http://data.openbudgets.eu/codelist/currency/EUR> ;
    a rdf:Property , qb:MeasureProperty ;
    rdfs:subPropertyOf obeu-measure:amount ;
    <http://schemas.frictionlessdata.io/fiscal-data-package#operationCharacter> obeu-operation:revenue .

_:node1aok8fghtx5 qb:measure <http://data.openbudgets.eu/ontology/dsd/esif2014/measure/Total_Amount> .

<http://data.openbudgets.eu/ontology/dsd/esif2014/measure/Total_Amount> obeu-attribute:currency <http://data.openbudgets.eu/codelist/currency/EUR> ;
    a rdf:Property , qb:MeasureProperty ;
    rdfs:subPropertyOf obeu-measure:amount ;
    <http://schemas.frictionlessdata.io/fiscal-data-package#operationCharacter> obeu-operation:revenue .

<http://data.openbudgets.eu/ontology/dsd/esif2014/component/budgetPhase> qb:attribute <http://schemas.frictionlessdata.io/fiscal-data-package#budgetPhase> ;
    qb:componentAttachment qb:MeasureProperty ;
    qb:componentRequired false .

<http://data.openbudgets.eu/ontology/dsd/esif2014/component/operationCharacter> qb:attribute <http://schemas.frictionlessdata.io/fiscal-data-package#operationCharacter> ;
    qb:componentAttachment qb:MeasureProperty ;
    qb:componentRequired false .

<http://schemas.frictionlessdata.io/fiscal-data-package#budgetPhase> a qb:AttributeProperty , rdf:Property ;
    rdfs:range obeu:BudgetPhase .

<http://schemas.frictionlessdata.io/fiscal-data-package#operationCharacter> a qb:AttributeProperty , rdf:Property ;
    rdfs:range obeu:OperationCharacter .

<http://data.openbudgets.eu/resource/dataset/esif2014/observation/1fa31e2c-faf3-4952-8185-aec2df4505a9> a qb:Observation ;
    <http://data.openbudgets.eu/ontology/dsd/esif2014/measure/EU_Amount> 3730936.91 ;
    <http://data.openbudgets.eu/ontology/dsd/esif2014/measure/National_Amount> 3618957.31 ;
    <http://data.openbudgets.eu/ontology/dsd/esif2014/measure/Total_Amount> 7349894.22 ;
    <http://data.openbudgets.eu/ontology/dsd/esif2014/dimension/unknown> "M02" ;
    <http://data.openbudgets.eu/ontology/dsd/esif2014/dimension/administrator> <http://data.openbudgets.eu/resource/dataset/esif2014/administrator/AT> ;
    <http://data.openbudgets.eu/ontology/dsd/esif2014/dimension/date> <http://reference.data.gov.uk/id/gregorian-year/2014> ;
    <http://data.openbudgets.eu/ontology/dsd/esif2014/dimension/functional-classification> <http://data.openbudgets.eu/resource/dataset/esif2014/functional-classification/2014AT06RDNP001-1> ;
    <http://data.openbudgets.eu/ontology/dsd/esif2014/dimension/fin-source> "EAFRD" ;
    qb:dataSet <http://data.openbudgets.eu/resource/dataset/esif2014> .

jindrichmynarz commented 8 years ago

On the one hand, having component properties attached to measures is easier to implement, because that's how FDP does things. On the other hand, attaching properties to observations is more natural to DCV, as dimensions on measures don't make much sense. Data modelling decisions should thus consider their implementation cost. If it is costly to produce a representation more in line with DCV, unless it is offset by benefits for the users of the data, then go with the FDP way.

Your last data snippet uses the approach with multiple measures that you proposed, so we probably got this confused. What I propose is:

We can return to the original question "Could we have operationCharacter and budgetPhase as attributes?" to obtain more clarity. If using these component properties as dimensions requires undue implementation effort, then you can definitely mint new attributes with similar interpretation.

marek-dudas commented 8 years ago

Regarding the operationCharacter and budgetPhase issue itself and its solution at this moment: is what I showed in the snippet above an acceptable solution for now? From my side it is, I have it implemented and it seems to be working and giving valid output.

I would propose creating a separate issue marked as "enhancement" for discussion of the single-measure vs. multiple-measure approaches. I think I am finally starting to partially understand what you mean. In any case, can we agree it would be nice to have a different approach for dealing with multiple measures, but as it would take some time to implement it and the current approach is acceptable, we will leave it to a possible next version of the pipeline?

jindrichmynarz commented 8 years ago

Sure, let's discuss this in another issue in case users of the data produced by the FDP2RDF pipeline would find the chosen modelling difficult.

Your example seems to be fine. I'd only suggest using obeu:OptionalProperty instead of qb:AttributeProperty for <http://schemas.frictionlessdata.io/fiscal-data-package#budgetPhase> and <http://schemas.frictionlessdata.io/fiscal-data-package#operationCharacter> (and similarly qb:componentProperty instead of qb:attribute for relating these component properties to their component specifications).

A minor side question: Is there a reason to use snake case, such as EU_Amount instead of kebab-case (i.e. LCASE(REPLACE("EU_Amount", "_", "-"))) recommended by the OpenBudgets.eu data model?
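
For reference, the SPARQL expression above has a direct counterpart in most languages. A minimal Python sketch, with the hypothetical helper name `to_kebab_case`:

```python
def to_kebab_case(name):
    """Mirror LCASE(REPLACE(name, "_", "-")): swap underscores for hyphens, then lowercase."""
    return name.replace("_", "-").lower()


kebab = to_kebab_case("EU_Amount")  # "eu-amount"
```

Note that this only handles underscores; names containing spaces or other separators would need additional rules.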

marek-dudas commented 8 years ago

Thanks, I'll adjust it to OptionalProperty and keep it that way for now.

The names such as EU_Amount come directly from the FDP descriptor. I can put name adjustment at the end of the pipeline to the "nice to have features" list.