NGSI-LD Context Support

chicco785 commented 3 years ago

Is your feature request related to a problem? Please describe.

Compared to NGSIv2, NGSI-LD introduces a special field, @context, that provides a linked-data-inspired description of the attributes used in the payload (cf. #398).

e.g.

{
  "id": "urn:ngsi-ld:Vehicle:A4567",
  "type": "Vehicle",
  "speed#1": {
    "type": "Property",
    "value": 55,
    "source": {
      "type": "Property",
      "value": "Speedometer"
    },
    "datasetId": "urn:ngsi-ld:Property:speedometerA4567-speed"
  },
  "speed#2": {
    "type": "Property",
    "value": 54.5,
    "source": {
      "type": "Property",
      "value": "GPS"
    },
    "datasetId": "urn:ngsi-ld:Property:gpsBxyz123-speed"
  },
  "@context": [
    {
      "speed#1": "http://example.org/speed",
      "speed#2": "http://example.org/speed",
      "source": "http://example.org/hasSource"
    },
    "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context-v1.3.jsonld"
  ]
}

This new attribute should be stored as well in the timeseries backend.

Describe the solution you'd like

Considering the current data model, there could be three options (picking the best one requires some expertise on NGSI-LD; advice from @jason-fox and @kzangeli is welcome!):

1. QL will persist only the last context (leveraging the metadata table) for an entityType for a given fiwareService, so each time the context changes, the old one is overwritten.
2. QL will persist only the last context for an entityId for a given fiwareService, so each time the context changes, the old one is overwritten, but different entityIds can have different contexts (this is not particularly brilliant performance-wise, because it will increase the number of queries needed to retrieve information).
3. QL will persist the context for each entry, so you can track the evolution of the context over time, but of course things may get messy when returning the context on aggregated queries.

Describe alternatives you've considered N/A

Additional context N/A

kzangeli commented 3 years ago

I would DEFINITELY NOT store the context in the database - the context is not an attribute of the entity. It's more like a map that you must use to expand the aliases (value of entity type + attribute names). These expansions are the real names of the attributes (and real value of the entity type). And that (the longnames) is what you need to store in the database.

If this isn't 100% clear, let's have a chat and I'll explain this in detail.
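To make the "context as a map" idea concrete, here is a minimal sketch, assuming the context has already been flattened into a plain alias-to-IRI dictionary. The helper `expand_name` and the `ctx` dictionary are hypothetical names for illustration; real JSON-LD expansion also handles prefixes, `@vocab`, and nested or remote contexts, none of which is covered here.

```python
def expand_name(name: str, context: dict[str, str]) -> str:
    """Return the expanded (long) name for a short attribute name."""
    # Names that already look like absolute IRIs are left untouched.
    if "://" in name:
        return name
    # Assumption of this sketch: unknown aliases are returned unchanged.
    return context.get(name, name)


# Hypothetical flattened context: alias -> expanded name
# (IRIs taken from the example payload in the issue description).
ctx = {
    "speed": "http://example.org/speed",
    "source": "http://example.org/hasSource",
}

payload = {"speed": 55}
expanded = {expand_name(key, ctx): value for key, value in payload.items()}
```

With this mapping, `expanded` holds the value under `http://example.org/speed`, which is the name that would be stored instead of the alias.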

kzangeli commented 3 years ago

Additionally, it seems you are using an outdated NGSI-LD spec. We changed the format for attributes with datasetId, quite some time ago. We now use an array, instead of various fields with '#x', e.g.:

"P1": [
  {
    "type": "Property",
    "value": 1
  },
  {
    "type": "Property",
    "value": 2,
    "datasetId": "urn:x1"
  },
  {
    "type": "Property",
    "value": 3,
    "datasetId": "urn:x2"
  }
]

[ Don't want to see you wasting time on implementing obsolete stuff ... :) ]

c0c0n3 commented 3 years ago

@kzangeli thanks for your feedback, always appreciated :-)

So I've been trying to make friends with NGSI-LD but I think he doesn't like me. No honestly, I think it's going to be a while before I can call myself even a moderately knowledgeable LD chap. But what I gather from the spec is that the way you interpret a piece of JSON like the example above depends on the context it's tied to. So we could think of this "interpretation" process as a function

interpret : Context ⨉ Attribute ---> Meaning

For example, say I've got a context ctx that defines a speed attribute as an odometer reading in km/h so

interpret ( ctx, { speed: 20 } ) = you're doing 20 km per hour

Later on, someone decides to change the units and publishes a new context ctx' where speed is in mph so now

interpret ( ctx', { speed: 20 } ) = you're doing 20 miles per hour

If we only stored the speed attribute without a context, how would we interpret speed: 20? Was the car doing 20 km/h or rather 20 mph = 32.1869 km/h? Also I think the two attributes should actually be considered different even if they sit in the same entity, i.e. we have ctx.speed and ctx'.speed at least if I understand RDF mechanics correctly---I'm no RDF expert either, so take my words with a pinch of salt.

So if I understand your suggestion, instead of storing an attribute named speed we should rather store two attributes in this case: expansion_of(ctx.speed) and expansion_of(ctx'.speed). That would solve the semantics problem I think, well as long as it isn't possible for two attributes defined in different contexts to resolve to the same long name. But I don't think that's the case? e.g. if ctx = http://foo/ctx/1.0 and ctx' = http://foo/ctx/2.0 then

expansion_of(ctx.speed)  = http://foo/ctx/1.0/speed
expansion_of(ctx'.speed) = http://foo/ctx/2.0/speed

Is it?

c0c0n3 commented 3 years ago

Also, given an attribute "long" name, we should always be able to retrieve the context in which it was defined, is that so? e.g.

context_of(http://wada/wada/x) = http://wada/wada/

??

chicco785 commented 3 years ago

Additionally, it seems you are using an outdated NGSI-LD spec. We changed the format for attributes with datasetId, quite some time ago. We now use an array, instead of various fields with '#x', e.g.:

To be honest, I am quite lost: finding the right specs does not seem to be straightforward. This is the document Google gives back when searching for the specs: ETSI GS CIM 009 V1.1.1 (dated Jan 2019)

is this backward compatible with NGSIv2? probably not.

Anyhow, for the way we store data today, i.e. flat (and that will not change, because otherwise queries would require joins and performance would be shit), this may be irrelevant. Each attribute at level 0 is translated into a db field of a given type based on the "value". Basically, we will store all the attribute bloat in an array of objects.

c0c0n3 commented 3 years ago

the way we store data today i.e. flat

yep spot on. We need to think about this carefully, I don't think the way we store data at the moment is NGSI-LD friendly :-)

chicco785 commented 3 years ago

the way we store data today i.e. flat

yep spot on. We need to think about this carefully, I don't think the way we store data at the moment is NGSI-LD friendly :-)

it's timeseries friendly and backward compatible to support ngsi-v2, if that's not good enough, we don't care.

c0c0n3 commented 3 years ago

cool. also, if we stored "long" names, we'd be making a lot of people unhappy I reckon since it would be a bit of a mission to e.g. write a query in grafana to pull data out of an entity table...

chicco785 commented 3 years ago

I would DEFINITELY NOT store the context in the database - the context is not an attribute of the entity. It's more like a map that you must use to expand the aliases (value of entity type + attribute names). These expansions are the real names of the attributes (and real value of the entity type). And that (the longnames) is what you need to store in the database.

If this isn't 100% clear, let's have a chat and I'll explain this in detail.

Long names are going to be SQL-query unfriendly, and I am not exactly sure I understand why this is actually needed... We don't plan to do much with the context: neither expanding the names nor any other operation. The point is only to be able to return the context associated with an entity instance, since this is needed. If not in the database, where would you store it?

chicco785 commented 3 years ago

To make the rationale clear: when not running aggregations, exploding the complexity of the information used to represent data may not have an impact, but on time series, assuming you want to compute aggregates over temporal intervals, it does have quite an impact.

This is the rationale behind some clear limitations we already have today: if attribute x is of type number today, it must be a number again tomorrow, otherwise managing aggregations and so on becomes impossible. So far this has been accepted by our users, and I think it is reasonable not to change this approach when moving from ngsi-v2 to ngsi-ld, especially considering the use cases we have been dealing with so far.

c0c0n3 commented 3 years ago

I think is reasonable to not change this approach moving from ngsi-v2 to ngsi-ld

Well, we might have to actually. Queries anyone? I think you mentioned this already earlier, but here's the nasty scenario. I'll build on the speed attribute example from my earlier comment. To process speed we need to understand what it is, well to some extent at least. Say we've got this series

(ctx, { speed: 20 }, t1), (ctx, { speed: 30 }, t2), (ctx, { speed: 20 }, t3), (ctx', { speed: 20 }, t4)

How would we compute the average speed?! It turns out that's the wrong question since we have two series actually:

(ctx, { speed: 20 }, t1), (ctx, { speed: 30 }, t2), (ctx, { speed: 20 }, t3)
(ctx', { speed: 20 }, t4)

The average speed for ctx.speed is (20 + 30 + 20) / 3 = 23.33 km/h whereas ctx'.speed's average is 20 mph. Notice the units! Adding up values from t1 through t4 would be like adding apples and oranges, nonsense. Oh dear. Lots to think about I guess...
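The split into two series can be sketched as follows. This is only an illustration under the assumption that each data point carries an identifier for the context it was interpreted in; the tuple layout and the function name `average_by_context` are made up for the example and are not part of QuantumLeap.

```python
from collections import defaultdict
from statistics import mean

# (context id, speed value, time point), using the values from the example.
series = [
    ("ctx", 20, "t1"),
    ("ctx", 30, "t2"),
    ("ctx", 20, "t3"),
    ("ctx'", 20, "t4"),
]


def average_by_context(points):
    """Average the values separately for each context they came from."""
    groups = defaultdict(list)
    for context_id, value, _time in points:
        groups[context_id].append(value)
    return {context_id: mean(values) for context_id, values in groups.items()}


averages = average_by_context(series)
```

Here `averages["ctx"]` is 70 / 3 (about 23.33) and `averages["ctx'"]` is 20, so values from the two contexts are never mixed.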

kzangeli commented 3 years ago

So, let's set up an audio conference and straighten things out a little. Seems necessary :)

c0c0n3 commented 3 years ago

hahahaha, yea, good idea :-)

kzangeli commented 3 years ago

Before that, just some food for thought. If you could please forget about storing the context in the DB, and instead storing the attribute name (not the alias - the expanded name, which is the real name of the attribute), you will see how suddenly all your problems go away.

Except one:

In Orion-LD/mongo, I replace all dots (.) in an attribute name with an equals sign (=), as the dot is used as a separator in the query language.

E.g. GET /entities?q=A.b==12

Meaning: give me all entities that have an attribute named A (whatever that is expanded to using the current context), that have a sub-attribute called 'b' (expanded ...) with a value of 12.

So, the attribute names cannot contain any dots in the DB.

The fix is straightforward:

Creation/Update?

Replace all '.' with '='
Store to DB

Query?

Get from DB
Replace all '=' with '.'

Voila. Problem solved!

Footnote: '=' is a forbidden character in an attribute name - I had to pick some forbidden char to use as a replacement for the dot.
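The two steps above can be sketched as a pair of helpers. This is not Orion-LD's actual code (which is not shown in this thread), just a minimal illustration; the function names are hypothetical, and the check for '=' relies on the footnote's statement that '=' is forbidden in attribute names.

```python
def encode_attr_name(name: str) -> str:
    """Make an expanded attribute name safe to store: '.' becomes '='."""
    # '=' is forbidden in attribute names, so it cannot clash with real content.
    if "=" in name:
        raise ValueError(f"'=' is not allowed in attribute names: {name!r}")
    return name.replace(".", "=")


def decode_attr_name(stored: str) -> str:
    """Restore the original attribute name when reading from the DB."""
    return stored.replace("=", ".")


stored = encode_attr_name("http://example.org/speed")
restored = decode_attr_name(stored)
```

The round trip is lossless precisely because the replacement character can never appear in a valid name.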

kzangeli commented 3 years ago

Here you can find the specs: https://www.etsi.org/committee/cim The latest NGSI-LD API spec is v1.4.1

kzangeli commented 3 years ago

Just remembered, I once wrote a short markdown about the context: https://github.com/FIWARE/context.Orion-LD/blob/develop/doc/manuals-ld/the-context.md

c0c0n3 commented 3 years ago

If you could please forget about storing the context in the DB, and instead storing the attribute name (not the alias - the expanded name, which is the real name of the attribute),

yep, like I said earlier, all things being equal, this is an excellent suggestion, but...

you will see how suddenly all your problems go away.

I wish! Like @chicco785 pointed out, reconciling our internal storage model w/ the requirements of a full-blown NGSI-LD implementation isn't straightforward and we might have to make some compromises :-)

Here you can find the specs... ... I once wrote a short markdown about the context:

excellent, thanks for the pointers, much appreciated!

chicco785 commented 3 years ago

I think is reasonable to not change this approach moving from ngsi-v2 to ngsi-ld

Well, we might have to actually. Queries anyone? I think you mentioned this already earlier, but here's the nasty scenario. I'll build on the speed attribute example from my earlier comment. To process speed we need to understand what it is, well to some extent at least. Say we've got this series
(ctx, { speed: 20 }, t1), (ctx, { speed: 30 }, t2), (ctx, { speed: 20 }, t3), (ctx', { speed: 20 }, t4)
How would we compute the average speed?! It turns out that's the wrong question since we have two series actually:
(ctx, { speed: 20 }, t1), (ctx, { speed: 30 }, t2), (ctx, { speed: 20 }, t3)
(ctx', { speed: 20 }, t4)
The average speed for ctx.speed is (20 + 30 + 20) / 3 = 23.33 km/h whereas ctx'.speed's average is 20 mph. Notice the units! Adding up values from t1 through t4 would be like adding apples and oranges, nonsense. Oh dear. Lots to think about I guess...

It stays reasonable not to change :)

This is already a limitation today: in NGSI v2 you can have metadata that specifies the unitCode, for example, so entry 1 could be in km/h and entry 2 in mph. Today we expect units to be made uniform, if required, before injection into QL. I don't see why this should change, given the overhead on injection and/or querying. While we can go on for hours thinking about whatever complex corner case, pragmatically, we support what we concretely need. Multi-unit? Not needed as of today. Easy backward compatibility with NGSIv2? Needed.

kzangeli commented 3 years ago

ok, there's a lot I don't know about your implementation ... :) Might be an option to URL-encode attribute names inside the DB? Anyhoo, if you need my help, just call. I'll be happy to help out.
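For reference, URL-encoding a long name could look like the sketch below, using Python's standard `urllib.parse`. The helper names are hypothetical. Note that `quote` never escapes dots, so this alone would not address the dot-separator issue mentioned earlier, and whether '%' is acceptable in a column name depends on the database.

```python
from urllib.parse import quote, unquote


def url_encode_name(name: str) -> str:
    """Percent-encode an expanded attribute name, including '/' and ':'."""
    # safe="" means no reserved character is left unescaped.
    return quote(name, safe="")


def url_decode_name(stored: str) -> str:
    """Recover the expanded attribute name from its encoded form."""
    return unquote(stored)


encoded = url_encode_name("http://example.org/speed")
decoded = url_decode_name(encoded)
```

The encoded form is `http%3A%2F%2Fexample.org%2Fspeed`, which decodes back to the original name.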

c0c0n3 commented 3 years ago

Anyhoo, if you need my help, just call.

awesome, thanks for offering!!

While we can go on for hours thinking about whatever complex corner case, pragmatically, we support what we need concretely

Oh dear, I've just realised I haven't explained properly what I have in mind, sorry I made a bit of a mess. My example wasn't so much about units (perhaps a corner case, but surely a welcome addition to the spec IMHO) but rather semantics. That is, the function

interpret : Context ⨉ Attribute ---> Meaning

I used earlier as a simple conceptual model to analyse the problem. If you agree the interpretation of an attribute depends on the context, it follows that to be able to interpret the attribute meaningfully in a time series, for each time point and attribute you also need to know the context that attribute came from. In other words a time series for an attribute x of an entity e becomes

(ctx, {x: a }, t1), (ctx, { x: b }, t2), (ctx, { x: c }, t3), (ctx', { x: d }, t4), ...

Notice how at time point t4 the context changed, so in actual fact (if I understand the way RDF works, not 100% sure!) e.x in ctx is not the same as e.x in ctx'. Now suppose we don't store the entirety of the context evolution over time---how we store stuff is irrelevant to my argument, we could take @kzangeli's suggestion and make it work for us or do something different. Without enough info about the context, even the most basic query of all would fail to return meaningful results. For example, if a client asks for e.x between t1 and t4, what values should we return? Surely it can't be the sequence (t1, a), (t2, b), (t3, c), (t4, d), can it? If the attributes are different there are two value sequences: a, b, c and d, but how could we even tell without knowing how the context changed over time? Also, even if we know how the context changed over time, we'd still need the client to specify which x it is referring to: is it ctx.x or ctx'.x?

kzangeli commented 3 years ago

About speed in ctx and speed in ctx' - those are two different attributes - never mind the unitCode. Two different attributes (as two different expanded names).

github-actions[bot] commented 3 years ago

Stale issue message

pooja1pathak commented 3 years ago

@chicco785 @c0c0n3

QL will persist only the last context (leveraging the metadata table) for an entityType for a given fiwareService, so this means that each time the context changes, the old one is overwritten.

QL will persist only the last context for an entityId for a given fiwareService, so this means that each time the context changes, the old one is overwritten, but different entityId can have different context (this is not particularly brilliant performance wise, because it will increase the number of queries needed to retrieve information )

QL will persist the context for each entry, so this means that you can track along time the evolution of the context, but of course you may end messing up if return context on aggregated queries.

I am not sure how you are planning to implement 1 and 2, but we can easily go with point 3, as we have stored instanceId in https://github.com/orchestracities/ngsi-timeseries-api/issues/533

I have gone through https://ngsi-ld-tutorials.readthedocs.io/en/latest/working-with-%40context.html for @context. Based on my understanding, I would like to suggest some points; please correct me if I have interpreted anything wrongly:

We can store context for each entry as suggested in point 3 in separate column.
For aggregated queries we can take all the entries if no context is provided and if context is given we can return the result of attributes of that context only.
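A rough sketch of the second point, assuming each stored row has a column holding the context it arrived with. The row layout, the `context` column name, and the `average` helper are assumptions for illustration only, not an existing QuantumLeap API.

```python
from statistics import mean

# Hypothetical rows: one entry per notification, with the context stored alongside.
rows = [
    {"context": "ctx", "speed": 20},
    {"context": "ctx", "speed": 30},
    {"context": "ctx", "speed": 20},
    {"context": "ctx'", "speed": 20},
]


def average(rows, attr, context=None):
    """Average attr over all rows, or only over rows from the given context."""
    values = [
        row[attr]
        for row in rows
        if context is None or row["context"] == context
    ]
    # No matching rows: return None rather than failing.
    return mean(values) if values else None


all_entries = average(rows, "speed")
only_new_context = average(rows, "speed", context="ctx'")
```

Without a context the aggregate covers every entry (22.5 here); with one, only the entries from that context contribute (20 here).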

I would like to contribute on this issue. Please suggest if I can go in this direction and raise PR for the same.

We can also use the core @context, i.e. https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context-v1.3.jsonld, which is implicitly included alongside the context provided.

chicco785 commented 2 years ago

@kzangeli sorry for not coming back earlier, I have been jeopardised by other priorities :/

I took some time today to think about how to handle the whole thing, and I will be happy to get your feedback and also, if you have time, to schedule a quick chat.

As previously mentioned by @c0c0n3:

storing long names in the attribute values is not ideal for a relational db (not sure it's even possible in all dbs)
plus, for a number of use cases, our users access the database directly (when performing complex queries), and this would make it complex for them to write queries using either long names or weird hashing.
more than that, we want to be sure that existing data can be retrieved in the future as ngsi-ld, without complex db rewriting.

You also suggested not storing the context, but to an extent we need to "store" it.

Today when an entity is injected, we use the attribute name to generate the column name to store the attribute values in a flat relational db format.

This means that the following payload:

{
  "id": "urn:ngsi-ld:OffStreetParking:Downtown1",
  "type": "OffStreetParking",
  "name": {
    "type": "Property",
    "value": "Downtown One"
  },
  "availableSpotNumber": {
    "type": "Property",
    "value": 121,
    "observedAt": "2017-07-29T12:05:02Z",
    "reliability": {
      "type": "Property",
      "value": 0.7
    },
    "providedBy": {
      "type": "Relationship",
      "object": "urn:ngsi-ld:Camera:C1"
    }
  },
  "totalSpotNumber": {
    "type": "Property",
    "value": 200
  },
  "location": {
    "type": "GeoProperty",
    "value": {
      "type": "Point",
      "coordinates": [-8.5, 41.2]
    }
  },
  "@context": [
    "http://example.org/ngsi-ld/latest/parking.jsonld",
    "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context-v1.5.jsonld"
  ]
}

(today) is stored as:

entity_id	name	availableSpotNumber	totalSpotNumber	location	timeIndex
urn:ngsi-ld:OffStreetParking:Downtown1	Downtown One	121	200	{"type": "GeoProperty","value": {"type": "Point","coordinates": [-8.5, 41.2]}}	2017-07-29T12:05:02Z

Beside these, we store some metadata:

table_name	entity_attrs
"etoffstreetparking"	{"totalspotnumber":["totalSpotNumber","Integer"],"entity_type":["type","Text"],"time_index":["time_index","DateTime"],"name":["name","Text"],"location":["location","geo:json"],"entity_id":["id","Text"],"availablespotnumber":["availableSpotNumber","Integer"]}

Today the metadata tell us, for a given column in the table, the original NGSIv2 attribute name and its type, e.g. the availablespotnumber column maps to the availableSpotNumber NGSIv2 attribute, whose NGSIv2 type is Integer.

Now, building on this, for NGSI-LD and aiming at backward compatibility, we could have something like:

{
  "totalspotnumber" : [
    "totalSpotNumber",
    "Integer",
    "http://example.org/ngsi-ld/latest/parking/totalSpotNumber"
  ],
  "name" : [
    "name",
    "Text",
    "https://uri.etsi.org/ngsi-ld/name"
  ],
  ...,
  "location" : [
    "location",
    "geo:json",
    "https://uri.etsi.org/ngsi-ld/location"
  ],
  ...
}

this also means that if at some point we have:

{
  "id": "urn:ngsi-ld:OffStreetParking:Downtown1",
  "type": "OffStreetParking",
  "name": {
    "type": "Property",
    "value": "Downtown One"
  },
  "http://example.org/ngsi-ld/latest/parking/name": {
    "type": "Property",
    "value": "Downtown One - Parking"
  },
  "@context": [
    "http://example.org/ngsi-ld/latest/parking.jsonld",
    "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context-v1.5.jsonld"
  ]
}

we will add new metadata, with a new column name, e.g.:

{
  "totalspotnumber" : [
    "totalSpotNumber",
    "Integer",
    "http://example.org/ngsi-ld/latest/parking/totalSpotNumber"
  ],
  "name" : [
    "name",
    "Text",
    "https://uri.etsi.org/ngsi-ld/name"
  ],
  "name-2" : [
    "name",
    "Text",
    "http://example.org/ngsi-ld/latest/parking/name"
  ],
  ...,
  "location" : [
    "location",
    "geo:json",
    "https://uri.etsi.org/ngsi-ld/location"
  ],
  ...
}
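The column allocation described above could be sketched roughly like this. It is only an illustration of the idea, assuming metadata entries of the form `[original name, type, expanded name]` as in the JSON above; the function `column_for` and its suffix scheme are hypothetical, not existing QuantumLeap code.

```python
def column_for(metadata: dict, short_name: str, ngsi_type: str,
               expanded_name: str) -> str:
    """Return the column for an attribute, registering a new one if needed."""
    # Reuse the column if this expanded name is already known. Legacy
    # two-element entries (NGSIv2 only) carry no expanded name and never match.
    for column, entry in metadata.items():
        if len(entry) > 2 and entry[2] == expanded_name:
            return column
    # Otherwise pick the first free column: name, name-2, name-3, ...
    base = short_name.lower()
    column, suffix = base, 1
    while column in metadata:
        suffix += 1
        column = f"{base}-{suffix}"
    metadata[column] = [short_name, ngsi_type, expanded_name]
    return column


metadata = {
    "name": ["name", "Text", "https://uri.etsi.org/ngsi-ld/name"],
}
first = column_for(metadata, "name", "Text",
                   "https://uri.etsi.org/ngsi-ld/name")
second = column_for(metadata, "name", "Text",
                    "http://example.org/ngsi-ld/latest/parking/name")
```

In this run `first` is the existing `name` column and `second` is a newly registered `name-2`, matching the metadata shown above.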

Does this sound reasonable, and semantically correct? As a second step, we have to think about "metadata" handling, and we could either:

create a single column for all metadata of an attribute (and provide additional mapping)
create a column for each metadata of an attribute (and provide additional mappings)

kzangeli commented 2 years ago

Let's meet and talk. My Skype handle: kzangeli

c0c0n3 commented 2 years ago

Guys, it seems to me the solution suggested by @chicco785 is the only sensible thing to do at this stage. It won't cater for several NGSI-LD features but in my opinion it'll work in the majority of cases in practice, plus it's backward compatible w/ NGSI v2 which is a boon to the majority of our users I reckon.

Here's some things we won't be able to do easily:

Multi attributes---see my comment to #552.
Multi-edge properties. This is a variation on the above theme and it can happen when merging data from multiple contexts. Here's a simplified example.
Entity name clashes---possible when merging contexts. Similar to the above.
Nested properties as well as relationships---e.g. availableSpotNumber.providedBy in the example entity above.

There might be more things we won't be able to handle, but at the end of the day if these are just corner cases, do we really want to waste a lot of dev cycles on them? To me we could just say we're almost NGSI-LD compliant and call it a day. Not sure how many NGSI-LD implementations out there can actually claim full compliance anyway. Is it?

orchestracities / ngsi-timeseries-api

NGSI-LD Context Support #468