orchestracities / ngsi-timeseries-api

QuantumLeap: a FIWARE Generic Enabler to support the usage of NGSIv2 (and NGSI-LD experimentally) data in time-series databases
https://quantumleap.rtfd.io/
MIT License
38 stars 49 forks source link

Notification not processed, Column __original_ngsi_entity__ unknown #381

Closed pinkerltm closed 3 years ago

pinkerltm commented 4 years ago

In our QL logs there is often the following entry:

2020-10-16 03:27:40.405 ERROR:reporter.reporter:Notification not processed or not updated for payload: [{'id': 'urn:ngsi-ld:Device:odp-user-data-watt:d280aa74-611c-4c10-a1d2-a0f0030b6353:watt', 'type': 'Device', 'category': {'type': 'Text', 'value': 'actuator', 'metadata': {}}, 'controlledProperty': {'type': 'Array', 'value': ['power', 'energy'], 'metadata': {}}, 'location': {'type': 'geo:json', 'value': {'coordinates': [[[15.708413583, 47.10982405], [15.708098437, 47.109178286], [15.708792658, 47.108699439], [15.709802027, 47.108866351], [15.710117187, 47.109512114], [15.709422964, 47.109990965], [15.708413583, 47.10982405]]], 'type': 'Polygon'}}, 'location_index': {'type': 'Text', 'value': '8a1e10a552d7fff', 'metadata': {'level': {'type': 'String', 'value': '10'}}}, 'refDeviceModel': {'type': 'Text', 'value': 'urn:ngsi-ld:DeviceModel:odp-user-data-watt:d280aa74-611c-4c10-a1d2-a0f0030b6353', 'metadata': {}}, 'source': {'type': 'Text', 'value': 'odp/user-data-watt', 'metadata': {}}, 'value': {'type': 'Number', 'value': '0.0', 'metadata': {'timestamp': {'type': 'DateTime', 'value': '2020-07-29T15:40:00.000Z'}}}, 'time_index': '2020-07-29T15:40:00+00:00', 'location_centroid': {'type': 'geo:point', 'value': '47.10941360785714, 15.709008634142856'}}]. SQLActionException[ColumnUnknownException: Column __original_ngsi_entity__ unknown] Thread ID: [140427210357536]  Process ID: [15]
2020-10-16 03:27:40.911 INFO:reporter.reporter:Env variable LOGLEVEL not set, using default value of: INFO Thread ID: [140427210357536]  Process ID: [7]
2020-10-16 03:27:40.913 INFO:reporter.reporter:Env variable LOGLEVEL not set, using default value of: INFO Thread ID: [140427210357536]  Process ID: [8]
2020-10-16 03:27:40.918 ERROR:translators.crate:Failed to insert entities because of below error; translator will still try saving original JSON in "etdevice".__original_ngsi_entity__ Thread ID: [140427210357536]  Process ID: [12]
Traceback (most recent call last):
  File "/src/ngsi-timeseries-api/src/translators/sql_translator.py", line 331, in _insert_entity_rows
    self.cursor.executemany(stmt, rows)
  File "/usr/local/lib/python3.8/site-packages/crate/client/cursor.py", line 67, in executemany
    self.execute(sql, bulk_parameters=seq_of_parameters)
  File "/usr/local/lib/python3.8/site-packages/crate/client/cursor.py", line 53, in execute
    self._result = self.connection.client.sql(sql, parameters,
  File "/usr/local/lib/python3.8/site-packages/crate/client/http.py", line 331, in sql
    content = self._json_request('POST', self.path, data=data)
  File "/usr/local/lib/python3.8/site-packages/crate/client/http.py", line 458, in _json_request
    _raise_for_status(response)
  File "/usr/local/lib/python3.8/site-packages/crate/client/http.py", line 187, in _raise_for_status
    raise ProgrammingError(error.get('message', ''),
crate.client.exceptions.ProgrammingError: SQLActionException[ColumnUnknownException: Column __original_ngsi_entity__ unknown]

Neither do we send any attribute called __original_ngsi_entity__ in our payload nor did we define the etdevice table in crate ourselves. Mostly did I trust that QL will create a working schema on its own. We just defined primary keys afterwards to prohibit duplicates. The name of the missing column indicates that the sql translator tries to persist an private python attribute, but thats just my uneducated guess.

I added __original_ngsi_entity__ as a text column to our crate table, but since then I wasn't able to reproduce the error. Nevertheless its just a matter of time until we will get those Errors again.

chicco785 commented 4 years ago

@pinkerltm it seems you are using latest which is not the stable release (which is still 0.7.6), anyhow ql should automatically add columns to crate (at least this is test up to crate 4.1.8) leveraging the column policy 'dynamic'.

by chance did you create/edit manually your table?

can you do "SHOW CREATE TABLE" on the related table?

chicco785 commented 4 years ago

@pinkerltm any feedback?

c0c0n3 commented 3 years ago

@pinkerltm thanks for reporting this :-)

Neither do we send any attribute called original_ngsi_entity in our payload...

QL adds that automatically since #335---have a look at that PR for details of what that column is and what is for.

The name of the missing column indicates that the sql translator tries to persist an private python attribute, but thats just my uneducated guess.

...and you guessed it right, Sir :-)

We just defined primary keys afterwards to prohibit duplicates. ... I added original_ngsi_entity as a text column to our crate table

In general, I'd advise you against changing QL generated tables in any way since it could result in all sorts of wonderful and mysterious errors if e.g. the table definition gets out of sync w/r/t entity metadata stored in the metadata table :-) Surely QL's one-size-fits-all approach to storing data isn't a good match for every data processing scenario, but that's what we've got...

Now the above two changes combined have the side-effect that every time an entity with a duplicate key is notified, the insert will fail and QL will try saving the entity data as is in __original_ngsi_entity__, but this will fail too since the expected column type is object not text---well, at least this is what I think it should happen, but the driver or Crate might convert the in-memory Python dict representing the JSON payload into a string if it notices the column type is text, I haven't tested that scenario, so don't take my word for it :-)

Nevertheless its just a matter of time until we will get those Errors again.

Yes, I'd say that may well be the case. The one thing I still don't understand is why this exception got raised:

crate.client.exceptions.ProgrammingError: SQLActionException[ColumnUnknownException: Column original_ngsi_entity unknown]

The one reason could be the table's column policy isn't set to dynamic. Like @chicco785 said, if you could please post here the output of

SHOW CREATE TABLE mt<your tenant>.etdevice

@chicco785 it looks like another issue @pinkerltm is going to have (sorry @pinkerltm!) is that the JSON payload contains nested arrays:

{ 
  'id': 'urn:ngsi-ld:Device:odp-user-data-watt:d280aa74-611c-4c10-a1d2-a0f0030b6353:watt',
  ...
  'location': {
    'type': 'geo:json', 
    'value': {'coordinates': [[[15.708413583, 47.10982405], [15.708098437, 47.109178286], ...

so even if he had never manually modified the table, every time an insert had failed (for whichever reason), QL would have crashed on trying to save the original data anyway, see #344 about it.

pinkerltm commented 3 years ago

@chicco785 @c0c0n3 my apologies for the long delay ...had another project which claimed all my attention recently, I guess you are aware of such situations.

As stated above, we rely on QL to create the table, and I just altered it to have a primary key to keep it clean of duplicates. We are really struggling to not lose any measurements/state changes, therefore we collect them in Kafka (as an queue) before handing them over to Orion-QL-Crate... and as we also want to prevent postprocessing of the QL generated table (i.e. after several kafka-consumer-replays) we need to prohibit the creation of duplicates.

Anyway, heres the output of SHOW CREATE TABLE ... as it is in use at the moment

CREATE TABLE IF NOT EXISTS "doc"."etdevice" (
   "entity_id" TEXT,
   "entity_type" TEXT,
   "time_index" TIMESTAMP WITH TIME ZONE,
   "category" TEXT,
   "controlledproperty" ARRAY(TEXT),
   "location" GEO_SHAPE INDEX USING GEOHASH,
   "location_index" TEXT,
   "refdevicemodel" TEXT,
   "source" TEXT,
   "value" TEXT,
   "location_centroid" GEO_POINT,
   "fiware_servicepath" TEXT,
   "__original_ngsi_entity__" TEXT,
   PRIMARY KEY ("entity_id", "time_index", "source", "value")
)
CLUSTERED INTO 6 SHARDS
WITH (
   "allocation.max_retries" = 5,
   "blocks.metadata" = false,
   "blocks.read" = false,
   "blocks.read_only" = false,
   "blocks.read_only_allow_delete" = false,
   "blocks.write" = false,
   codec = 'default',
   column_policy = 'strict',
   "mapping.total_fields.limit" = 1000,
   max_ngram_diff = 1,
   max_shingle_diff = 3,
   number_of_replicas = '2-all',
   refresh_interval = 1000,
   "routing.allocation.enable" = 'all',
   "routing.allocation.total_shards_per_node" = -1,
   "store.type" = 'fs',
   "translog.durability" = 'REQUEST',
   "translog.flush_threshold_size" = 536870912,
   "translog.sync_interval" = 5000,
   "unassigned.node_left.delayed_timeout" = 60000,
   "warmer.enabled" = true,
   "write.wait_for_active_shards" = '1'
)
pinkerltm commented 3 years ago

@c0c0n3 ok, if I understand all that right, I think the best approach for our scenario is to alter the type my __original_ngsi_entity__ column to object. That way I would be able to recognize if any (!) failures happen during insert and will have a greater chance to not loose any data due to schema changes for example.

P.S. sorry, giving up the primary key/duplicate prevention is not really an option for us ;)

c0c0n3 commented 3 years ago

@pinkerltm, thank you for getting back to us, no need to apologise! :-)

So I think I've got to the bottom of this. I've posted below detailed explanations but here's the take-home:

  1. You can have a PK constraint on any table created by QL since it doesn't interfere with QL's normal operation.
  2. QL creates tables with a column policy of dynamic. You can't make it strict like you have it at the moment.
  3. The __original_ngsi_entity__ has to have a type of object.
  4. Even if you fix (2) and (3) you might still experience issues when an insert fails and the notified entity payload contains a GeoJSON location with nested arrays.

(4) will take a while to fix, we have two issues to juggle: #344 and #375. One thing you can do until we fix this is ignore errors in the logs related to (4).

What caused the reported error

Your table has a column policy of strict instead of dynamic. When QL creates a new Crate table, it does so by explicitly setting column_policy = 'dynamic'. So this is what I think must've happened. You initially had QL 0.7.5 and the table got created without the __original_ngsi_entity__ column---this feature was introduced by #335 and released in version 0.7.6. Then the table definition got changed slightly by hand: adding a PK and setting the column policy to strict or perhaps you recreated the table yourself? Whatever the case, you ended up with a table, which for the sake of argument, is similar to the test table below

CREATE TABLE test (
    eid TEXT,
    tix TIMESTAMP WITH TIME ZONE, 
    PRIMARY KEY (eid, tix)
) WITH (column_policy = 'strict');

Notice if you don't specify a column policy, Crate defaults it to strict. Next you upgraded to QL 0.7.6 and on inserting data QL encountered an error---this could've been a duplicated PK or something else, it doesn't matter. What QL does if it can't insert the entity on the first shot, it tries to at least save the original JSON that was posted to the notify endpoint in the __original_ngsi_entity__ column---see #335 for the details. The SQL statements QL issued would've been be similar to the ones below:

CREATE TABLE IF NOT EXISTS test (
    eid TEXT,
    tix TIMESTAMP WITH TIME ZONE,
    __orig__ OBJECT
) WITH (column_policy = 'dynamic');

INSERT INTO test (eid, tix, __orig__) VALUES ('d:1', '2020-11-04T17:59:02.531Z', '{ "eid": "d:1" }');

Unfortunately the first statement has no effect so the table definition is left unchanged---exactly the same problem as in #375. In particular the column policy is still strict and there's no __orig__ column, so the INSERT is doomed to fail and if you try that yourself in the Crate admin console, you'll get

SQLActionException[ColumnUnknownException: Column __orig__ unknown]

So it looks like, strictly speaking, this isn't a bug in QL, the column policy shouldn't have been manually changed from dynamic to strict. Case closed? Not quite. Read on...

An accident waiting to happen

Now I think you can keep the PK constraint since it won't interfere with QL's normal operation. But you should definitely change

Cool. The meanies are lying in wait though :-) In fact, next time both the conditions below are met

Then the second insert QL attempts to recover from the first failing will fail too because the __original_ngsi_entity__ will now contain a GeoJSON object with nested arrays and Crate doesn't support that---see #343. Here's the scenario exemplified using our fictitious test table from earlier.

CREATE TABLE test (
    eid TEXT,
    tix TIMESTAMP WITH TIME ZONE,
    loc GEO_SHAPE,
    __orig__ OBJECT,
    PRIMARY KEY (eid, tix)
);

Now say QL gets notified of an entity update, tries to insert it

INSERT INTO test (eid, tix, loc) 
VALUES ('d:1', '2020-11-04T17:59:02.531Z', 
    { type = 'Polygon', coordinates = [[ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0] ]] }
);

and it goes well but soon afterwards, it gets notified of the same update again, so the PK constraint will make the insert fail---notice the insert could fail for any other reason, the crux of the matter here is that there's a location in the payload containing nested arrays. Next thing QL tries to do is stash away the original JSON POSTed to the notify endpoint with another insert similar to

INSERT INTO test (eid, tix, __orig__) 
VALUES ('d:1', '2020-11-04T17:59:02.531Z',
    '{
      "eid": "d:1",
      "loc": { "type": "Polygon", "coordinates": [[ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0] ]] }
    }'
 );

but this will fail too because the JSON in the __orig__ column value contains nested arrays. Oh dear. It looks like we're stuck between a rock and a hard place. Unfortunately I can't offer you any workaround at the moment except for ignoring this kind of failure for the time being. We'll have to rethink how to recover from insert failures, so hopefully this won't be an issue going forward, keep an eye on #344.

chicco785 commented 3 years ago

@pinkerltm i suppose we can close the issue given that ColumnUnknownException: Column __original_ngsi_entity__ unknown is caused by hacking of tables and not a QL bug.

pinkerltm commented 3 years ago

@chicco785 wasn't able to verify it yet, but with the great explanation I should be able to fix it. Thanks a lot

chicco785 commented 3 years ago

@pinkerltm fyi in PR#373 we are changing columns default from dynamic to strict and manage the add of columns in ql, which would avoid any issue, when it's released. it may be experimental for a while though, due to the fact of quite some changes we introduced, and that may need additional testing. these are all updates coming: