Orion responses with stack hang up and subscriptions with patterns

tstorek commented 5 years ago

Hello,

I am using Orion in a larger scenario with docker-swarm, set up as follows:

1 MongoDB (v4)(Not scaled yet)
5 CBs (v2.2.0) in parallel
5 QuantumLeaps
IoTA-UL

I am sending about 50-60 values per second. Unfortunately, not all of them are handled well, and my IoTA error log keeps growing, always with the same message:

time=2019-06-24T22:29:26.001Z | lvl=ERROR | corr=7afde723-3e03-41f2-b517-771052166456 | trans=7afde723-3e03-41f2-b517-771052166456 | op=IoTAgentNGSI.NGSIService | srv=rwth_n5geh_acsebc | subsrv=/bldg/rwth4120/moni | msg=Error found executing update action in Context Broker: Error: socket hang up | comp=IoTAgent
time=2019-06-24T22:29:26.001Z | lvl=ERROR | corr=7afde723-3e03-41f2-b517-771052166456 | trans=7afde723-3e03-41f2-b517-771052166456 | op=IOTAUL.Common.Binding | srv=rwth_n5geh_acsebc | subsrv=/bldg/rwth4120/moni | msg= MEASURES-002: Couldn't send the updated values to the Context Broker due to an error: Error: socket hang up | comp=IoTAgent
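
To see how often each error occurs, log entries in this "key=value | key=value" layout can be tallied with a few lines of Python. This is only a sketch: it assumes one entry per line and " | " as the field separator, and the function names are made up for illustration.

```python
from collections import Counter


def parse_log_line(line):
    """Split one 'key=value | key=value' log entry into a dict."""
    fields = {}
    for part in line.split(" | "):
        key, sep, value = part.partition("=")
        if sep:  # ignore fragments that are not key=value shaped
            fields[key.strip()] = value.strip()
    return fields


def count_errors(lines):
    """Count ERROR entries per (op, msg) pair."""
    counts = Counter()
    for line in lines:
        fields = parse_log_line(line)
        if fields.get("lvl") == "ERROR":
            counts[(fields.get("op"), fields.get("msg"))] += 1
    return counts


# Shortened copy of the first log entry above, repeated twice.
sample = (
    "time=2019-06-24T22:29:26.001Z | lvl=ERROR | "
    "corr=7afde723-3e03-41f2-b517-771052166456 | "
    "op=IoTAgentNGSI.NGSIService | "
    "msg=Error found executing update action in Context Broker: "
    "Error: socket hang up | comp=IoTAgent"
)
error_counts = count_errors([sample, sample])
```

Grouping by op and msg makes it easier to tell whether the "socket hang up" entries dominate or whether other errors are mixed in.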

After a while I find another ERROR in my log:

iotagent-ul > time=2019-06-25T14:39:50.062Z | lvl=ERROR | corr=7afde723-3e03-41f2-b517-771052166456 | trans=7afde723-3e03-41f2-b517-771052166456 | op=IoTAgentNGSI.Alarms | srv=rwth_n5geh_acsebc | subsrv=/bldg/rwth4120/moni | msg=Raising [MONGO-ALARM]: {"name":"DEVICE_GROUP_NOT_FOUND","message":"Couldn\t find device group","code":404} | comp=IoTAgent

For my 18 subscriptions I use patterns based on the entity type in order to manage my scenario with over 14000 data sources. They look like this in the database:

{"_id":"5d0ce62ed9aa14f90c265192","expiration":"1577210400","reference":"http://quantumleap:8668/v2/notify","custom":false,"throttling":"0","servicePath":"/bldg/rwth4120/moni","status":"active","entities":[{"id":".*","isPattern":"true","type":"bacnet:sensor:co2","isTypePattern":false}],"attrs":[],"metadata":["dateCreated","dateModified"],"blacklist":false,"conditions":[],"lastNotification":"1561478113","count":"235519","expression":{"q":"","mq":"","geometry":"","coords":"","georel":""},"format":"normalized","lastSuccess":"1561478113","lastSuccessCode":"200"}
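
For comparison, a subscription of this kind could be built through the NGSIv2 API roughly as follows. This is a minimal sketch, not the exact request used here: the broker address http://orion:1026 and the helper names are assumptions, while the type, notification URL, service and service path are taken from the post above.

```python
import json
import urllib.request


def build_type_subscription(entity_type, notify_url):
    """Build an NGSIv2 subscription covering every entity of one type."""
    return {
        "description": f"Notify on changes of {entity_type} entities",
        "subject": {
            "entities": [{"idPattern": ".*", "type": entity_type}],
            "condition": {"attrs": []},  # empty list: any attribute change
        },
        "notification": {
            "http": {"url": notify_url},
            "attrs": [],
            "metadata": ["dateCreated", "dateModified"],
        },
    }


def build_request(orion_url, service, service_path, payload):
    """Prepare (but do not send) the POST /v2/subscriptions request."""
    return urllib.request.Request(
        url=f"{orion_url}/v2/subscriptions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Fiware-Service": service,
            "Fiware-ServicePath": service_path,
        },
        method="POST",
    )


payload = build_type_subscription(
    "bacnet:sensor:co2", "http://quantumleap:8668/v2/notify"
)
request = build_request(
    "http://orion:1026",  # assumed broker host and port
    "rwth_n5geh_acsebc",
    "/bldg/rwth4120/moni",
    payload,
)
# urllib.request.urlopen(request) would send it to a running broker.
```

One such subscription per entity type covers all matching entities, which is how 18 subscriptions can serve the whole scenario.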

With these errors, I cannot guarantee that all subscriptions are executed correctly and that the platform will be stable in the end :/

Is this a bug? I face the same issue without scaling the CB. Or do I need to move it to Stack Overflow? From my understanding, this data flow shouldn't be an issue for FIWARE to handle. On Stack Overflow I only found this, but it does not help directly because I would like to use the described setup: https://stackoverflow.com/questions/53818381/how-can-i-improve-socket-hang-up-when-connecting-many-devices

Thanks in advance

tstorek commented 5 years ago

I have not tried threadpool yet because I am not sure if it is the right fit for this scenario. It does not matter whether I use the cache or not. As I understand it, though, "-noCache" is only meant for debugging and for very large scenarios with over 300000 subscriptions. I only found this old issue on the topic: https://github.com/telefonicaid/fiware-orion/issues/2780

Hence, is it always good to keep the number of individual subscriptions down and use patterns? Or will both have the same effect?

kzangeli commented 5 years ago

300,000 subscriptions ... that's a lot :) When using sub cache they are all stored in RAM. Keeping down the number of subscriptions by using patterns seems like a good idea.

tstorek commented 5 years ago

Well, assuming a district of buildings, each with about 10000 data points, that number is reached quite quickly :) if I subscribe to each data point individually in order to store it with QuantumLeap and Crate. Individual subscriptions would make it easier to throttle individual data points. However, I am not sure how Mongo iterates over entries, so maybe there is no real difference between patterns and individual subscriptions other than that I get a better overview. What would be the influence of throttling here? Does it apply to the subscription pattern as a whole, or individually to each entity that matches the pattern? Otherwise its use together with patterns is not clear to me.

But back to the topic. Would threadpooling really help here, given that I only send notifications to my QuantumLeap replicas so far? What would happen to the load balancing mechanism if I keep the connections open?

fgalan commented 5 years ago

I think we are mixing up two concepts: entities (representing the data points in a building) and subscriptions (the way Orion implements asynchronous context consumption, based on notifications). As far as I understand from your case description:

For my 18 subscriptions [...] with over 14000 data sources.

So you have 14000 entities and 18 subscriptions. Is that correct?

In addition, I don't fully understand what you mean by:

I cannot guarantee that all subscriptions are executed correctly

What do you mean by "subscription execution"? Could you elaborate on it, please?

fgalan commented 5 years ago

Side note: MongoDB 4.0 is not the official MongoDB version supported by Orion (check: https://fiware-orion.readthedocs.io/en/master/admin/install/index.html#requirements). It probably has no impact in this case, but I wanted to let you know ;)

fgalan commented 5 years ago

With regards to:

1 MongoDB (v4)(Not scaled yet)

5 CBs (v2.2.0) in parallel

5 QuantumLeaps

IoTA-UL

In addition, could you attach an architecture diagram showing how they are connected, please?

tstorek commented 5 years ago

@fgalan I think we both have the same understanding:

the data points (incl. static metadata) are the devices and entities in this case, because I use a 1-to-1 relation for now.
the subscriptions I make are based on a type that is generated during provisioning for each device measuring a certain property, based on a small data model (e.g. for a temperature sensor located in my building automation network (BACnet) --> bacnet:sensor:temperature)

Since the type mapping is limited for now, I end up with 18 subscriptions in total.

But in an earlier version without the data model, I had to subscribe to each entity individually in order to watch for updates on my active data attribute and notify one of the QuantumLeap replicas. This meant 14000 subscriptions.

By the execution of a subscription I mean the execution of the notification mechanism. I assume that whenever an update is recognized, one of my CBs executes the notification mechanism and updates the lastNotification attribute. Thus, the other CBs do not execute it anymore until the state of the subscription is set to pending again. (Is this correct?) That's how I understand the documentation for scaling Orion. Only throttling is not clear to me in this case.

Architecture-wise, I use Docker with replicas for scaling the CB and QuantumLeap. Hence, I use the Docker DNS for distributing the load. --> The notification is always sent to the same URL and Docker does the rest.

I don't think the MongoDB version should make any difference, but I can also downgrade for testing. Would anything in https://docs.mongodb.com/manual/release-notes/4.0-compatibility/ make a difference for Orion?

I can also provide a sketch later in the day if still needed.

fgalan commented 5 years ago

I can also provide a sketch later in the day if still needed.

Yes, please. I think it would help to clarify the scenario. Once you do, I'll read your comment in depth and provide feedback.

tstorek commented 5 years ago

Hi again,

Sorry for not coming back to this for such a long time. I still owe you a sketch of the setup, so here are two. Setup:

platform_setup

dataflow:

dataflow

I added the number of replicas to the sketches. The whole platform runs on a 1-node docker swarm. That is also the reason why we have not replicated the databases yet.

The OpenMUC module is a middleware that only ensures the communication with BACnet. To the platform it simply looks as if there were UL IoT devices sending data on their topics.

I think this issue is also related to 383 in iot-agent-ul https://github.com/telefonicaid/iotagent-ul/issues/383

Hence, is there any solution other than tuning? I am not quite sure whether tuning would solve the problem in general. I hope the sketches help with understanding the setup.

Cheers

fgalan commented 5 years ago

Thanks for the diagram. Some additional pieces of information I'd need:

Could you also provide an example of entity and an example of subscription (as provided in the API, i.e. GET /v2/entities/{entityId} and GET /v2/subscriptions/{subId}, not in DB)?
How many entities do you have in total?
How many subscriptions do you have in total?
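
As a rough illustration of the two API calls mentioned above, the following sketch prepares GET /v2/entities/{entityId} and GET /v2/subscriptions/{subId} requests. The broker address http://orion:1026, the entity id "sensor001" and the helper names are placeholders; the subscription id, service and service path are the ones quoted earlier in this issue.

```python
import json
import urllib.request


def ngsi_get(orion_url, path, service, service_path):
    """Prepare a GET request against Orion's NGSIv2 API."""
    return urllib.request.Request(
        url=f"{orion_url}{path}",
        headers={
            "Fiware-Service": service,
            "Fiware-ServicePath": service_path,
        },
        method="GET",
    )


def fetch_json(request):
    """Send the request and decode the JSON body (needs a reachable broker)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


ORION = "http://orion:1026"  # assumed broker host and port
SERVICE = "rwth_n5geh_acsebc"
SERVICE_PATH = "/bldg/rwth4120/moni"

entity_request = ngsi_get(
    ORION, "/v2/entities/sensor001", SERVICE, SERVICE_PATH  # placeholder id
)
subscription_request = ngsi_get(
    ORION, "/v2/subscriptions/5d0ce62ed9aa14f90c265192", SERVICE, SERVICE_PATH
)
# fetch_json(entity_request) and fetch_json(subscription_request) would
# return the API representation rather than the raw DB document.
```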

(Looking at the issue comments above, I'm afraid it's not fully clear to me... maybe I'm not remembering correctly given the time that has passed ;)

In the meantime, I'll try to answer some of the points mentioned above:

ONE: Would threadpooling really help here?

My suggestion is to test the three notification modes (transient, persistent and threadpool), evaluate the results and use the most convenient one for your case.
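
To make such a comparison repeatable, the broker arguments for each mode can be generated instead of typed by hand. Orion's -notificationMode option takes transient, persistent or threadpool:q:n (queue size q, n worker threads). The helper name and the queue and worker values below are arbitrary starting points for a test, not recommendations.

```python
def notification_mode_args(mode, queue_size=None, workers=None):
    """Return the Orion CLI arguments selecting one notification mode."""
    if mode in ("transient", "persistent"):
        return ["-notificationMode", mode]
    if mode == "threadpool":
        if not queue_size or not workers:
            raise ValueError("threadpool needs a queue size and a worker count")
        return ["-notificationMode", f"threadpool:{queue_size}:{workers}"]
    raise ValueError(f"unknown notification mode: {mode}")


# One candidate configuration per mode; the numbers are illustrative only.
candidates = [
    notification_mode_args("transient"),
    notification_mode_args("persistent"),
    notification_mode_args("threadpool", queue_size=60000, workers=5),
]

for args in candidates:
    print(" ".join(args))
```

Each line printed can be appended to the broker's command in the stack definition, after which the same load test is run against every variant.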

TWO: What would be the influence of throttling here? Does it apply to the subscription pattern as a whole, or individually to each entity that matches the pattern? Otherwise its use together with patterns is not clear to me.

Throttling is evaluated per subscription. In addition, it has some limitations in a multi-CB scenario, which I understand yours is. From https://fiware-orion.readthedocs.io/en/master/user/ngsiv2_implementation_notes/index.html#notification-throttling:

In addition, Orion implements throttling in a local way. In multi-CB configurations, take into account that the last-notification measure is local to each Orion node. Although each node periodically synchronizes with the DB in order to get potentially newer values (more on this here) it may happen that a particular node has an old value, so throttling is not 100% accurate.

In general, we don't recommend using throttling; implementing notification flow control in the receiver is preferable.
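
The two effects described above (throttling counted per subscription, and each node keeping its own last-notification value) can be illustrated with a toy model. This is a deliberately simplified simulation, not Orion's actual implementation; the class, the node names and the 10-second window are invented for the example.

```python
class Node:
    """One broker node with its own cached view of lastNotification."""

    def __init__(self, name):
        self.name = name
        self.last_notification = {}  # subscription id -> timestamp (seconds)

    def maybe_notify(self, sub_id, throttling, now):
        """Notify unless this node saw a notification inside the window."""
        last = self.last_notification.get(sub_id)
        if last is not None and now - last < throttling:
            return False  # still inside the throttling window
        self.last_notification[sub_id] = now
        return True

    def sync_from(self, db):
        """Adopt newer lastNotification values from the shared store."""
        for sub_id, ts in db.items():
            if ts > self.last_notification.get(sub_id, float("-inf")):
                self.last_notification[sub_id] = ts


THROTTLING = 10  # seconds, illustrative
node_a, node_b = Node("cb-1"), Node("cb-2")

results = [
    # First update of some entity matching the pattern: notified.
    node_a.maybe_notify("sub-co2", THROTTLING, now=0),
    # Another entity, same subscription, same node: suppressed.
    node_a.maybe_notify("sub-co2", THROTTLING, now=1),
    # Other node with a stale view: notified despite the window.
    node_b.maybe_notify("sub-co2", THROTTLING, now=2),
]

# After syncing with the shared store, node_a honours node_b's timestamp.
node_a.sync_from({"sub-co2": 2})
results.append(node_a.maybe_notify("sub-co2", THROTTLING, now=11))
```

In this model a pattern subscription with throttling drops updates from all matching entities during the window, and a second node can still notify in between, which is why flow control in the receiver tends to be easier to reason about.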

THREE: By the execution of a subscription I mean the execution of the notification mechanism. I assume that whenever an update is recognized, one of my CBs executes the notification mechanism and updates the lastNotification attribute. Thus, the other CBs do not execute it anymore until the state of the subscription is set to pending again. (Is this correct?) That's how I understand the documentation for scaling Orion. Only throttling is not clear to me in this case.

As far as I remember, there isn't any "pending" status for subscriptions (the statuses are "active", "inactive", "expired", "oneshot", etc.). A good description of how subscriptions and notifications work in multi-CB scenarios can be found here: https://stackoverflow.com/questions/43857300/what-would-be-the-behavior-of-subscriptions-and-notifications-in-an-orion-load-b/43873643#43873643 (if you find it useful, please give it a +1, so it ranks higher in SOF/Google searches and is useful to more users :)

FOUR: I think this issue is also related to 383 in iot-agent-ul telefonicaid/iotagent-ul#383

I had a look at it and provided some feedback on that issue.

telefonicaid / fiware-orion

Orion responses with stack hang up and subscriptions with patterns #3519