telefonicaid / fiware-orion

Context Broker and CEF building block for context data management, providing NGSI interfaces.
https://github.com/telefonicaid/fiware-orion/blob/master/doc/manuals/orion-api.md
GNU Affero General Public License v3.0

BUG: In-flight operations sometimes fail when the primary node changes in a MongoDB replica set #3717

Closed cesarjorgemartinez closed 3 years ago

cesarjorgemartinez commented 4 years ago

Hi,

Sometimes, when using MongoDB as a replica set, in-flight operations fail when the primary node changes.

Seen with version 2.4.1.

Example logs (at WARN level):

time=2020-11-04T10:10:02.100Z | lvl=ERROR | corr=e70fa0be-1e85-11eb-a2d9-fa163e83ea20 | trans=1601392977-831-00118230617 | from=x.y.z.u | srv=cjmm | subsrv=/MovilidadControlAccesos | comp=Orion | op=AlarmManager.cpp[211]:dbError | msg=Raising alarm DatabaseError: collection: orion-cjmm.entities - count(): { _id.id: "IS_CIR_007_001", _id.type: "KeyPerformanceIndicator", _id.servicePath: "/pepito" } - exception: ReplicaSetMonitor no master found for set: cb_rs0
time=2020-11-04T10:10:02.568Z | lvl=ERROR | corr=e755efe2-1e85-11eb-85a0-fa163e83ea20 | trans=1601392977-831-00118230618 | from=x.y.z.u | srv=cjmm | subsrv=/pepito | comp=Orion | op=AlarmManager.cpp[235]:dbErrorReset | msg=Releasing alarm DatabaseError
time=2020-11-04T10:10:30.821Z | lvl=WARN | corr=f82a97b4-1e85-11eb-849c-fa163e83ea20 | trans=1601392977-831-00118233293 | from=10.0.0.36 | srv=elenita | subsrv=<none> | comp=Orion | op=httpRequestSend.cpp[583]:httpRequestSendWithCurl | msg=Notification response NOT OK, http code: 500
time=2020-11-04T10:10:30.828Z | lvl=WARN | corr=f82a97b4-1e85-11eb-849c-fa163e83ea20 | trans=1601392977-831-00118233290 | from=10.0.0.36 | srv=elenita | subsrv=<none> | comp=Orion | op=httpRequestSend.cpp[583]:httpRequestSendWithCurl | msg=Notification response NOT OK, http code: 500
time=2020-11-04T10:13:39.736Z | lvl=ERROR | corr=68c639a6-1e86-11eb-a988-fa163e83ea20 | trans=1601392977-831-00118240707 | from=a.b.c.d | srv=avelino | subsrv=/avelino | comp=Orion | op=AlarmManager.cpp[211]:dbError | msg=Raising alarm DatabaseError: collection: orion-avelino.entities - count(): { _id.id: "ParkingAccess:10AB1EBO1", _id.type: "ParkingAccess", _id.servicePath: "/avelino" } - exception: count fails:{ operationTime: Timestamp 1604484818|2, ok: 0.0, errmsg: "not master and slaveOk=false", code: 13435, codeName: "NotMasterNoSlaveOk", $clusterTime: { clusterTime: Timestamp 1604484818|3, signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
time=2020-11-04T10:13:41.217Z | lvl=ERROR | corr=69a94a16-1e86-11eb-8e00-fa163e83ea20 | trans=1601392977-831-00118240709 | from=d.e.f.g | srv=cjmm | subsrv=/audifono | comp=Orion | op=AlarmManager.cpp[235]:dbErrorReset | msg=Releasing alarm DatabaseError
time=2020-11-04T10:22:39.774Z | lvl=ERROR | corr=aa9e37b0-1e87-11eb-9f36-fa163e83ea20; perseocep=1903508 | trans=1601392977-831-00118260570 | from=10.0.0.33 | srv=elenita | subsrv=/elenita | comp=Orion | op=AlarmManager.cpp[211]:dbError | msg=Raising alarm DatabaseError: collection: orion-elenita.entities - count(): { _id.id: "OffStreetParking:BO3", _id.type: "OffStreetParking", _id.servicePath: "/pct_cartuja" } - exception: socket exception [SEND_ERROR] for 10.0.0.19:27017
time=2020-11-04T10:22:39.774Z | lvl=ERROR | corr=aa9e37b0-1e87-11eb-9f36-fa163e83ea20; perseocep=1903509 | trans=1601392977-831-00118260566 | from=10.0.0.33 | srv=elenita | subsrv=/elenita | comp=Orion | op=AlarmManager.cpp[235]:dbErrorReset | msg=Releasing alarm DatabaseError

Regards, Cesar Jorge
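For anyone triaging similar logs, here is a minimal sketch of how entries like the ones above could be split into fields and filtered for failover-related database errors. It assumes each entry sits on a single line with ` | `-separated `key=value` fields and `msg` as the last field. The function names and the marker strings (taken from the three exception messages in this report) are illustrative assumptions, not part of Orion.

```python
def parse_orion_log_line(line):
    """Split one log line into a dict of its key=value fields.

    Assumes fields are separated by " | " and that msg is the last field,
    so any " | " inside the message itself is preserved.
    """
    head, sep, msg = line.partition(" | msg=")
    fields = {}
    for part in head.split(" | "):
        key, eq, value = part.partition("=")
        if eq:
            fields[key.strip()] = value.strip()
    if sep:
        fields["msg"] = msg.strip()
    return fields


# Substrings seen in the exception messages of this report (illustrative list).
FAILOVER_MARKERS = ("no master found", "not master", "socket exception")


def looks_like_failover_error(fields):
    """Return True for ERROR entries whose message matches a failover marker."""
    msg = fields.get("msg", "")
    return fields.get("lvl") == "ERROR" and any(m in msg for m in FAILOVER_MARKERS)


if __name__ == "__main__":
    # Abbreviated version of the first entry in the report.
    sample = (
        "time=2020-11-04T10:10:02.100Z | lvl=ERROR | srv=cjmm | comp=Orion | "
        "op=AlarmManager.cpp[211]:dbError | msg=Raising alarm DatabaseError: "
        "exception: ReplicaSetMonitor no master found for set: cb_rs0"
    )
    entry = parse_orion_log_line(sample)
    print(entry["op"], looks_like_failover_error(entry))
```

The `dbErrorReset` entries are intentionally not matched, since they report the alarm being released rather than a failure.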

cesarjorgemartinez commented 4 years ago

This behaviour is also detected in other components (for example, iota-ul).

fgalan commented 4 years ago

My first hypothesis was that this is caused by a limitation in the legacy driver used by Orion, which may be solved in the new driver with better support for in-flight operations during a primary change. The driver migration, as you may know, is pending (issue #3132 and main PR, in WIP status: https://github.com/telefonicaid/fiware-orion/pull/3622).

However, then you say:

This behaviour is also detected in other components (for example, iota-ul).

which may invalidate my hypothesis, as these other components are supposed to use up-to-date (or almost) driver versions :)
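To illustrate what handling an in-flight operation during a primary change could look like on the client side, here is a minimal retry sketch. It is not Orion's code nor any driver's API: `NotPrimaryError` is a hypothetical stand-in for whatever "not master" or "no primary found" error a driver raises, and the attempt count and delay are arbitrary assumptions.

```python
import time


class NotPrimaryError(Exception):
    """Hypothetical stand-in for a driver's 'not master' / 'no primary' error."""


def run_with_retry(operation, attempts=3, delay_s=0.5, sleep=time.sleep):
    """Call operation(); on NotPrimaryError, wait and try again.

    Re-raises the last error once the attempts are exhausted. Only safe for
    operations that can be repeated without side effects (for example, a count).
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except NotPrimaryError:
            if attempt == attempts:
                raise
            # Give the replica set time to elect a new primary.
            sleep(delay_s)
```

A caller would wrap a read such as the failing `count()` in a zero-argument function and pass it to `run_with_retry`. Writes need more care, because a retried write may be applied twice.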

cesarjorgemartinez commented 4 years ago

It may be that this hypothesis starts from assumptions that are not correct. For example, other components also present this problem.

fgalan commented 3 years ago

The MongoDB driver has been completely replaced in PR #3622. If this problem is still happening, it will appear in a completely different form. Thus, I think it is better to close this issue and open a fresh one if that occurs.