orchestracities / ngsi-timeseries-api

QuantumLeap: a FIWARE Generic Enabler to support the usage of NGSIv2 (and NGSI-LD experimentally) data in time-series databases
https://quantumleap.rtfd.io/
MIT License

Broken Pipe Error #643

Closed UncleDinoso closed 2 years ago

UncleDinoso commented 2 years ago

Hi, we are getting an error in QuantumLeap in different environments and don't know how to fix it.

Environment: Azure and Nutanix Kubernetes, TimescaleDB

Error: Broken Pipe, see screenshot (sorry for not posting the log text, the log is gone): MicrosoftTeams-image (1)

chicco785 commented 2 years ago

we haven't observed anything similar, and the information you provided does not allow us to reproduce the issue (see the issue template).

it could be that this is related to something in your cluster configuration: basically, either some service in k8s or timescale is interrupting the connection. while we could put a retry mechanism in place to mitigate the issue, the main problem is to avoid connections to the db being constantly cut in the first place.

UncleDinoso commented 2 years ago

Hi, thank you for your reply. We have seen it in two different environments, but I will open an issue using the template.

chicco785 commented 2 years ago

did you try different versions of ql?

are you using other services to connect to timescale? can you check in the timescale log what happens to connections? are you using patroni or other psql clustering that could attempt load balancing?

UncleDinoso commented 2 years ago

Because we use ql in a production environment, we don't want to switch between versions. We use version 0.8.1 with the Helm chart. I think the problem is network issues within the cluster that result in a broken pipe error on the connection to TimescaleDB, and ql is not resilient enough to handle them and retry. Would it be possible to improve the exception handling? Maybe we will have to switch to CrateDB if ql supports it better.

chicco785 commented 2 years ago

as said, having some exception handling is possible, but it's only a mitigation. the network should be more stable: if the error occurs often, the overhead of mitigating the infrastructure failure is going to compromise performance.

@c0c0n3 is the expert, but you are free to open a pr adding recovery logic for the issue. all attempts to write to the db are protected with a connection guard: https://github.com/orchestracities/ngsi-timeseries-api/blob/66c57d33a550b431432adfb40f20e8a4ec0d7730/src/translators/timescale.py#L127

that triggers a sql error handler: https://github.com/orchestracities/ngsi-timeseries-api/blob/66c57d33a550b431432adfb40f20e8a4ec0d7730/src/translators/timescale.py#L119

that based on the identified error: https://github.com/orchestracities/ngsi-timeseries-api/blob/66c57d33a550b431432adfb40f20e8a4ec0d7730/src/translators/errors.py#L56

may perform a recovery action or just discard.

you could extend the analyser to identify the network glitch and trigger a retry.
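
To make the idea of "identify the glitch, then retry" concrete, here is a minimal, generic sketch. It is not QuantumLeap's error analyser: `TRANSIENT_ERRORS`, `is_transient` and `with_retry` are hypothetical names used only for illustration, and it assumes the failure reaches the caller as one of Python's built-in connection exceptions. Depending on the driver, a broken pipe may instead be wrapped in a driver-specific exception, in which case the check would have to inspect that.

```python
import time

# Hypothetical names for illustration only; none of this is QuantumLeap code.
# Assumption: the network glitch surfaces as one of these built-in exceptions.
TRANSIENT_ERRORS = (BrokenPipeError, ConnectionResetError, TimeoutError)


def is_transient(error: BaseException) -> bool:
    """Classify an error as a network glitch that is worth retrying."""
    return isinstance(error, TRANSIENT_ERRORS)


def with_retry(action, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call action(); on a transient error, wait and try again.

    The delay doubles after every failed attempt (exponential backoff).
    Non-transient errors, and the last transient one, are re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as error:
            if not is_transient(error) or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A caller would wrap the database write, for example `with_retry(lambda: insert_rows(rows))`, where `insert_rows` stands in for whatever function performs the insert. Note that the backoff adds latency on every failure, which is the overhead mentioned above when the network fails often.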

another option (but I am not sure it was tested with this type of error) is configuring QL to use async mode. in this case, there should already be an automatic retry mechanism for failed payloads that are stored in the queue.

https://quantumleap.readthedocs.io/en/latest/admin/wq/

c0c0n3 commented 2 years ago

Hi @UncleDinoso,

Thanks for the heads up about BrokenPipe errors. We actually experienced that in one of our commercial Quantum Leap deployments too which is why we've tried improving the exception handling code in Quantum Leap as @chicco785 pointed out. Surely we could do more and in fact we're looking to replace the pg8000 driver with Psycopg which is supposed to pool connections and so be able to recover sort of automatically from a network failure---i.e. what most likely caused your broken pipe error too.

But keep in mind that if you only have one master db process and a flaky network (that's the kind of situation we had in that commercial project I mentioned earlier), no matter how reliable we try to make Quantum Leap's data ingestion procedure, you could still wind up in a situation where a lot of data can't be saved to the DB. For example, think of a DB connection pool where all the connections get retried and eventually the pool is depleted because the network is still down, or the DB master process is overloaded and can't accept any more connections.

Like @chicco785 said, we've got a solution for this kind of scenario, but it involves deploying an extra Quantum Leap process acting as a work queue consumer, so failed data inserts can be scheduled for retry. This should also help reduce load on the DB server, which could be another reason for the broken pipe errors. Unfortunately, at the moment we don't have Helm charts for this kind of deployment. We'll eventually put them together, but for the moment we welcome PRs :-)
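
As a rough illustration of the work queue idea (failed inserts are put back on a queue and retried by a consumer), here is a small in-memory sketch. It does not mirror Quantum Leap's actual work queue: `drain`, the `(payload, attempts)` tuple layout and `max_attempts` are hypothetical, and a real consumer would use a persistent queue and wait between retries.

```python
from collections import deque

# Hypothetical sketch; names and data layout are illustrative only and do not
# reflect Quantum Leap's work queue implementation.


def drain(queue, insert, max_attempts=3):
    """Consume (payload, attempts) items, re-enqueueing failed inserts.

    `attempts` is the number of failed tries so far. A payload is tried at
    most `max_attempts` times; payloads that exhaust their attempts are
    returned so they can be logged or stored elsewhere.
    """
    dead = []
    while queue:
        payload, attempts = queue.popleft()
        try:
            insert(payload)
        except ConnectionError:  # includes BrokenPipeError
            if attempts + 1 < max_attempts:
                queue.append((payload, attempts + 1))  # schedule a retry
            else:
                dead.append(payload)  # attempts exhausted, give up
    return dead
```

Because the failed payload goes to the back of the queue, other payloads are processed before it is tried again, which spreads retries out instead of hammering the DB with the same insert.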

UncleDinoso commented 2 years ago

Hi @all, thanks for your replies! I will discuss them with my team. The issue can be closed for now.

chicco785 commented 2 years ago

thanks, we are curious about your use case, and if you are interested, we are happy to add your company among the users. https://github.com/orchestracities/ngsi-timeseries-api#adopters