Closed UncleDinoso closed 2 years ago
We haven't observed anything similar, and the information you provided doesn't allow us to reproduce the issue (see the issue template).
It could be that this is related to something in your cluster configuration: basically, either some service in Kubernetes or TimescaleDB interrupts the connection. While we could put a retry mechanism in place to mitigate the issue, the main problem here is to avoid connections to the DB being constantly cut in the first place.
Hi, thank you for your reply. We have seen it in two different environments, but I will file an issue using the template.
Did you try different versions of QL?
Are you using other services to connect to TimescaleDB? Can you check in the TimescaleDB log what happens to the connections? Are you using Patroni or another PostgreSQL clustering solution that could attempt load balancing?
Because we use QL in a production environment, we don't want to switch between different versions. We use version 0.8.1 with the Helm chart. I think the problem is network issues inside the cluster, which result in a broken pipe error towards TimescaleDB, and QL is not resilient enough to handle them and retry. Maybe an improvement of the exception handling is possible? Maybe we'll have to switch to CrateDB if QL supports it better.
As said, having some exception handling is possible, but it's only a mitigation. The network should be more stable; if the error occurs often, the overhead of mitigating the infrastructure failure is going to compromise performance.
@c0c0n3 is the expert, but you are free to open a PR adding recovery logic for the issue. All attempts to write to the DB are protected with a connection guard: https://github.com/orchestracities/ngsi-timeseries-api/blob/66c57d33a550b431432adfb40f20e8a4ec0d7730/src/translators/timescale.py#L127
That triggers a SQL error handler: https://github.com/orchestracities/ngsi-timeseries-api/blob/66c57d33a550b431432adfb40f20e8a4ec0d7730/src/translators/timescale.py#L119
which, based on the identified error: https://github.com/orchestracities/ngsi-timeseries-api/blob/66c57d33a550b431432adfb40f20e8a4ec0d7730/src/translators/errors.py#L56
may either perform a recovery action or just discard the write.
You could extend the analyser to identify the network glitch and trigger a retry.
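To give an idea of what such a retry could look like, here is a minimal sketch in plain Python. The names (`retry_on_network_glitch`, `TRANSIENT_ERRORS`) are illustrative, not QuantumLeap's actual API; a real PR would hook this into the error analyser linked above:

```python
import time

# Errors we'd treat as transient network glitches worth retrying.
# In QL's case the pg8000 driver may surface these wrapped in its own
# exception types, so the real check would be broader.
TRANSIENT_ERRORS = (BrokenPipeError, ConnectionResetError)

def retry_on_network_glitch(max_attempts=3, backoff_secs=0.5):
    """Retry the wrapped DB operation on transient network errors."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except TRANSIENT_ERRORS:
                    if attempt == max_attempts:
                        raise  # give up after the last attempt
                    time.sleep(backoff_secs * attempt)  # linear backoff
        return wrapper
    return decorator
```

A DB insert decorated with `@retry_on_network_glitch()` would then survive an occasional dropped connection, while a persistent outage would still raise after the last attempt.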
Another option (though I am not sure it has been tested with this type of error) is configuring QL to use async mode; in that case there should already be an automatic retry mechanism for failed payloads, which are stored in a queue.
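For reference, enabling the work queue boils down to setting a few environment variables on the QL containers. The variable names below are assumptions from memory, so please verify them against the QuantumLeap manual before relying on them:

```shell
# Assumed variable names -- check the QuantumLeap docs.
WQ_OFFLOAD_WORK=true   # route inserts through the async work queue
WQ_MAX_RETRIES=3       # retry failed inserts a few times before giving up
REDIS_HOST=redis       # the queue is backed by Redis
```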
Hi @UncleDinoso,
Thanks for the heads-up about BrokenPipe errors. We actually experienced that in one of our commercial QuantumLeap deployments too, which is why we've tried improving the exception handling code in QuantumLeap, as @chicco785 pointed out. Surely we could do more, and in fact we're looking to replace the pg8000 driver with Psycopg, which is supposed to pool connections and so be able to recover sort of automatically from a network failure, i.e. what most likely caused your broken pipe error too.
But keep in mind that if you only have one master DB process and a flaky network (that's the kind of situation we had in the commercial project I mentioned earlier), then no matter how reliable we make QuantumLeap's data ingestion procedure, you could still wind up in a situation where a lot of data can't be saved to the DB. For example, think of a DB connection pool where all the connections get retried and the pool eventually gets depleted because the network is still down, or the DB master process is overloaded and can't accept any more connections.
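That depletion scenario can be simulated in a few lines of plain Python. This is a toy model, not QuantumLeap code: a fixed-size pool whose connections are all tied up in retries leaves nothing for new requests:

```python
import queue

# Toy connection pool with two slots.
pool = queue.Queue()
for i in range(2):
    pool.put(f"conn-{i}")

# While the network is down, both connections are checked out
# and stuck in retry loops...
stuck_in_retry = [pool.get() for _ in range(2)]

# ...so the next incoming request finds the pool depleted.
try:
    pool.get(block=False)
    outcome = "got a connection"
except queue.Empty:
    outcome = "pool depleted"

print(outcome)  # → pool depleted
```

No amount of client-side retrying helps here until connections are returned to the pool, which is why fixing the underlying network or DB overload matters more than the retry logic itself.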
Like @chicco785 said, we've got a solution for this kind of scenario, but it involves deploying an extra QuantumLeap process acting as a work-queue consumer so failed data inserts can be scheduled for retry; this should also help reduce load on the DB server, which could be another reason for the broken pipe errors. Unfortunately, at the moment we don't have Helm charts for this kind of deployment. We'll eventually put them together, but for now we welcome PRs :-)
Hi all, thanks for your replies! I will discuss them with my team. The issue can be closed for now.
Thanks! We are curious about your use case, and if you are interested, we'd be happy to add your company to our list of adopters: https://github.com/orchestracities/ngsi-timeseries-api#adopters
Hi, we got an error in QuantumLeap in different environments and don't know how to fix it.
Environment: Azure and Nutanix Kubernetes, TimescaleDB
Error: Broken pipe, see screenshot (sorry for not posting code; the log is gone)