squigg / azure-queue-laravel

PHP Laravel Queue Driver package for Microsoft Azure Storage Queues

Timeout issues - finding their way back to the application #6

Open judgej opened 6 years ago

judgej commented 6 years ago

Summary of issue

If the queue connection times out after being open for some time without being used, no attempt is made to reconnect when sending a message fails. With Laravel, if a database connection times out, Laravel attempts to detect this, reconnects, then retries the query one more time.
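
For comparison, the database layer's recovery logic is roughly the following (a condensed sketch of how Illuminate\Database\Connection behaves in Laravel 5.x; method names are from the framework, but the body is simplified):

```php
<?php

use Closure;
use Illuminate\Database\QueryException;

// Condensed sketch of Laravel's database retry behaviour: if a query fails
// with an error that matches a known "lost connection" message, reconnect
// and retry the query exactly once before giving up.
trait RetriesLostConnections
{
    protected function run($query, $bindings, Closure $callback)
    {
        try {
            return $this->runQueryCallback($query, $bindings, $callback);
        } catch (QueryException $e) {
            if ($this->causedByLostConnection($e->getPrevious())) {
                $this->reconnect();

                return $this->runQueryCallback($query, $bindings, $callback);
            }

            throw $e;
        }
    }
}
```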

The complete error message, including file & line numbers

I have a model observer that watches for certain events and dispatches a job to handle certain types of event. Occasionally, updating the model will throw this error (surfaced by Eloquent's save() when the observer is triggered):

cURL error 7: Failed connect to myresourcegroup.queue.core.windows.net:443; Connection timed out (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)

Copy of your config/queue.php file

Software versions

This Package: 5.5.2
Laravel: 5.6
PHP: 7.2
Operating System: Linux

Any other information

While the connections are up, everything works fine, but it is the timeouts that are the problem, both when sending on a timed-out connection (expecting an auto-reconnect) and when listening on a long-running process.

My queue workers are restarted every 10 minutes to try to avoid the queue listener timing out. When it does time out, the MS Azure library either throws an exception or passes a message with the timeout error details (not sure which, but I think it is the latter). When this happens, the listener sits listening to a dead connection, and so the queues are not served. That timeout should also be caught, if possible, and a reconnect performed. The queue sender can afford to let the connection drop, so long as it can restart it when needed (like Laravel database connections). But the queue listener must always make sure its connection is up, running, and open.

judgej commented 6 years ago

This is the type of message the queue worker gets (sometimes) when the queue connection drops:

Fail: Code: 500 Value: Operation could not be completed within the specified time. details (if any): <?xml version="1.0" encoding="utf-8"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time. RequestId:789e0000-0000-0036-59dd-d70517000000 Time:2018-04-19T12:51:27.6988826Z</Message></Error>.

I think that is handed over to Laravel as a queue message, which Laravel then attempts to run as a job (which it shouldn't, but Laravel's lack of validation of queue messages is another bugbear of mine). This locks up the queue worker, which can then only be killed from the Linux command line. Not sure what it's doing - blocking while waiting for its non-existent child worker to finish, I suspect.

Anyway, I need to do some more tests of long-running connections, but if what I think is happening is correct, then inspecting the messages that the Azure queue library hands to this connector, and looking for text that indicates the connection has dropped, would allow it to reconnect rather than pass the message back to Laravel as a job.
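
For example, a defensive check along these lines could stop an Azure error payload from ever being dispatched as a job (illustrative only; looksLikeLaravelJob is not part of the package):

```php
<?php

// Illustrative guard: a Laravel queue payload is JSON containing at least a
// "job" key naming the handler class. An Azure error body (XML starting
// with "<?xml", or the "Fail: Code: ..." text) fails this check and could
// be treated as a dropped connection instead of being run as a job.
function looksLikeLaravelJob(string $rawBody): bool
{
    $payload = json_decode($rawBody, true);

    return is_array($payload) && isset($payload['job']);
}
```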

"Fail:\nCode: %s\nValue: %s\ndetails (if any): %s." == AZURE_ERROR_MSG == ServiceException

I guess "details" is the raw XML data returned by the API, and "Value" is the human-readable response phrase.

squigg commented 6 years ago

Thanks for the report - I'll take a look into it this weekend.

judgej commented 6 years ago

Let me know what I can do to help. There are a lot of pins set up in the chain here, and some things just have to be left running for a few hours before the problem shows up. So if there are specific points where it would be helpful for me to capture data, I can do that.

judgej commented 6 years ago

I also think this is something that Laravel should take on board to a certain extent, perhaps allowing the connector to return a "connection lost" exception so that Laravel can ask for a reconnection before trying again. Or something like that.

judgej commented 6 years ago

Noticed the Azure PHP storage libraries have all been bumped a point release in the last day. Not sure if the new version tackles any of the timeout/auto-reconnect issues.

squigg commented 6 years ago

I've had a look into this, and from what you are seeing it appears there may be two different scenarios happening:

  1. cURL is simply failing to connect to the Azure service:

    • cURL raises an exception for error 7 "Couldn't connect"
    • GuzzleHttp raises a ConnectException for this, setting the message to the format cURL error %s: %s (%s)
    • Within the Azure PHP Storage library the promise is rejected; the HTTP response is null as it never connected, so the library re-throws the exception, which makes its way up to your top-level error handler.
  2. The Azure service is returning an HTTP 500 error response for OperationTimedOut

    • cURL completes the request and returns the HTTP 500 error
    • GuzzleHttp completes the promise with the 500 error and HTTP response body
    • Azure PHP Storage library is not expecting a 500 response from the server, and raises a ServiceException (see throwIfError in ServiceRestProxy)
    • Azure PHP Storage library formats the Exception message to the format Fail:\nCode: %s\nValue: %s\ndetails (if any): %s.
    • Laravel receives this Exception in the Worker's getNextJob method, where it is caught and reported. As the error message does not match any of the strings in causedByLostConnection (see the sketch below), the worker does not quit and keeps running.
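
For reference, causedByLostConnection is a plain substring match against a hard-coded list of known database error messages. A shortened sketch of Laravel's Illuminate\Database\DetectsLostConnections trait (the real list is much longer):

```php
<?php

namespace Illuminate\Database;

use Exception;
use Illuminate\Support\Str;

// Shortened sketch of the real trait: the check is a simple substring match
// over hard-coded database error strings, so Azure's "Fail: Code: 500 ..."
// message never matches and the worker carries on regardless.
trait DetectsLostConnections
{
    protected function causedByLostConnection(Exception $e)
    {
        return Str::contains($e->getMessage(), [
            'server has gone away',
            'no connection to the server',
            'Lost connection',
            'Error while sending',
            'is dead or not enabled',
        ]);
    }
}
```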

The Laravel workers operate on a polling frequency rather than maintaining open connections; however, cURL could potentially be holding connections open under the hood, and the default poll every 3 seconds could be sufficient to keep a connection alive. Laravel already re-attempts what it understands to be lost connections for queues, but this doesn't cover this specific scenario.

Options I see are:

  1. I can handle the specific error in the pop method in the AzureQueue and re-throw an Exception that Laravel understands (sketched after this list), which will cause it to kill the Worker and terminate the connection
  2. I can submit a PR to the Laravel Framework to get these messages added to the causedByLostConnection method, which would have the same effect as Option 1
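
A rough shape for Option 1 (a hypothetical sketch only: it assumes the azure-storage-php 1.x ServiceException namespace, and the subclass and message wording are illustrative rather than the package's actual code):

```php
<?php

namespace App\Queue;

use Exception;
use MicrosoftAzure\Storage\Common\Exceptions\ServiceException;
use Squigg\AzureQueueLaravel\AzureQueue;

// Hypothetical subclass: translate Azure's OperationTimedOut ServiceException
// into a message containing "Lost connection", which Laravel's
// causedByLostConnection() does recognise, so the worker stops and can be
// restarted cleanly by a process monitor such as Supervisor.
class ReconnectingAzureQueue extends AzureQueue
{
    public function pop($queue = null)
    {
        try {
            return parent::pop($queue);
        } catch (ServiceException $e) {
            if (strpos($e->getMessage(), 'OperationTimedOut') !== false) {
                throw new Exception('Lost connection: ' . $e->getMessage(), 0, $e);
            }

            throw $e;
        }
    }
}
```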

I'm not sure either will fix your first error, as that seems like a standard connection error. This could be Azure refusing the connection due to too many concurrent requests, or just an occasional network issue.

squigg commented 6 years ago

Also, I am not 100% sure whether the latter HTTP 500 error is formally a timeout / lost connection, so I may get some pushback from the Laravel maintainers if it is just a transient error where the Azure service itself timed out and failed.

judgej commented 6 years ago

Thanks, that's some great analysis, very much appreciated.

I think you may be right about connections being held open. If Laravel is polling the queue every few seconds, and it is opening a new connection each time, then if connections are not closed properly we would certainly have a problem building up. Without restarting the queue workers, it would take anywhere between four and six hours before everything effectively froze. I could kill the workers (I have three of them) with a kill signal, but artisan queue:restart would not work, so I guess they were blocking on something - waiting for a spare connection slot to be freed up, perhaps? I'm only guessing, with limited knowledge on the Azure side.

The long delays I was seeing in pushing to the queues could very well be a symptom of the same problem. If the queue workers have taken all available connection slots, then opening a connection to push a message could be blocking (somewhere in the route to the queue). Then we end up with a kind of deadlock.

If this is what is happening, then restarting the queue workers every ten minutes, which I am doing now, is probably the best workaround for now.

So, the options. I personally have an issue with Laravel's single list of "connection lost" messages for database connections: it is hard-coded and has to cater for a wide range of databases. IMO those lists should live in the individual database connectors, and should be extendable, to provide easy fixes for specific cases and the ability to add messages in other languages (I cannot fathom how the non-English locales are coping with lost connections). So the queue workers also support lost connection handling? I wasn't aware of that. If so, then keeping it in this queue connector makes more sense to me (option 1). This connector knows about Azure, and should be telling the Laravel framework what to do :-)

I'm not an Azure expert, but I can get someone else to monitor our connection pool. If the connections are growing every few seconds, then that would give us some clues about what needs to be addressed in the longer term.

Looking at the Azure code last week, it seemed that just about any failure to talk to the remote queue resulted in a 500 exception. I suppose it is reasonable to always rethrow this as "connection lost", as restarting the queue worker process is likely to clear out any cruft that has been building up.

judgej commented 6 years ago

Just looking at the lost connection detection in illuminate/queue, and I realise it just uses the database lost connection detection:

https://github.com/illuminate/queue/blob/9c063c804f6ab0596c68c41830496029fbca31f7/Worker.php#L9

That's a bit WTF, TBH. It may be relevant for the database queue connector, and even then only for error messages that it knows about (it lacks many), but it is completely inappropriate for non-database connectors. This is why there will be push-back on adding to that list - it's a list of database errors.
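
The relevant lines in that file, condensed (Laravel 5.x; the trait really does come from the database package):

```php
<?php

namespace Illuminate\Queue;

use Illuminate\Database\DetectsLostConnections;

// Condensed from illuminate/queue's Worker.php: the queue worker reuses the
// *database* lost-connection detector, hence a hard-coded list of database
// error strings being applied to queue exceptions.
class Worker
{
    use DetectsLostConnections;

    // ...
}
```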

michliu commented 4 years ago

I had the same problem and wanted to know how to solve it.

Same 500 timeout error:

Fail: Code: 500 Value: Operation could not be completed within the specified time. details (if any): <?xml version="1.0" encoding="utf-8"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time. RequestId:ffa1d05b-c003-00c1-205f-105061000000 Time:2020-04-12T00:14:23.0928233Z</Message></Error>.

Error file: /home/site/wwwroot/vendor/microsoft/azure-storage-common/src/Common/Internal/ServiceRestProxy.php