nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License
1.37k stars · 134 forks

Azure MarketPlace cannot provide stable service #38

Open lan2720 opened 9 months ago

lan2720 commented 9 months ago

We have recently deployed the LLMSherpa service through Azure Marketplace and are experiencing an intermittent issue. Specifically, the API suddenly stops responding to any requests, failing to return results. This issue persists even after restarting both the Azure machine and the service itself. However, we have noticed that the service spontaneously resumes normal operation the following day. Additionally, we are unable to log into the deployed machine to check if the service is functioning normally, and we cannot access any logs.

Has anyone else encountered similar issues with services deployed via Azure Marketplace? If so, could you please share any insights or solutions to this problem? Additionally, we are interested in knowing if there are alternative, more stable API services available that we could consider.

Thank you for your assistance.

ansukla commented 9 months ago

Thanks for the note. We will look into this and get back to you ASAP.

Regards, Ambika


lan2720 commented 9 months ago

@ansukla Thank you for looking into this. Our Standard E4ds v4 server (4 vCPU, 32 GiB RAM, Ubuntu 20.04) becomes unresponsive after several thousand TCP connections, despite our using a PoolManager connection pool on the client side. It seems we're hitting a TCP connection limit. Could this be the case? Are there default TCP connection limits for this server type, and how can they be adjusted?

Also, is there any server-side mechanism to automatically manage these connections? Any guidance to resolve this bottleneck would be highly appreciated.

kiran-nlmatics commented 9 months ago

@lan2720, a TCP connection limit would only be hit if the number of OPEN connections exceeds it. Another possibility is that the client did not tear down its connections properly, so on the server side they stayed in the ESTABLISHED state until the TCP keep-alive timeout kicked in (two hours by default). That would match your observation that the service became responsive again after some time.

However, the following statement makes me wonder: "This issue persists even after restarting both the Azure machine and the service itself." A system reboot should reset all connections.
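For reference, the two-hour figure above is the Linux default TCP keep-alive idle time (net.ipv4.tcp_keepalive_time). A minimal sketch of how a process can opt into keep-alive on its own sockets and shorten the probes, assuming Linux for the TCP_KEEPIDLE family of options:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Opt into TCP keep-alive for this socket.
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-only knobs that override the 2-hour default: start probing after
# 60 s of idleness, probe every 10 s, give up after 3 failed probes.
if hasattr(socket, "TCP_KEEPIDLE"):
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)

keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
```

With these settings a peer that silently vanishes is detected in about 90 seconds instead of two hours.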

lan2720 commented 9 months ago

@kiran-nlmatics Restarting resolves the issue most of the time. Only on the night of December 22 did our CVM fail to recover after a restart, a deviation from earlier and later instances where restarts fixed similar issues. The cause remains undetermined, suggesting a possible CVM fault.

lan2720 commented 9 months ago

@kiran-nlmatics I also want to determine the proper method for closing the connection. I noticed that you receive responses from the API as follows:

self.api_connection = urllib3.PoolManager()
parser_response = self.api_connection.request("POST", self.parser_api_url, fields={'file': pdf_file})
return parser_response

There are no special teardown lines, and I follow the same approach.

lan2720 commented 9 months ago

@kiran-nlmatics In addition, I set PoolManager(num_pools=1, maxsize=1, block=True) for each of my service processes. However, the sudden unresponsiveness still occurs.
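On closing connections properly: with urllib3 the default preload_content=True reads the whole body up front and returns the socket to the pool automatically, so the three lines quoted above need no explicit close; only streamed responses (preload_content=False) must be released by hand. A self-contained sketch, using a throwaway local server purely so the example runs without network access:

```python
import http.server
import threading

import urllib3

# Tiny local stand-in for the parser endpoint (illustration only).
class PingHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # silence per-request logging
        pass

srv = http.server.ThreadingHTTPServer(("127.0.0.1", 0), PingHandler)
threading.Thread(target=srv.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{srv.server_address[1]}/"

pool_mgr = urllib3.PoolManager(num_pools=1, maxsize=1, block=True)

# Default path: body preloaded, socket goes back to the pool by itself.
resp = pool_mgr.request("GET", url)
eager_body = resp.data

# Streaming path: the socket stays checked out until release_conn().
resp2 = pool_mgr.request("GET", url, preload_content=False)
try:
    streamed_body = resp2.read()
finally:
    resp2.release_conn()  # hand the socket back for reuse

srv.shutdown()
```

If a streamed response is never released, the pooled socket stays checked out and, with maxsize=1 and block=True, the next request blocks indefinitely.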

kiran-nlmatics commented 9 months ago

@lan2720 I tried to recreate the issue on a similar AMA setup, with a single process pumping a couple thousand parser requests to the server endpoint. I couldn't find a bottleneck on the server: no sockets lingered in the ESTABLISHED state, and sockets in TIME_WAIT returned to the system pool after 60 seconds, the system default.

Could you provide a bit more detail on your setup? e.g.

  1. After approximately how many requests are you experiencing this issue?
  2. Is the ingress traffic to the VM, originating from a single endpoint?

Next time the issue resurfaces, it would be great if you could check the output of this request:

    curl -X 'GET' 'http://<server_ip>:5000/api/healthz' -H 'accept: text/plain' -m 1
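For anyone scripting that probe, a rough Python equivalent of the curl command (the healthz function name is ours; URL shape, port, and 1-second timeout match the command above):

```python
import urllib3

def healthz(server_ip: str) -> bool:
    """Return True only if /api/healthz answers 200 within 1 second."""
    http = urllib3.PoolManager()
    try:
        r = http.request(
            "GET",
            f"http://{server_ip}:5000/api/healthz",
            headers={"accept": "text/plain"},
            timeout=urllib3.Timeout(total=1.0),
            retries=False,  # fail fast instead of retrying a hung server
        )
        return r.status == 200
    except urllib3.exceptions.HTTPError:
        return False
```

A hung server shows up as a timeout, which this helper reports as False instead of blocking the caller.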

lan2720 commented 9 months ago

@kiran-nlmatics For your questions:

  1. The number of requests it takes before the service becomes unresponsive varies. I've experienced occasions where the machine, after a reboot, handled over 10,000 requests without any issues. However, there have also been instances where it started showing errors after around 5,000 requests post-reboot.

  2. The Azure machine we're using is solely dedicated to hosting the llmsherpa service; no other services are deployed on it. From the Azure backend monitoring, we can only see the Network In/Out Total, measured in MiB, without much other detailed information. Additionally, the client IPs we use are limited to a few fixed ones, not exceeding five.

I ran the command against my Azure VM endpoint; the output was:

curl: (28) Operation timed out after 1001 milliseconds with 0 bytes received

The command telnet <server_ip> 5000 successfully establishes a connection.

I also tried your default endpoint with curl -X 'GET' 'http://readers.llmsherpa.com/api/healthz' -H 'accept: text/plain' -m 1, which returned:

<html>
<head><title>308 Permanent Redirect</title></head>
<body>
<center><h1>308 Permanent Redirect</h1></center>
<hr><center>nginx</center>
</body>
</html>
lan2720 commented 9 months ago

@kiran-nlmatics @ansukla We compared calling the default endpoint (readers.llmsherpa.com) with calling the Azure CVM's IP:port, both through PoolManager, watching netstat -an | grep <ip> from my own Mac. With the default endpoint, the local TCP port number remains stable and changes only after a considerable time, indicating the connection pool is being reused successfully. With the Azure endpoint, however, the local TCP port number changes frequently, suggesting the connection pool is not being reused effectively.

We suspect the Azure CVM does not support keep-alive, or has a firewall policy that blocks frequent requests from a fixed IP. We would like to adjust these settings, but we couldn't find the appropriate options in the Azure management console. Do you have any insights on how to change these configurations?
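A complementary way to quantify the reuse inferred from netstat: urllib3's HTTPConnectionPool keeps two counters, num_connections (sockets actually opened) and num_requests (requests served), so effective keep-alive shows up as many requests per opened socket. A self-contained sketch against a throwaway local keep-alive server (the local server exists only so the example runs without network):

```python
import http.server
import threading

import urllib3

class KeepAliveHandler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 + Content-Length => keep-alive

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # silence per-request logging
        pass

srv = http.server.ThreadingHTTPServer(("127.0.0.1", 0), KeepAliveHandler)
threading.Thread(target=srv.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{srv.server_address[1]}/"

pool_mgr = urllib3.PoolManager(num_pools=1, maxsize=1, block=True)
for _ in range(5):
    pool_mgr.request("GET", url)

pool = pool_mgr.connection_from_url(url)
# With working keep-alive, all 5 requests ride on a single socket.
opened, served = pool.num_connections, pool.num_requests
srv.shutdown()
```

Running the same loop against the Azure endpoint and seeing opened climb toward served would confirm that keep-alive is not surviving the path to the VM.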

kiran-nlmatics commented 9 months ago

@lan2720, we also noticed the incremental port usage when connecting to the Azure VM. I believe this is due to how NAT-ing works in the AVM between the local IP and the public IP associated with it. That said, incremental port usage should not become an issue as long as ports are cleared and returned to service promptly. We are working on a potential fix and will update you shortly.
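As a footnote on "ports being cleared": on Linux the relevant knobs are a handful of sysctls (ephemeral port range, TIME_WAIT reuse, FIN_WAIT_2 timeout). A hedged, Linux-only sketch for reading them on the VM; values vary by kernel, and the files simply won't exist on other platforms:

```python
from pathlib import Path

def sysctl(name: str) -> str:
    """Read a sysctl via /proc/sys; returns a placeholder if absent."""
    p = Path("/proc/sys") / name.replace(".", "/")
    return p.read_text().strip() if p.exists() else "(unavailable)"

port_range = sysctl("net.ipv4.ip_local_port_range")  # ephemeral port pool
tw_reuse = sysctl("net.ipv4.tcp_tw_reuse")           # reuse TIME_WAIT for outbound
fin_timeout = sysctl("net.ipv4.tcp_fin_timeout")     # seconds in FIN_WAIT_2
```

A small ephemeral range combined with rapid port churn is one plausible way for a NAT-ed VM to run out of usable ports before old ones are recycled.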

lan2720 commented 8 months ago

@kiran-nlmatics Hi any progress so far? We are eagerly anticipating the resolution of this issue, as it would greatly assist our services.

kiran-nlmatics commented 8 months ago

Hi @lan2720, we are in the process of releasing a new version of the Managed Application; it is undergoing Azure verification at the moment and should be available in a day or two.

kiran-nlmatics commented 8 months ago

@lan2720 The new version is available in the Azure Marketplace. Please let us know how it goes.

lan2720 commented 8 months ago

@kiran-nlmatics Great work, and thank you! We will try it right away and give you feedback.

lan2720 commented 8 months ago

@kiran-nlmatics We encountered the following error while deploying a new service, and we have also reached out to Azure technical support. However, their resolution process is usually slow, so we would like to consult with you to see if you can offer any assistance.

[three screenshots of the deployment error attached]

kiran-nlmatics commented 8 months ago

@lan2720, I am not sure about this error code. A rather brute-force approach would be to manually delete the previous AMA and create it again.

ansukla commented 8 months ago

@lan2720 The server is now fully open source. See the instructions here to create and self-host your own private server: https://github.com/nlmatics/nlm-ingestor. The public open server and the private paid server will not be updated with the latest code and will eventually be stopped.