Connection being reset by peer

jpbalarini commented 6 months ago

I'm running nlm-ingestor in a pipeline that processes thousands of documents in total (~100 in parallel). I created multiple nlm-ingestor services behind a load balancer to distribute the load. But even with a lot of services, I randomly get this error from llmsherpa:

{"reason":"('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))","status":"fail"}

And this is the stack trace:

...
File "/home/ubuntu/.local/lib/python3.10/site-packages/llmsherpa/readers/file_reader.py", line 74, in read_pdf
    blocks = response_json['return_dict']['result']['blocks']
KeyError: 'return_dict'

So what is happening is that, for some reason, the nlm-ingestor service drops the connection (maybe there are too many connections?) and llmsherpa doesn't get a proper response_json with a return_dict value.

Have you encountered this issue? Any idea on how to properly debug what could be happening? Thanks!
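
As a client-side stopgap while the root cause is investigated, the failing call can be retried with exponential backoff. The sketch below is only an illustration: `call_with_retries` and its parameters are hypothetical names, not part of llmsherpa, and it assumes the failure surfaces either as the `KeyError` shown in the stack trace or as an `OSError`-derived connection error.

```python
import time


def call_with_retries(fn, attempts=4, base_delay=0.5,
                      retry_on=(KeyError, OSError), sleep=time.sleep):
    """Call fn(); on a retryable error, back off exponentially and try again.

    Hypothetical helper, not part of llmsherpa. KeyError covers the
    missing 'return_dict' case from the stack trace; OSError covers
    connection errors such as ConnectionResetError.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * (2 ** attempt))
```

With llmsherpa this could wrap the read call, for example `call_with_retries(lambda: reader.read_pdf(path))`, where `reader` is an existing `LayoutPDFReader`. Retrying hides the symptom only; it does not address why the service resets the connection.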

ansukla commented 6 months ago

Yes, I think a good idea would be to front the service with nginx or something similar. I believe @kiran-nlmatics faced it on our public server and solved it. This happens due to connection exhaustion in Flask. It will take some work and time to push a fix to the repo, but in the meantime you can create your own installation setup with nginx and a gunicorn backend. Also adding @kiran-nlmatics to the thread.
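
For the gunicorn side of such a setup, a configuration file might look roughly like the sketch below. This is not the configuration used on the public server: every value is an assumption to be tuned, and the file name `gunicorn.conf.py` and the bind address are illustrative. Only standard gunicorn settings are used.

```python
# gunicorn.conf.py -- illustrative values only, not the project's actual config.
import multiprocessing

# Listen on localhost only; nginx (or another reverse proxy) is assumed to
# accept client connections and forward them here.
bind = "127.0.0.1:8000"

# Common starting point from the gunicorn docs: (2 x cores) + 1 workers.
workers = multiprocessing.cpu_count() * 2 + 1

# PDF parsing can be slow; raise the worker timeout above the 30 s default
# so long requests are not killed mid-parse (assumed value).
timeout = 120

# Seconds to wait for the next request on a keep-alive connection.
keepalive = 5

# Maximum number of pending connections waiting to be accepted.
backlog = 2048
```

The file would be passed with `gunicorn -c gunicorn.conf.py <module>:<app>`, where the WSGI module and app name for nlm-ingestor are not given in this thread and need to be filled in from the repository.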

kiran-nlmatics commented 6 months ago

I observed a similar issue with our Azure Marketplace offering earlier: connections from the client dropped randomly while a PDF was being parsed, and the gunicorn backend got stuck in a spawning loop because of how NAT-ing was happening in these VMs. As suggested by @ansukla, if you put the multiple servers behind a load balancer which can reverse notify the server (gunicorn or similar backend?), the issue will not occur.

In our FREE Server (a K8S cluster with an appropriate load balancer), we never faced this specific connection reset issue.

jpbalarini commented 6 months ago

Thanks @kiran-nlmatics. Can you share the specs of that K8S cluster, such as the CPUs and memory you were using for the free endpoint? It might be useful for our use case.

nlmatics / nlm-ingestor

Connection being reset by peer #5