jina-ai / dalle-flow

🌊 A Human-in-the-Loop workflow for creating HD images from text
grpcs://dalle-flow.dev.jina.ai

Docker error #23

Closed. giux78 closed this issue 1 year ago.

giux78 commented 2 years ago

I have this problem when running Docker with the command `docker run -p 51005:51005 -v $HOME/.cache:/root/.cache --gpus all jinaai/dalle-flow` on an AWS g5.xlarge with Deep Learning AMI (Amazon Linux 2) Version 61.2.

I have done a lot of experiments and fixed many small problems, but now I'm not able to figure out how to move forward. In any case, thanks a lot for this wonderful project.

SERVER:

```text
Done. 0:0:0 device count: 1
DEBUG dalle/rep-0@12 start listening on 0.0.0.0:60127  [05/23/22 08:25:29]
DEBUG dalle/rep-0@1 ready and listening  [05/23/22 08:25:29]

╭───── 🎉 Flow is ready to serve! ──────╮
│  🔗  Protocol    GRPC                 │
│  🏠  Local       0.0.0.0:51005       │
│  🔒  Private     172.17.0.2:51005    │
│  🌍  Public      54.161.222.71:51005 │
╰───────────────────────────────────────╯

DEBUG gateway/rep-0@18 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 1/3. Trying next replica, if available.  [05/23/22 08:26:08]
DEBUG gateway/rep-0@18 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 1/3. Trying next replica, if available.
DEBUG gateway/rep-0@18 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 2/3. Trying next replica, if available.
DEBUG dalle/rep-0@12 recv DataRequest at / with id: 3dfaebf6ef3e49a3977dd7dfe9eb6b27  [05/23/22 08:26:08]
DEBUG gateway/rep-0@18 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 2/3. Trying next replica, if available.
DEBUG gateway/rep-0@18 GRPC call failed, retries exhausted
DEBUG gateway/rep-0@18 GRPC call failed, retries exhausted
ERROR gateway/rep-0@18 Error while getting responses from deployments: failed to connect to all addresses |Gateway: Communication error with deployment at address(es) 0.0.0.0:50029. Head or worker(s) may be down.  [05/23/22 08:26:08]
```

CLIENT:

```text
ERROR GRPCClient@6813 gRPC error: StatusCode.UNAVAILABLE failed to connect to all addresses |Gateway: Communication error with deployment at address(es) 0.0.0.0:50029. Head or worker(s) may be down.  [05/23/2022 08:26:08 AM]
The ongoing request is terminated as the server is not available or closed already.
```

spuliz commented 2 years ago

I have the same issue 😞 I've spent 3 days trying to fix it, but with no luck.

JoanFM commented 2 years ago

Hey @giux78, are you sure Docker is able to load all the Executors? Make sure to give it enough resources; you could first try a lower number of replicas for each Executor.
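
For reference, here is a minimal sketch of what a single replica per Executor looks like with Jina's Python API. `MyExecutor` below is a placeholder, not one of dalle-flow's actual Executors, and the same `replicas` field can be lowered per Executor in the project's flow YAML:

```python
from jina import Executor, Flow, requests


class MyExecutor(Executor):
    """Placeholder for a heavy model Executor (e.g. the DALL·E one)."""

    @requests
    def work(self, docs, **kwargs):
        return docs


# One replica per Executor keeps memory pressure down while models load.
flow = Flow(port=51005).add(name='dalle', uses=MyExecutor, replicas=1)

with flow:
    flow.block()  # serve until interrupted
```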

nathanmargaglio commented 2 years ago

I'm seeing the same kind of issue on an AWS g5.xlarge instance running Deep Learning AMI (Ubuntu 18.04) Version 61.0. GPUs appear to be accessible from within Docker, CUDA versions all match (11.6), and rebuilding with various configuration changes has no effect.

@JoanFM Care to explain what you mean in more detail? The g5.xlarge should have enough resources, so if there's something else that needs to be configured, it's not clear from the README.

Edit: I'll also add that I've tried this with Amazon Linux 2 Deep Learning AMIs without luck as well. I wasn't able to take them as far as the Ubuntu image before running into issues, though, so I figured the Ubuntu image is more suitable.

Edit: Although it may not be useful for troubleshooting, I want to add that installing DALLE-Flow directly on the server has also failed. This is true for both Ubuntu and Amazon Linux 2 AMIs, as well as p2 instances.

I'm no deep learning guru, but I've never had this much trouble with deep learning on AWS/Docker before, so I imagine there are some specifics that could be pointed out to make this process easier. Perhaps someone who has successfully set this up on AWS could update the docs with more specific instructions?

hanxiao commented 2 years ago

Did you try building the Docker image and running it via a Docker container? I just rebuilt and ran it without any issue.

https://github.com/jina-ai/dalle-flow#run-in-docker

```bash
git clone https://github.com/jina-ai/dalle-flow.git
cd dalle-flow

docker build --build-arg GROUP_ID=$(id -g ${USER}) --build-arg USER_ID=$(id -u ${USER}) -t jinaai/dalle-flow .

docker run -p 51005:51005 -v $HOME/.cache:/home/dalle/.cache --gpus all jinaai/dalle-flow
```

nathanmargaglio commented 2 years ago

@hanxiao

Yup, I've tried it a number of times, double-checking that I'm doing everything correctly. I can get it to build and run fine, and the GPU appears to be accessible from within the running container (i.e., by using nvidia-smi in the container). But every time I try to communicate with it from a client (i.e., following the Google Colab example), I'd get some variation of:

```text
CLIENT ERROR GRPCClient@6813 gRPC error: StatusCode.UNAVAILABLE failed to connect to all addresses |Gateway: Communication error with deployment at address(es) 0.0.0.0:50029. Head or worker(s) may be down.  [05/23/2022 08:26:08 AM]
The ongoing request is terminated as the server is not available or closed already.
```

(like the original issue creator). I noticed that the address it reports is different on different runs, but if I recall correctly, the other addresses would eventually appear before the whole thing quits. And just to be clear: it gets to a point where it appears to be running correctly, i.e., presenting 🎉 Flow is ready to serve! and showing the URLs. It's only as soon as I make a request that it shows these errors. I wouldn't be surprised if there's some network error I'm missing, but I don't think there's much to do on my end other than exposing the 51005 port.
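
For reference, the request I'm making is essentially the DocArray pattern from the Colab example, something like the sketch below (the server address is a placeholder for my EC2 host's public IP):

```python
from docarray import Document

# Placeholder: the EC2 host's public IP, with port 51005 exposed.
server_url = 'grpc://<ec2-public-ip>:51005'

prompt = 'an oil painting of a humanoid robot playing chess'

# .post() sends the prompt to the Flow gateway over gRPC; this is the
# call that dies with StatusCode.UNAVAILABLE.
doc = Document(text=prompt).post(server_url, parameters={'num_images': 2})
doc.summary()
```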

I'm sure I'm doing something incorrectly, but I've gone through the process of spinning up a new EC2 instance with a new EBS volume, building the Docker image, running it and getting an error, troubleshooting, and retrying with some small variation at least a dozen times (with different instance types, AMIs, EBS options, etc.), so I've exhausted my capacity to troubleshoot this thing further with regard to the infrastructure. I'd be willing to troubleshoot what's happening in the container to cause these errors instead, but without some specific direction I wouldn't have the time to open that can of worms.

Are there any more details you can provide on the setups you've had success with? I saw in the other issue that you had success with a p2.8xlarge instance (which I believe I tried as well), so knowing the details of such a setup would probably point out what I'm doing wrong. I think I've tried every Deep Learning/GPU AMI available for both Amazon Linux 2 and Ubuntu (with varying degrees of success), but if you're using a different one, please let me know.

hanxiao commented 2 years ago

Closing for now, as we are trying to provide an auto-built Docker image in the next few hours. Feel free to reopen the issue if the new image still doesn't work.

AntonyLeons commented 2 years ago
[screenshot: Screenshot_6]

You may want to reopen this issue: I'm seeing this with the latest Docker image. This is, however, under WSL2.

hanxiao commented 2 years ago

@AntonyLeons From the log, the service started successfully, and it looks like receiving, handling, and completing requests all went well. Did you get the returned result on the client side?

mohammedalsayegh commented 2 years ago

I'm running Docker on WSL2. There is an issue on the client side, with the message that the server is not available or already closed.

[image]

I don't have any issue with connecting to 'grpc://dalle-flow.jina.ai:51005'.

[image]

And there is no indication of an error message on the server side.

[image]

AntonyLeons commented 2 years ago

Yep, this is the error I got, so this is caused by an out-of-memory issue, but it's system memory, not VRAM. This happened to me with 16 GB allocated; 24 GB seems to work for me, though it still complains about memory.
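
If it helps anyone: the WSL2 memory cap can be raised with a `.wslconfig` in the Windows user profile. The values below are just what worked for me, not project defaults:

```ini
# %UserProfile%\.wslconfig — run `wsl --shutdown` afterwards so it takes effect
[wsl2]
memory=24GB   # cap for the WSL2 VM (Docker Desktop runs inside it)
swap=8GB      # optional swap to absorb model-loading spikes
```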



mohammedalsayegh commented 2 years ago

> Yep, this is the error I got, so this is caused by an out-of-memory issue, but it's system memory, not VRAM. This happened to me with 16 GB allocated; 24 GB seems to work for me, though it still complains about memory.

I am running it on an M40 with 24 GB of VRAM as a secondary graphics card, plus 128 GB of system memory. That should be sufficient, as no other tasks are taking place.

I'm running it in WDDM mode, and it appears correctly in the subsystem:

[image]

mohammedalsayegh commented 2 years ago

I think it was a mistake on my part. In Python on Windows, I had set the local IP to 0.0.0.0, which does not resolve as a local address and results in a failed transmission. After changing it to 127.0.0.1, it works.
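
In other words, the fix was only the address on the client side; a sketch, assuming the DocArray client pattern from the README:

```python
from docarray import Document

# 0.0.0.0 is a bind-all address for servers; as a client target on
# Windows it doesn't resolve to the local machine. Dial loopback instead.
server_url = 'grpc://127.0.0.1:51005'

doc = Document(text='a cat sitting on a windowsill').post(
    server_url, parameters={'num_images': 2}
)
```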

[image]

It took 6 minutes on the M40:

[image]

AshishSardana commented 1 year ago

Using `grpc://127.0.0.1:51005` instead of `grpcs://` can also help in resolving network-protocol-related errors.
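
For clarity, a sketch of the distinction: the local Docker deployment serves plain gRPC, while `grpcs://` attempts a TLS handshake, so a `grpcs://` URL pointed at it fails.

```python
# Plain gRPC for the local Docker deployment (no TLS configured):
server_url = 'grpc://127.0.0.1:51005'

# grpcs:// (gRPC over TLS) is for TLS-terminated endpoints, e.g. the
# hosted demo mentioned at the top of this repo:
# server_url = 'grpcs://dalle-flow.dev.jina.ai'
```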

delgermurun commented 1 year ago

I believe this issue has been resolved. Feel free to reopen if the problem occurs again.