jina-ai / dalle-flow

🌊 A Human-in-the-Loop workflow for creating HD images from text
grpcs://dalle-flow.dev.jina.ai

"Diffusion" executor not starting #80

Closed knorr3 closed 2 years ago

knorr3 commented 2 years ago

Hello, I am trying to run the dalle-flow server container on OpenShift with an Nvidia A40 GPU. After waiting 10 minutes for dalle and diffusion, the diffusion executor terminates with a timeout. I have tried the latest container image and also tried installing the latest version manually inside the container image. Another problem is that the diffusion executor does not emit a single error or debug message.

Does anyone have the same problem, or any idea how I can debug this? Thank you very much.

delgermurun commented 2 years ago

@knorr3 Could you try adding the timeout_ready: -1 setting (as in the dalle executor) to the diffusion executor in flow.yml?
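
For reference, a minimal sketch of what that change might look like in flow.yml. The `uses` path is a placeholder assumption; keep whatever your flow.yml already has for that entry. As far as I know, timeout_ready is given in milliseconds and -1 tells Jina to wait indefinitely for the executor to become ready, which would remove the 10-minute cutoff described above.

```yaml
executors:
  - name: diffusion
    # Placeholder path (assumption): keep the value already present in your flow.yml.
    uses: executors/glid3/config.yml
    # -1 disables the readiness timeout, so the Flow waits indefinitely for this executor.
    timeout_ready: -1
```

Note that this only removes the timeout. If the executor never becomes ready for another reason, the Flow will simply wait forever instead of failing.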

knorr3 commented 2 years ago

Currently trying that, thank you. But why would diffusion take so long? I thought that dalle would take ~8 minutes to start up because of the download.

delgermurun commented 2 years ago

I'm not sure either.

If you run it with Docker, the diffusion model weights are already downloaded, so it shouldn't take that long. Maybe something is still being downloaded from the internet.

knorr3 commented 2 years ago

To be honest, I don't think this is the problem. I have now been waiting for 45 minutes. The download speed should be fast enough at ~20 MB/s :-).

delgermurun commented 2 years ago

You can try starting only diffusion by commenting out the other executors and anything else that isn't relevant in flow.yml. That way you'll at least have less clutter on the console, and you might catch an error log.

knorr3 commented 2 years ago

Did that, but there's not a single error message. Only some DeprecationWarnings about PIL.Nearest.

knorr3 commented 2 years ago

Found the issue. Jina requires a lot of system memory (not VRAM), and the container ran into an OOM. Somehow OpenShift didn't tell me this, so I searched for the error in the wrong place. I increased the memory from 4GB to 32GB and now it starts.
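
For anyone hitting the same thing: on OpenShift (as on plain Kubernetes) the memory limit is normally set through resources.limits.memory in the pod spec. Below is a rough sketch of such a fragment. The container name and image are placeholders, not taken from my actual deployment, and writing the 32GB as 32Gi is an assumption about the unit.

```yaml
# Fragment of a Deployment pod template; names and image are placeholders (assumptions).
spec:
  template:
    spec:
      containers:
        - name: dalle-flow
          image: jinaai/dalle-flow:latest
          resources:
            requests:
              memory: "32Gi"
            limits:
              # System RAM limit for the container; this is separate from GPU memory (VRAM).
              memory: "32Gi"
              # Standard resource name exposed by the NVIDIA device plugin.
              nvidia.com/gpu: 1
```

Setting the request equal to the limit is just one option; the limit is the value that matters for the OOM.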