[Issue]: Very slow API response in comparison to Automatic1111

blacklig commented 1 year ago

Issue Description

Hi!

First of all, thanks vladmandic for great fork of Automatic1111. Your build is generally better and I think also faster than A1111.

And now if anyone can help me with my issue:

In one task I cannot get anywhere close in performance as from A1111, it is getting response from API endpoint img2img.In A1111 I am getting response back like "instantly", I would say 100-200ms and response is getting back to me from server, on this fork it is like 1 second hang there before I get response and I really dont know why and spent some time on it and still cannot find you why..

I describe it very primitively - In one terminal window I have running instance of Stable Diffusion in this implementation Automatic and and on another window I have postman with payload. Once I send it, I see in terminal like that animation of image being generated and once it finishes (in like second and half on 3090), then it adds also another second I would say before postman gets result. (actually in script it is same result, so nothing wrong with postman). But when I do exactly the same thing with Automatic1111 instance, sending same prompt, generally on the same server settings etc, I am getting response from server almost instantly. Like once that animation shows it is finished, like after 100-200ms I am getting response back to postman or into script.

So somehow I am getting like +1s on every response from this fork and I really dont know why. Actually image generation is bit faster on this fork than on A1111, probably due to properly installed libraries etc but this issue totally kills it.

Only difference I would say is in this fork there is default working format .jpeg so response gets back in jpeg and probably there is conversion from PNG but no way it could get 1s on such small images I am using.. so even though A1111 is sending back 200kb png it is way faster than automatic with 50 kb jpeg.

Endpoint: http://127.0.0.1:7860/sdapi/v1/img2img

Payload I am sending is this:

{ "init_images": [ "/9j/4AAQSkZJRgABAQAA...." ], "resize_mode": 0, "denoising_strength": 0.75, "image_cfg_scale": 7, "inpainting_fill": 0, "inpaint_full_res": true, "inpaint_full_res_padding": 0, "inpainting_mask_invert": 0, "initial_noise_multiplier": 0, "prompt": "beba", "styles": [ "" ], "seed": -1, "subseed": -1, "subseed_strength": 0, "seed_resize_from_h": -1, "seed_resize_from_w": -1, "sampler_name": "Euler", "batch_size": 1, "n_iter": 1, "steps": 50, "cfg_scale": 7, "width": 512, "height": 512, "restore_faces": false, "tiling": false, "do_not_save_samples": false, "do_not_save_grid": false, "negative_prompt": "", "eta": 0, "s_min_uncond": 0, "s_churn": 0, "s_tmax": 0, "s_tmin": 0, "s_noise": 1, "override_settings": {}, "override_settings_restore_afterwards": true, "script_args": [], "sampler_index": "Euler", "include_init_images": false, "script_name": "", "send_images": true, "save_images": false, "alwayson_scripts": {} }

and bonus question: is it also possible to send Image as link on local drive or something like that instead of base64? I saw in webui using WS this actually is exactly like this but somehow cannot get it work via this api.

Version Platform Description

Ubuntu, RTX 3090

vladmandic commented 1 year ago

native format in memory is not png, its just raw - so converting to either png or jpeg is about the same. some params you're sending make no sense, for example do_not_save_samples is not a valid request param, its calculated from save_images. and you're sending both sampler_name and sampler_index - its one or the other. plus Euler is disabled by default (because its ooooold and pointless compared to any newer ones), so its likely causing a fallback to some other sampler - and that may cause some delay.

first thing, i'd reduce the json payload to absolute minimum and use values that are known to be good and see if that helps.

blacklig commented 1 year ago

Yeah, you are right that my payload was mess and I thought of it as well that I should probably make it bare minimum before posting here.. I basically just took what is API docs giving.. Anyway, now using this prompt:

const payload = {
    "init_images": [
        b64
    ],
    "prompt": "forest",
    "steps": 30,
    "cfg_scale": 7,
    "width": 512,
    "height": 512,
}

where b64 is basically that base64 encoded image and issue is the same. Even if I put there only like 1 step so it is basically instant on server side, still I wait around 1 second before getting that response back.

Right now I am trying to see how it goes in WebUI and also seems to me that is affected. I am just doing img to img, some simple 50kb 512x512 image, only 1 step, DMP++ 2M Karras.. takes 0.3 s in Automatic 1111 and 1.3 s in yours. See images below

I only see in that results that VAE is turned on on that instance with yours fork but I think it wont be it.. but I might even try to turn it off

vladmandic commented 1 year ago

i'll try to reproduce. in the meantime, can you start server with --debug command line flag so we can see between which ops is the time being spent?

blacklig commented 1 year ago

Very good idea to start it with debug!

So this already puts some light on what could be issues, maybe all those control nets etc although I dont have them turned on, at least I think so. . But who knows, try to trun them off somehow, but maybe just uninstall? That I wouldnt like to :) Here is like what gets out of yours implemetation after that generation with only 1 step (and VAE is turned off and it made no difference.)

15:53:54-543678 DEBUG gc: cuda {'ram': {'used': 4.95, 'total': 31.31}, 'gpu': {'used': 2.67, 'total': 23.69}, 'retries': 0, 'oom': 0} 15:53:54-544744 DEBUG img2img: task(5j4a2c73ze43ntb)|0|||[]|<PIL.Image.Image image mode=RGBA size=512x512 at 0x7F9AE11BC070>|None|None|None|None|None|None|1|0|4|0|1|False|False|1|1|6|1.5|0.75|-1.0|-1.0|0|0|0|F alse|0|512|512|1|0|0|32|0||||[] 15:53:54-547440 DEBUG Script process: Tiled Diffusion 15:53:54-548051 DEBUG Script process: Tiled VAE 15:53:54-548702 DEBUG Script process: Dynamic Thresholding (CFG Scale Fix) 15:53:54-549357 DEBUG Script process: Steps animation 15:53:54-549941 DEBUG Script process: ControlNet 15:53:54-624677 DEBUG Script before-process-batch: Tiled Diffusion 15:53:54-625458 DEBUG Script before-process-batch: Tiled VAE 15:53:54-626071 DEBUG Script before-process-batch: Dynamic Thresholding (CFG Scale Fix) 15:53:54-626726 DEBUG Script before-process-batch: Steps animation 15:53:54-627331 DEBUG Script before-process-batch: ControlNet 15:53:54-628010 DEBUG Script process-batch: Tiled Diffusion 15:53:54-628702 DEBUG Script process-batch: Tiled VAE 15:53:54-629282 DEBUG Script process-batch: Dynamic Thresholding (CFG Scale Fix) 15:53:54-629926 DEBUG Script process-batch: Steps animation 15:53:54-630514 DEBUG Script process-batch: ControlNet 100%|██████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 19.27it/s] 15:53:54-889970 DEBUG gc: cuda {'ram': {'used': 4.97, 'total': 31.31}, 'gpu': {'used': 2.67, 'total': 23.69}, 'retries': 0, 'oom': 0} 15:53:55-137135 DEBUG gc: cuda {'ram': {'used': 4.97, 'total': 31.31}, 'gpu': {'used': 2.67, 'total': 23.69}, 'retries': 0, 'oom': 0} 15:53:55-138171 DEBUG Script postprocess-batch: Tiled Diffusion 15:53:55-138753 DEBUG Script postprocess-batch: Tiled VAE 15:53:55-139323 DEBUG Script postprocess-batch: Dynamic Thresholding (CFG Scale Fix) 15:53:55-139945 DEBUG Script postprocess-batch: Steps animation 15:53:55-140501 DEBUG Script postprocess-batch: ControlNet 15:53:55-144924 DEBUG Script postprocess-image: Tiled Diffusion 15:53:55-145532 DEBUG Script postprocess-image: Tiled VAE 15:53:55-146086 DEBUG Script postprocess-image: Dynamic Thresholding (CFG Scale Fix) 15:53:55-146702 DEBUG Script postprocess-image: Steps animation 15:53:55-147265 DEBUG Script postprocess-image: ControlNet 15:53:55-148973 DEBUG Saving image: JPEG outputs/image/00518-3294956492-.jpg (512, 512) 15:53:55-327335 DEBUG gc: cuda {'ram': {'used': 4.97, 'total': 31.31}, 'gpu': {'used': 2.67, 'total': 23.69}, 'retries': 0, 'oom': 0} 15:53:55-507236 DEBUG gc: cuda {'ram': {'used': 4.97, 'total': 31.31}, 'gpu': {'used': 2.67, 'total': 23.69}, 'retries': 0, 'oom': 0} 15:53:55-508410 DEBUG Script postprocess: Tiled Diffusion 15:53:55-508998 DEBUG Script postprocess: Tiled VAE 15:53:55-509554 DEBUG Script postprocess: Dynamic Thresholding (CFG Scale Fix) 15:53:55-510175 DEBUG Script postprocess: Steps animation 15:53:55-510803 DEBUG Script postprocess: ControlNet 15:53:55-512164 DEBUG Processed: 1 Memory: {'ram': {'used': 4.97, 'total': 31.31}, 'gpu': {'used': 2.67, 'total': 23.69}, 'retries': 0, 'oom': 0} img 15:53:55-704388 DEBUG gc: cuda {'ram': {'used': 4.97, 'total': 31.31}, 'gpu': {'used': 2.67, 'total': 23.69}, 'retries': 0, 'oom': 0} 15:53:59-708221 DEBUG Server alive: True Memory used: 4.97 total: 31.31

blacklig commented 1 year ago

Basically this part is what is being done after image stops generation and in Automatic 1111 it is being sent to client quite right after this (cannot run --debug mode there yet, maybe later, as it is kind of on production) and on yours we have this:

16:03:29-706287 DEBUG gc: cuda {'ram': {'used': 4.89, 'total': 31.31}, 'gpu': {'used': 2.53, 'total': 23.69}, 'retries': 0, 'oom': 0} 16:03:29-763195 DEBUG Server alive: True Memory used: 4.89 total: 31.31 16:03:30-325752 DEBUG gc: cuda {'ram': {'used': 4.9, 'total': 31.31}, 'gpu': {'used': 2.53, 'total': 23.69}, 'retries': 0, 'oom': 0} 16:03:30-326791 DEBUG Script postprocess-batch: Tiled Diffusion 16:03:30-327410 DEBUG Script postprocess-batch: Tiled VAE 16:03:30-327983 DEBUG Script postprocess-batch: Dynamic Thresholding (CFG Scale Fix) 16:03:30-328612 DEBUG Script postprocess-batch: Steps animation 16:03:30-329178 DEBUG Script postprocess-batch: ControlNet 16:03:30-333690 DEBUG Script postprocess-image: Tiled Diffusion 16:03:30-334312 DEBUG Script postprocess-image: Tiled VAE 16:03:30-334880 DEBUG Script postprocess-image: Dynamic Thresholding (CFG Scale Fix) 16:03:30-335518 DEBUG Script postprocess-image: Steps animation 16:03:30-336082 DEBUG Script postprocess-image: ControlNet 16:03:30-516918 DEBUG gc: cuda {'ram': {'used': 4.9, 'total': 31.31}, 'gpu': {'used': 2.53, 'total': 23.69}, 'retries': 0, 'oom': 0} 16:03:30-696881 DEBUG gc: cuda {'ram': {'used': 4.9, 'total': 31.31}, 'gpu': {'used': 2.53, 'total': 23.69}, 'retries': 0, 'oom': 0} 16:03:30-697940 DEBUG Script postprocess: Tiled Diffusion 16:03:30-698537 DEBUG Script postprocess: Tiled VAE 16:03:30-699112 DEBUG Script postprocess: Dynamic Thresholding (CFG Scale Fix) 16:03:30-700014 DEBUG Script postprocess: Steps animation 16:03:30-700867 DEBUG Script postprocess: ControlNet 16:03:30-881452 DEBUG gc: cuda {'ram': {'used': 4.9, 'total': 31.31}, 'gpu': {'used': 2.53, 'total': 23.69}, 'retries': 0, 'oom': 0}

which kinda adds up for that one second: 16:03:30-8 - 16:03:29-7 => 1.1 ... only after this output finishes only then my client gets the data from server, at least from what my eyes can see and how I feel it, might not be 100% accurate statement.

So if I break it down

kinda huge chunk is eaten right at beginning here:

+0.3s here

16:03:29-763195 DEBUG Server alive: True Memory used: 4.89 total: 31.31 16:03:30-325752 DEBUG gc: cuda {'ram': {'used': 4.9, 'total': 31.31}, 'gpu': {'used': 2.53, 'total': 23.69}, 'retries': 0, 'oom': 0}

and then also here:

+0.4s here:

16:03:30-336082 DEBUG Script postprocess-image: ControlNet 16:03:30-516918 DEBUG gc: cuda {'ram': {'used': 4.9, 'total': 31.31}, 'gpu': {'used': 2.53, 'total': 23.69}, 'retries': 0, 'oom': 0} 16:03:30-696881 DEBUG gc: cuda {'ram': {'used': 4.9, 'total': 31.31}, 'gpu': {'used': 2.53, 'total': 23.69}, 'retries': 0, 'oom': 0}

I have 2 control nets enabled in settings but that is not enabled during image generation so I did not thought it might be slowing down whole thing, seems it could.. but yet it is "only" 0.4s

and then again here almost at end:

+0.2s

16:03:30-700867 DEBUG Script postprocess: ControlNet 16:03:30-881452 DEBUG gc: cuda {'ram': {'used': 4.9, 'total': 31.31}, 'gpu': {'used': 2.53, 'total': 23.69}, 'retries': 0, 'oom': 0}

so this is basically what adds 0.9s and that last thing is what seems to me is also being processed on beginning. So maybe this together with control net somehow is adding to that time? What are those check ups? Or am I wrong here? Thanks for your quick support anyways!

vladmandic commented 1 year ago

try with UI -> Settings -> Compute Settings -> Disable Torch memory garbage collection (experimental) i'm going back and forth how often should torch gc be triggered. on my system, i'm perfectly fine with disabled. but there are soo many users running 4gb gpus that desperately need it.

btw, a1111 also does those script steps, it just doesn't report on them. those are callbacks that are auto-executed for each installed extension. the question is which extensions do you have installed in a1111? no harm in disabling controlnet (for example) here and see what's different.

vladmandic commented 1 year ago

btw, best to set sampler_name to explict Euler a for testing. UniPC is great, but it does have init/end overhead - its just better running actual steps. but if you're testing with steps=1, then that overhead can be misleading.

vladmandic commented 1 year ago

btw, you're not on latest - i can see debug output doesn't match. latest also includes additional info around prompt parsing.

this is my log with gc disabled:

overhead before generate : ~0.25
actual generate: ~0.25sec,
overhead after generate: ~0.05sec for total of 0.6sec.

i've tried disabling controlnet completely and it shaves ~0.2sec in preprocess. controlnet should not be spending that much time even if its not used, i'll raise it with author. but with controlnet disabled and gc disabled round trip is ~0.3sec - cant get much better than that.

10:15:58-962183 DEBUG    Script process: ['Tiled Diffusion', 'Tiled VAE', 'Dynamic Thresholding (CFG Scale Fix)', 'ControlNet', 'Additional networks for generating']
10:15:59-237140 DEBUG    Script before-process-batch: ['Tiled Diffusion', 'Tiled VAE', 'Dynamic Thresholding (CFG Scale Fix)', 'ControlNet', 'Additional networks for generating']
10:15:59-238115 DEBUG    Script process-batch: ['Tiled Diffusion', 'Tiled VAE', 'Dynamic Thresholding (CFG Scale Fix)', 'ControlNet', 'Additional networks for generating']
10:15:59-239847 DEBUG    Prompt schedule: [[1, 'foggy, blurry']]
10:15:59-240922 DEBUG    Prompt parse-attention: [['foggy, blurry', 1.0]]
10:15:59-263238 DEBUG    Prompt schedule: [[1, 'city at night']]
10:15:59-263995 DEBUG    Prompt parse-attention: [['city at night', 1.0]]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  7.56it/s]
10:15:59-551962 DEBUG    Script postprocess-batch: ['Tiled Diffusion', 'Tiled VAE', 'Dynamic Thresholding (CFG Scale Fix)', 'ControlNet', 'Additional networks for generating']
10:15:59-556921 DEBUG    Script postprocess-image: ['Tiled Diffusion', 'Tiled VAE', 'Dynamic Thresholding (CFG Scale Fix)', 'ControlNet', 'Additional networks for generating']
10:15:59-557709 DEBUG    Script postprocess: ['Tiled Diffusion', 'Tiled VAE', 'Dynamic Thresholding (CFG Scale Fix)', 'ControlNet', 'Additional networks for generating']

blacklig commented 1 year ago

Ok, I will try to get to your numbers. +0.3s on round trip is very nice, more when 0.2s is preprocess. I am having that 1s post proces which is really much.

With euler I had it before, did the same thing but indeed can be explicit with that as well..

How can I turn off control net with least damage? Just move it away from folders? And what is GC? That one is looking to add much time in my case as well..

Can I somehow create two instances of automatic on 1 machine? I tried that with older instance of Automatic 1111 and it ended up in big mess :) Somehow new instance still had old models even if that old instance was moved into different directory, it still somehow got messed up.. If I can easily create new instance of yours fork I might try that all in just very vanilla instance before messing up with what is already kind of running :)

vladmandic commented 1 year ago

How can I turn off control net with least damage?

don't move it from folder, its built-in so app will check for it. just go to Settings -> Extensions and deselect as enabled.

And what is GC?

When Torch is done with something, it marks memory as available, but still keeps it allocated as it may re-use it in the near future. GC is explict call to garbage collect all unused memory and return it to system. I run with GC disabled, but i have it enabled by default to avoid users creating issues.

Can I somehow create two instances of automatic on 1 machine

Sure. And you can use --models-dir to point it to models folder from other instance so they share the models.

vladmandic commented 1 year ago

anyhow, closing this issue as resolved since we got it to root cause, but feel free to post any further questions/comments.

blacklig commented 1 year ago

Did we? :) Issue still here :) mostly with that GC.. how come it is not happening in Automatic1111 probably? Cannot be torch 2.0 issue because on A1111 I have 1.3

How can I turn that GC off so I can try if I can get to smaller numbers without it? Because that might be only difference to original Automatic1111 implementation why they are that much faster so they have it off by default? Control net is for sure problem here as well, try to turn that of as well but it is sad having such big overhead just for having control net which wont be used every time but every generation has to "pay for it"

vladmandic commented 1 year ago

i said that few messages ago:

try with UI -> Settings -> Compute Settings -> Disable Torch memory garbage collection (experimental)

how come its not happening in a1111? also answered that.

i'm going back and forth how often should torch gc be triggered. on my system, i'm perfectly fine with disabled. but there are soo many users running 4gb gpus that desperately need it.

btw, it took a closer look, its both controlnet and multidiffusion, both equally responsible, each for 0.15-0.20sec. you can disable both if you're not using them.

blacklig commented 1 year ago

So, I basically disabled all extensions and mostly that garbage collector and WOW, that is speed now :D On Euler a 1 step is loke 0.14s :D impressive down from 1.3 :) and even that old A1111 which was "benchmark" is giving 0.3s

so thanks for your support, we really did dig what was the root cause and now it is working like F1 car.. But anyways, can this somehow fire back to me? Like memory will get somehow cluttered and it will crash sooner or later if this GC is turned off?

And by the way what is so bad about Euler a? On that 1 step is really faster than uni, but even on 25 steps is like 1.25s for Euler a, and 1.72 on Uni on my very primitive testing environment :) DPM++ 2M Karras 1.57.. so what is that superior about Uni that it is worth to "pay for it"? I personally like DPM++ 2M Karras most but honestly did not played much with Uni, but generated results seems to me on par so am I missing something important here?

vladmandic commented 1 year ago

Like memory will get somehow cluttered and it will crash sooner or later if this GC is turned off

There is higher risk - any memory leak will be more pronounced. Running GC frequently can "hide" those.

But memory leaks are not a general thing, it really depends on the workflow. For example, if you switch model -> use lora -> use controlnet -> switch model back -> use different lora. What did happen to memory used by first Lora? Hopefully its deallocated. Maybe its not.

If you use simple & well-defined workflows like using API without changing options, you're fine without GC.

but even on 25 steps is like 1.25s for Euler a, and 1.72 on UniPC

Nothing wrong with Euler A.

UniPC can typically get decent results in half of the steps - that's where difference comes from. Basically, it works completely differently - Euler A has "linear" steps. UniPC has init step and final step and then number of in-between steps can be much lower. (as such, UniPC will always run minimum of 3 steps, even if you set it to 1)

And second (to me important) reason is that UniPC was actively developed within this app while all other samplers are "external". So improving UniPC is possible (and there is active work on it) while improving any other sampler would require PR's to upstream repos where they come from.

blacklig commented 1 year ago

Wow, nice insights on UniPC. Will test it out, being able to reduce steps for achieving same quality is for sure good improvement.

Mostly I will use well defined work flow as you said so it will include some inpaints, some control nets etc but usually all kinda similar to get result so probably should be good. I see that with this GC off it doesnt free VRAM after job is finished but when new job starts memory immediately drops (or skips) to that necessary level so probably no issue with this so far.

Anyway - if you still want to spent time in this discussion - I noticed few for me interesting things as well.

If I change resolution for generation, first generation takes like I would say 2x much time than following. So probably some "cache" thing inside torch etc?

And also one interesting thing - for example 1024x1024 is way faster on 2.1 models than on 1.5.. that probably has to do something with 2.1 being trained on 768x768 but still interesting for me.. I would say pixels gets filled same way :) but they probably wont.. so I am probably I might even be switching to 2.1 here even though models are pretty poor in comparison to 1.5 but for my project that might not be that big of an issue

vladmandic commented 1 year ago

If I change resolution for generation, first generation takes like I would say 2x much time than following. So probably some "cache" thing inside torch etc?

not actual data cache, but execution branch cache. more specifically, there is a feature in torch that i use by default which allows torch to try couple of different things and then determine which is optimal path. at the end result is nearly the same, but it actually greatly increases compatibility for some ops for cards which would normally fail using default path.

when you change resolution, that branch cache is invalidated so needs to be recalculated. same if you change batch size for example.

blacklig commented 1 year ago

wow. At first I thought your for is just "different UI" and bit more polished dependencies but now it seems like you are kind of rewriting it from scratch :)

Anyways, from your knowledge, how efficient is Torch? Seen somewhere on the reddit opinion that it is utilising only about lets say 50% of what could be done with current GPUs but I really cannot tell if yes or not. Although Python is not any super performant, if I am not mistaken, Torch is written in C or C++ and has direct access to all low level GPU stuff and so on, so I am wondering whether we can get somewhere closer to max performance or if we are already there and not much can be done further instead of switching to newer GPU architectures etc..

vladmandic commented 1 year ago

torch is very efficient, but...it's only as efficient as its backend. torch is generates cuda code and then executes it on the desired backend. and only backend it has is cpu.

for the rest, think of torch as middleman.

torch with actual cuda will use cudnn to compile functions to optimal cuda code. and then use cuda to talk to gpu.
torch with rocm will use rocm to talk to gpu, but there is nothing to perform "ideal" compile, so it will use ops as-is.
basically rocm is pretending to be cuda without any deep optimizations.
torch with directml will take generated cuda code and then cross-translate it to directx, so you loose a lot.

now, cuda is old & stable general purpose gpu compute. which means it will use vector and shader pipelines of your gpu to their full potential. you're not leaving anything on the table here.

but new nvidia gpus (rtx 3000 and above) also have tensor pipeline (that's what's used in games for dlss and raytracing for example). and those pipelines are not used at all by cuda.

instead nvidia has its own framework - tensorrt. its possible to recompile entire sd to work on tensorrt and yes, its twice as fast. but its sooo much work and each model has to be re-compiled and nothing else works.

instead of tensorrt being a completely separate backend, i wish nvidia released tensorrt as an additional backend for cuda.

blacklig commented 1 year ago

Wow, where did you get such deep knowledge of all this low level stuff? :) Would love to understand it more as well.

That is really shocking for me that CUDA cannot utilise tensor cores. So does it mean that basically all what matters performance wise are shaders, vectors and gpu clock speed + ram size and ram speed? Not all those new features of new cards, tensor cores etc? Bud does nvidia really doubles even those "old" pipelines like vectors, shaders etc every generation? Because in between generations there is always like douboe the performance in SD, so where does this come from?

About that tensorrt it is very interesting to hear that. Do you think for example midjourney od stability ai etc has their own models running on this new backend utilising this?

And how do AMD cards stands if they even do not support cuda? Do they have some hypothetical chance of some universal framework being developed from which would nvidia and amd cards benefit for example?

And thanks for very interesting insights!

vladmandic commented 1 year ago

Bud does nvidia really doubles even those "old" pipelines

Yes, it does - if you think about it, game developers cannot rely on tensor cores since majority of gamers still don't have them. so tensor cores are mostly tasked with "secondary" things like ray-tracing and dlss (v2 and v3).

Do you think for example midjourney od stability ai etc has their own models running on this new backend utilising this?

Absolutely. Its already possible to compile SD to run on TensorRT. The problem is that ecosystem falls apart - forget about extensions, Loras, different fine-tuned models, etc. But I was running a hosted solution based on very few models and no such extras, for sure I'd optimize for TensorRT.

And how do AMD cards stands if they even do not support cuda?

unfortunately, ROCm development in AMD seems almost like an after-thought while CUDA development in nVidia engineering is a priority. Until AMD changes their internal priorities and starts considering ROCm a really important part of their ecosystem, it's always going to lag behind. But right now, AMD is primarily just a HW manufacturer. Also, AMD does not have datacenter presence - a lot of CUDA work done by nVidia is to support nVidia in datacenters and it only trickles down to general public.

blacklig commented 1 year ago

Your knowledge is really awesome.

I see your point about recompiling whole ecosystem. That is really something that would be quite difficult to overcome. like right now it is even problem to switch into 2.1 as everything is for 1.5 even though 2.1 is better because of higher res and if all those awesome models would get trained also for 2.1 it would be nice, but probably it is not going to happen any soon I think.

How difficult would it be for someone like you to recompile base model and those diffusers and basically wrap it into this web UI etc to fully utilise newest architectures? And what would be needed to recompile some models? Like would you need all training datas as was used when model was trained and basically rerun the proces again just on newer version which can utilise those RTs etc? Or is there any chance that old models would get for example somehow wrapped for some back compatibility and new models compiled in new versions would beat them but we still would be able to use older ones with low performance etc..

By the way, where did you get such deep knowledge of all this stuff? Is it as you are digging into this project so you actually decode what all those parameters do etc or you knew this before? Like if I am sending those API calls, I know only about like half the parameters in payload and rest is like "i kinda know but no, I dont know for sure how do they influence output" etc :)

blacklig commented 1 year ago

Anyways, I turned back control net (2 networks) and GC is still off and even now it is not adding any significant overhead in post, so wow..

100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 34.58it/s] 09:59:52-062457 DEBUG Script postprocess-batch: Steps animation 09:59:52-063303 DEBUG Script postprocess-batch: ControlNet 09:59:52-067844 DEBUG Script postprocess-image: Steps animation 09:59:52-068499 DEBUG Script postprocess-image: ControlNet 09:59:52-069175 DEBUG Script postprocess: Steps animation 09:59:52-069757 DEBUG Script postprocess: ControlNet

really like 5 ms max.. so I am super happy about this outcome so I can have some extensions loaded in my instance and yet it is super duper fast if I am not using them :)

vladmandic commented 1 year ago

How difficult would it be for someone like you to recompile base model and those diffusers and basically wrap it into this web UI etc to fully utilise newest architectures

Stuff like that is why I started this project. I'm hoping to get to it when issues quiet down.

By the way, where did you get such deep knowledge of all this stuff?

Hard to say, I follow a lot of things for a long time - so when I decide to do a deep-dive into something, its much quicker as general concepts are already known - things just "click". Btw, I haven't used torch and I haven't written a line of python code until ~6 months ago. To me they are tools, nothing more, nothing less. I don't like when ppl overly specialize so switching tools is a nightmare as concepts are different.

blacklig commented 1 year ago

Wow, happy to hear your future plans, cant wait for it. And inspiring to see your workflow!

I would like to leave this thread as it drifted away from initial issue so thanks again for your prompt help and hope to hear from you soon again, maybe in another thread :)

But is there for example place where I could ask more general questions about your awesome project? For example what I am struggling with now is how to "link" properties from UI to appropriate API calls.. :) Like for example I have this settings in UI:

but how do I know what params to set in API calls? They are totally not the same but I think it would be nice if they would get somehow standardised. like API payload is having things like this:

    "mask_blur": 8,
    "inpainting_fill": 1,
    "inpaint_full_res": false,
    "inpaint_full_res_padding": 32,
    "inpainting_mask_invert": 0,
    "initial_noise_multiplier": 0,

but those are generally different names. Like for example on UI "Inpaint area - Whole picture/Only masked" - how do I know what is its counterpart in API payload? :) How do I generally gain understanding of those?

Unfortunately UI is using web sockets to send those payloads which have also totally different structure than JSON api payloads, so cannot get much help from that as well.

But this is more general question which might have value for another people as well so if you can point me where I could raise such questions? Like here in issues with different tag? Like "help" ? :D Or github totally not for those?

Thanks :)

vladmandic commented 1 year ago

why not use github discussions for that? regarding api, yeah, its a nightmare. i have plans for v2 api, but again its going to take a while - too many things on my plate.

i've just added a feature, run webui --debug and it will print params as the function receives them from ui.

07:41:22-715599 DEBUG txt2img: id_task=task(mgkd6b79dlccpxd)|prompt=|negative_prompt=|prompt_styles='Default']|steps=20|sampler_index=6|restore_faces=False|tiling=False|n_iter=1|batch_size=1|cfg_scale=6|seed=3344295506.0|subseed=-1.0|subseed_strength=0|seed_resize_from_h=0|seed_resize_from_w=0|seed_enable_extras=False|height=512|width=512|enable_hr=False|denoising_strength=0.7|hr_scale=2|hr_upscaler=Latent|hr_second_pass_steps=0|hr_resize_x=0|hr_resize_y=0|override_settings_texts=[]

then you can use them in api.

blacklig commented 1 year ago

i've just added a feature, run webui --debug and it will print params as the function receives them from ui.

Wow, that is soooo useful! Thanks for that!

blacklig commented 1 year ago

So now with your super duper function seeing what params are going into SD, I discovered that img2img takes for example 3x some image input, like image, mask but also "init_img_with_mask" which I dont know what is.

Do you have more info about it or how could I crack what is inside, as I dont see into those PIL objects.. and regular API doesnt say anything about this input... For example this is what I am getting from by Web UI..:

id_task=task(l65t5uc4msneeyy)|mode=2|prompt=|negative_prompt=|prompt_styles=[]|init_img=<PIL.Image .Image image mode=RGBA size=896x768 at 0x7F83367AD510>|sketch=None|init_img_with_mask={'image': <PIL.Image.Image image mode=RGBA size=414x552 at 0x7F83367AFEE0>, 'mask': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=414x552 at 0x7F83367AEFB0>}|inpaint_color_sketch=None|inpaint_color_sketch_orig=None|init_img_inpaint=None|in it_mask_inpaint=None|steps=15|sampler_index=0|mask_blur=5|mask_alpha=0|inpainting_fill=0|restore_f aces=False|tiling=False|n_iter=1|batch_size=8|cfg_scale=5|image_cfg_scale=0|denoising_strength=0.6 9|seed=-1.0|subseed-1.0|subseed_strength=1|seed_resize_from_h=0|seed_resize_from_w=0|seed_enable_extras=False |selected_scale_tab=0|height=1024|width=768|scale_by=1|resize_mode=0|inpaint_full_res=0 |inpaint_full_res_padding=64|inpainting_mask_invert=0|img2img_batch_input_dir=|img2img_batch_outpu t_dir=|img2img_batch_inpaint_mask_dir=|override_settings_texts=[]

Although I am doing my best to send exactly those same images via API, I am not getting same results as from web UI and dont know why.. only what I am thinking of must be that init_img_with_mask parameter which I am not passing via API because I dont know what could be there..

vladmandic commented 1 year ago

honestly, i have no idea, i never used those (yet)

blacklig commented 1 year ago

Sure, no problem :)

One small thing - as you added that debug that shows settings that goes from Web UI, it works well for those tasks like img2img without any extra networks, but once for example control net is added, it doesnt show any more settings regarding to control net. It would be also very useful to see all those settings from another networks etc. Usually in JSON payload they are like whole new objects under I think "alwayson_scripts" which is part of that JSON payload.

/docs mentions this but only that "alwayson_scripts" and nothing more about it so I had to quite hack it to find out it probably should be sent. Anyways, if you can see something like that being passed and could show it all it would be great :)

This is part of normal payload to img2img for example activating control net, although this is small payload, no extra params to it, just basic sample:

"alwayson_scripts": { "controlnet": { "args": [ { "module": "canny", "model": "control_v11p_sd15_canny [d14c016b]" } ] } }

vladmandic commented 1 year ago

i've added those now, but you'll hate how it looks as args list is a long flat list and each extension registers from-index and to-index for its parameters.

blacklig commented 1 year ago

Nice, thanks a lot, but how do we actually use it?

I tried it and this is what I got after just basic depth map from control net:

so like above parameters was there even before, now I see some new stuff like "<controlnet.py.UiControlNetUnit object at 0x7f0041d78880>, <controlnet.py.UiControlNetUnit object at 0x7f0041d59f60>, False, False, 'positive', 'comma', 0, False, False, '', 1, '', [], 0, '', [], 0, '', [], True, False, False, False, 0, None, False, None, False, 50)"

but that I cannot really decode and know what parameter is what.. like that first part is also giving names of variables in json payload so it helps a lot what to send but with this I still dont know for example how to adjust for example those:

not even knowing what are default values sent if nothing is set as this model works nicely in UI but cannot get it run properly via API.. but only this "depth_leres". If I use just "depth" via api, it works OK... (interestingly, there is no "depth" model in my models list in UI :))

but is available via API get modules..

so not rally sure why I cannot chose it from web UI and what exactly is this preprocessor module though

vladmandic commented 1 year ago

using scripts/extensions via existing api is nightmare, i don't think i can shed more light than what i already have. eventually, i'll create v2 api which should be actually user friendly.

blacklig commented 1 year ago

hehe, alright :)

I was even thinking is using API actually necessary? Or is it much pain in ... to try to go directly to diffusers? Like that web UI is using websockets instead of api, so use those for example? Or is that even worse idea? :) There is for sure not any docs, or at least nothing I would know of..

vladmandic commented 1 year ago

its the same thing - ui populates that flat list.

blacklig commented 1 year ago

seems to me web UI is using web sockets which are communicating with backend using different format of JSON. Like most of the params goes in some "data" array where at least for me I dont know what all those values refer to. but I could at least reverse engineer them to see what they refer to and actually make proper calls .. but it looks like way more hacking that I would like to do :)

vladmandic commented 1 year ago

transport layer is irrelevant - at the end its still a FLAT LIST.

blacklig commented 1 year ago

yeah..

but at the end of the day someone has to know what are all those params. Like if authors of control net release new model and you want to implement it, then you should know what parameters this model expects so you can pass them from web ui, dont you? So in such case dev has to tell you or where would you refer to? At least for control net, their git page is not telling much about those as well.. and like 3 weeks ago they released new tiled model, dont know if you already implemented it somehow or not, but that would be the case, isnt it?

Michaelvirga commented 1 year ago

Does A1111 have a similar setting to disable torch garbage collection? I know this is for the vlad repo, but I'm seeing wildly different performance for controlNET when calling the API directly vs using the A1111 UI itself.

vladmandic commented 1 year ago

Does A1111 have a similar setting to disable torch garbage collection? I know this is for the vlad repo, but I'm seeing wildly different performance for controlNET when calling the API directly vs using the A1111 UI itself.

nope.

vladmandic / automatic

[Issue]: Very slow API response in comparison to Automatic1111 #1000

Issue Description

Version Platform Description