painebenjamin / app.enfugue.ai

ENFUGUE is an open-source web app for making studio-grade images and video using generative AI.
GNU General Public License v3.0

RTX3060 user here. Which is best for me? #87

Open ikmalsaid opened 9 months ago

ikmalsaid commented 9 months ago

Hi, this is Ikmal. I'm quite fascinated by your project and I'm genuinely curious about the distinctions between utilizing CUDA alone versus CUDA combined with TensorRT versions. Can I employ identical models across both approaches? Is SDXL compatibility available?

Thank you, and stay awesome!

painebenjamin commented 9 months ago

Hello Ikmal, thank you for the kind words!

To answer your specific situation: if your 3060 has at least 12 GB of VRAM (I believe most do, but I'm not positive they all do), you want the absolute fastest performance you can get, and you're willing to sit through the 15-20 minutes it takes to compile each engine, then go for the TensorRT package. If you don't want to commit to a style like that and would rather swap models often, then go for the CUDA package.
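If you're not sure which 3060 variant you have, a minimal check like this (just a sketch, assuming a working PyTorch install with CUDA) will report the usable VRAM:

```python
# Quick VRAM check before picking a package (sketch, not ENFUGUE code).
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / (1024 ** 3)
    print(f"{props.name}: {vram_gb:.1f} GB VRAM")
    if vram_gb >= 12:
        print("TensorRT package is a reasonable fit")
    else:
        print("CUDA-only package is probably the safer choice")
else:
    print("No CUDA device detected")
```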

SDXL isn't enabled for TensorRT at all yet, so when you're using the TensorRT package and switch to XL, you will simply go back to normal torch inference. For reasons that you may or may not care to read about, XL will be faster in the CUDA package than in the TensorRT package, so if you plan to use XL very frequently, then you probably also want the CUDA package.

All the same models (.ckpt, .safetensors, .pt, etc.) are used in both versions. However, TensorRT has its own .plan files - these are what take so long to compile. They are remarkably specific to your hardware, and I'm not aware of anyone who distributes pre-compiled versions - I'm not sure we will ever be able to distribute them, for much the same reason games don't ship pre-compiled shaders and instead have the user compile them on their own hardware.
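To give a rough idea of what that long compile step is doing, here's a minimal sketch of building a .plan from an ONNX export with the TensorRT Python API. This is not ENFUGUE's actual code, and the "unet.onnx" path is just a placeholder for illustration:

```python
# Sketch: building a hardware-specific TensorRT engine (.plan) from an ONNX model.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("unet.onnx", "rb") as f:  # hypothetical ONNX export of the model
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # half precision on RTX cards

# This call is the 15-20 minute part: TensorRT times candidate kernels on
# *your* GPU and bakes the winners into the plan, which is why plans don't
# transfer between different cards or driver versions.
engine_bytes = builder.build_serialized_network(network, config)
with open("unet.plan", "wb") as f:
    f.write(engine_bytes)
```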


Now, to answer the longer bit about why: the TensorRT support is a bit of a red herring; the primary difference between the two packages is the version of CUDA - 11.7 for the TensorRT package, and 12.1 for the non-TensorRT package.

The reason for that is complicated. Wrangling CUDA versions across different architectures is a many-headed hydra. Code compiled for later CUDA versions is generally backwards-compatible with earlier ones, but when you're trying to push the boundaries of performance (as TensorRT is), hardware-specific optimizations start to matter a great deal. Some of that optimization happens automatically, but some has to be done by hand, and I only have access to so many kinds of hardware - and so much time to tune for them - that I had to make some limiting choices just to ship a product that would give maximum compatibility.

So I tuned for 11.7, which works very well on 3000-series desktop cards and A100s in the cloud. 4000-series cards and newer cloud hardware work well too, though not quite as well as they could if I had time to tune for them. The catch is that for all of my tuning on 11.7, if you simply upgrade your environment to 12.1, it's all for naught and TensorRT no longer gives the performance boost it did before. I am certain there is a way to overcome this, but TensorRT 9.0 is right around the corner - when that is released, I will revisit this issue and see if I can consolidate into a single release. I know they have put a significant amount of time into Stable Diffusion specifically, so it may be much improved by then.
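If you ever want to confirm which CUDA runtime your environment is actually running (the core difference between the two packages), a quick check like this works; the version strings in the comments are just examples:

```python
# Sketch: report the CUDA runtime and GPU compute capability in use.
import torch

print("Torch:", torch.__version__)
print("CUDA runtime torch was built with:", torch.version.cuda)   # e.g. "11.7" or "12.1"
print("Compute capability:", torch.cuda.get_device_capability(0))  # (8, 6) on an RTX 3060
```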

If you aren't using TensorRT, then I don't need to do all this manual tuning myself, and I can let more automatic tuning take place. I can also switch to the latest version of Torch (2.2), which comes with significant speed boosts of its own, so all in all it made sense at launch to freeze one package at 11.7 with TensorRT support, and let the other package get the latest and greatest that CUDA and Torch have to offer.
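As an illustration of the kind of "automatic" tuning the CUDA-only package can lean on with a newer Torch, here's a rough sketch using diffusers - again, not ENFUGUE's actual pipeline code, and the model ID is just an example:

```python
# Sketch: letting Torch 2.x do the optimization work instead of a pre-compiled plan.
import torch
from diffusers import StableDiffusionPipeline

torch.backends.cudnn.benchmark = True  # let cuDNN pick the fastest kernels at runtime

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# torch.compile (Torch 2.x) JIT-optimizes the UNet with no per-GPU hand tuning,
# which is the trade-off against a hardware-specific TensorRT plan.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead")

image = pipe("a photo of a lighthouse at dusk").images[0]
image.save("out.png")
```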

Sorry, that post got really long! It's a complicated issue, and I don't think there is one elegant solution for it, at least not right now. I know Nvidia is working on tools to help programmers better manage these dependencies, but the focus there is clearly on enterprise rather than home use, so we may need to rely on community benchmarks and shared knowledge to arrive at one solution that works for everyone.