t0rakka / mango

Other
59 stars 10 forks source link

math::maskToInt does not gracefully fallback if AVX512 is unavailable #11

Closed tomysshadow closed 2 hours ago

tomysshadow commented 5 hours ago

Throughout this project, I have seen that there are CPUID checks in place in order to gracefully fallback if a particular instruction set is not available. For example, in jpeg_encode.cpp, there are snippets of code like the following:

#if defined(MANGO_ENABLE_AVX512)
        if (flags & INTEL_AVX512BW)
        {
            encode = encode_block_avx512bw;
            encode_name = "AVX512BW";
        }
#endif

With that said - when I attempt to construct an ImageDecoder object, it calls math::maskToInt and I get an illegal instruction error.

AVX512

The code that triggers this:

mango::ConstMemory constMemory(pointer, (size_t)inputFileSize);

// is there a decoder for this image format available?
mango::image::ImageDecoder imageDecoder(constMemory, "png");

if (!imageDecoder.isDecoder()) {
    return false;
}

Call stack:

>   ImageTest.exe!mango::simd::compare_eq(mango::simd::hardware_vector<unsigned char,16,__m128i> a, mango::simd::hardware_vector<unsigned char,16,__m128i> b) Line 222  C++
    ImageTest.exe!mango::math::operator==(mango::math::Vector<unsigned char,16> a, mango::math::Vector<unsigned char,16> b) Line 140    C++
    ImageTest.exe!`anonymous namespace'::getImageFormatExtension(mango::detail::Memory<unsigned char const> memory) Line 78 C++
    ImageTest.exe!mango::image::ImageDecoder::ImageDecoder(mango::detail::Memory<unsigned char const> memory, const std::string & filename) Line 273    C++

I obviously understand what has happened: in CMake I created the project with ENABLE_AVX512, my CPU does not support AVX512 and so it runs an illegal instruction. Indeed, if I create the project without ENABLE_AVX512, the issue goes away - for me, at least. However, the existence of flag checks elsewhere in the code leads me to believe that the library is intended to work regardless of the CPU's feature set and to gracefully fallback to whichever is the best available (because otherwise, why would it ever bother checking if AVX512 is supported?) So I would imagine that there is meant to be a call to getCPUFlags() somewhere before this point that is not happening, probably near the MATH_SIMD_COMPARE_FUNCTIONS macro (which I am not familiar enough with the codebase to know how that should change.) In which case, this is probably a bug.

If on the other hand I am to assume this is actually intended behaviour, is there any mechanism by which I could ensure that the library does gracefully fallback to a supported instruction set?

This error occurs when compiling for Debug x64, on Windows 10 Version 22H2, with an AMD Ryzen 7 3700X.

t0rakka commented 4 hours ago

This is a bit tough nut; it's working "as intended", when you enable AVX-512 for example it means the compiler WILL use the ISA, even if I guard agains it clang and gcc are free to generate AVX-512 instructions AT ANY TIME ANYWHERE. Visual C++ doesn't vectorize as aggressively, it used to not use the instructions much if at all unless use the intrinsics.

For example, the simd namespace picks up the SIMD ISA from core/configure.hpp where there is various logic for different compilers how the simd instructions are selected. MSVC and GCC/CLANG have slightly different logic, for example clang/gcc request avx with -mavx compiler flag, then it sets the AVX macro, which the configure.hpp detects and determines that these instructions are alright to generate in the simd code.

Some of the codecs are more hardened and check the feature flags (cpuid) when the code generation is enabled, but in broader sense if you enable AVX it means you need AVX in runtime. It's just checked more rigorously in the image codecs but still can't block illegal instruction if the CPU doesn't have the feature in runtime.

So we come back to the beginning: it's a tough nut. Best practice is to determine minimum requirements and enable those in the build environment to have a working base. Then you unfortunately must have specific build for any enhancements, that sucks I know but it's really difficult situation to handle in portable way with different tool-chains (compilers) doing their own things.

One way is to have a loader that dynamically loads the implementation for various architectures.

This is a bit awkward way to do things but is the way that is bullet-proof if you want multi-isa binaries that just work. Think it this way: once you compile something with a feature flag enabled, it pollutes the whole translation unit. There is no way I can guarantee that if you enable AVX-512, these instructions are NOT generated by (some) compiler.

What you want (must?) do is have the cpuid feature check and THEN load anything that has been compiled with the feature if you want to do this in runtime. Nothing else can work deterministically. If you enable something, the CPU must have it. Early feature check and exit should work more or less, before C++ runtime calls your main function all that code should be safe no matter what flags you used as it comes from pre-compiled runtime libs.

One thing that might make programming life easier is to decorate the library names with the specific ISAs that were used to compile them: libmango-avx.so/dll, and so on but that needs to be decided. If you want to be a test subject we can make a branch where we try out different strategies to improve the quality of programming life. :D

t0rakka commented 4 hours ago

One thing that is a decent baseline is 64 bit x86 and aarch64. The NEON64 is always available in aarch64, and SSE2 is always available in x86_64 so 64 bit ARM and INTEL have a fairly decent SIMD as default that works everywhere.

Then you can enable more fancy extensions with more complicated means. The AVX gives a bit of extra boost as it doubles the number of available registers and makes them wider, but on downside the widened registers are split into low and high halves which don't communicate to each other very flexibly.. it's like they glued two 128 bit registers together. In that sense AVX isn't that amazing but it compiles into binaries that spill less so that's always a nice thing.

The biggest bang for buck is moving from SSE2 to SSE 4.1 as that makes some really nice instructions available that we actually take advantage of in the JPEG and PNG codecs. I feel that SSE 4.1 is a pretty decent baseline as it's been available since 2006, almost two decades I think that is pretty safe. A steam survey few years ago had the availability at nearly 100% so I feel it's quite comfortable baseline and brings most bang for the buck.

AVX is luxury, AVX2 is overhyped. AVX-512 is super flexible but only becoming available mainstream on AMD, I used i9 10-core CPU back in 2017 to write the support, it was consumer version of Xeon CPUs. AVX-512 is absolutely joy to program with but didn't gain traction like I thought it would.. Intel still to date doesn't support it in their consumer CPUs.. they went with the big-little design instead which meant they couldn't include it. If the simple cores don't have the instructions and thread migrates from performance core to simple core we would immediately got illegal instruction exception. They could have had trap and force the thread to performance core but chose to disable the whole thing instead. Anyways, long story short: 64 bit is already comfy target, and if you can the SSE4.1 is a great minimum requirement.

I must say the 32 bit support is a bit spotty, think of memory mapping 8 GB file how that going to work with 32 bit CPU? Largest continuous address space you going to get won't be more than 2 GB in practice, if even that. So the mango::filesystem::File is completely useless unless read smaller files like < 200 MB or so. You want to use the InputFileStream and seek+read, read what you need into buffer then, say, use the ImageDecoder interface on that buffer. I could have written the "container" system (zip/rar/etc abstraction) to also be streamable so that could use it with streaming files as well but it would be work for.. how would I say it.. obsolete systems.. I think 64 bit is way forward and the limitations of 32 bit address space can be worked around in most practical use cases.

t0rakka commented 3 hours ago

So background story sorted out, the answer to the question is that when compiled for AVX512 the SIMD code will use AVX512; this can't be checked for each simd intrinsic function call as it would defeat the purpose (be extremely slow) and then the compiler would screw us anyway when it's enabled by littering these instructions where it wants; memcpy could be using avx512! =(

This is where JIT compiled languages have advantage over c++.. the code is generated (and cached) in runtime, the JIT compiler uses what the CPU has. I added WASM SIMD (WebAssembly SIMD) back-end in the past week, what a timing.. so when using Emscripten to compile mango it generates simd code, that's cool. The not-so-cool part is that the atomic support is a bit sketchy with threading with Emscripten so with one version it works, then might break on next update and work again half a year later. Either way, added the support because it felt like something fun to do.

tomysshadow commented 2 hours ago

I am perfectly fine with the idea of having multiple versions of my DLL for each different feature set. However, from my perspective, it would be really nice then if mango::getCPUFlags() lived in a separate lib (similar to mango-import3d/mango-opengl,) because it feels a bit redundant to need to rewrite or bring on another library in order to detect the available features, but I obviously can't use that function as-is (because then the rest of the library must be loaded along with it already, when I want to use it to determine what version to load.) It'd be nice if I were able to call getCPUFlags in a "mango-cpuinfo.lib" (compiled without enhanced instruction set) and use that to load the appropriate version of my DLL (compiled with mango, with enhanced instructions.) And then obviously this other lib could also be referenced by and used by the main mango.lib itself. Would it be possible to structure this way?

t0rakka commented 2 hours ago

It would be slightly complicated solution and there is one target that has dictated a lot of the design, mainly, macOS.. the xcode .framework files are Big Fat Libraries and that has a bit forced my hand in design; on Xcode you can drag and drop the mango.framework file into your custom project and all libraries and headers magically become available.

I think it's reasonable to just copy-paste or have standalone version of the file with all extra crap removed for this special use case. If I figure out a clean solution meanwhile I will definitely take necessary actions.

t0rakka commented 2 hours ago

Speaking of macOS being annoying, they just upgraded their command line tools and code stopped compiling, committed a fix literally 5 minutes ago. APPLE <3 !

tomysshadow commented 2 hours ago

Alright. Well, I'll just figure something out then and see how it goes, thanks for the advice

t0rakka commented 2 hours ago

I'd drop the macOS support already but their M1..M2.. etc. laptops are just the best, can't help using them.. they are THE greatest ARM development box ever, used to have Raspberry Pi's, odroid's, even Android phones where I install my stuff as APK's just to get some ARM action. Apple Silicon laptops just make it so easy and fun that I can't resist.. the downside is that get a bit unrealistic view of ARM performance (they are competitive with 64 bit x86!). ;) <-- they were overhyped at launch the reality is they just catched up with Intel, but with less power so finally can do something with arm.. they used to be a bit rubbish to be honest..

t0rakka commented 2 hours ago

sorry about the opinionated piece.. anyways, we're not ruling out something for the architecture/isa/cpuid mess.. you got me thinking about this again so that's good